<<

A SIMD APPROACH TO LARGE-SCALE REAL-TIME SYSTEM AIR TRAFFIC CONTROL USING AN ASSOCIATIVE PROCESSOR AND CONSEQUENCES FOR PARALLEL COMPUTING

A dissertation submitted to Kent State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

by

Man Yuan

December 2012

Dissertation written by

Man Yuan

B.S., Hefei University of Technology, China 2001

M.S., University of Western Ontario, Canada 2003

Ph.D., Kent State University, 2012

Approved by

Dr. Johnnie W. Baker, Chair, Doctoral Dissertation Committee

Dr. Lothar Reichel, Members, Doctoral Dissertation Committee

Dr. Mikhail Nesterenko

Dr. Ye Zhao

Dr. Richmond Nettey

Accepted by

Dr. Javed I. Khan, Chair, Department of Computer Science

Dr. Raymond Craig, Dean, College of Arts and Sciences

TABLE OF CONTENTS

LIST OF FIGURES ...... vii
LIST OF TABLES ...... ix
Acknowledgements ...... x
1 Introduction ...... 1
2 Background Information ...... 7
2.1 Flynn's Taxonomy and Classification ...... 7
2.1.1 Multiple Instruction Stream Multiple Data Stream (MIMD) ...... 7
2.1.2 Single Instruction Stream Multiple Data Stream (SIMD) ...... 8
2.2 Real-Time Systems ...... 10
2.3 Air Traffic Control (ATC) ...... 13
2.3.1 Task Characteristics ...... 14
2.3.2 The Worst-Case Environment of ATC ...... 14
2.3.3 ATC Tasks ...... 15
2.4 An Associative Processor for ATC ...... 16
2.5 OpenMP ...... 18
2.5.1 Avoid False Sharing ...... 20
2.5.2 Optimize Barrier Use ...... 21
2.5.3 Avoid the Ordered Construct ...... 22
2.5.4 Avoid Large Critical Regions ...... 22
2.5.5 Maximize Parallel Regions ...... 22
2.5.6 Avoid Parallel Regions in Inner Loops ...... 23
2.5.7 Improve Load Balance ...... 24
2.5.8 Using Compiler Features to Improve Performance ...... 24
3 Survey of Literature ...... 26
3.1 Previous Work on ATC ...... 26
3.2 Classical MIMD Real-Time Scheduling Theory ...... 29
3.3 Other Similar Work ...... 29
4 AP Solution to ATC ...... 31
4.1 Overview of AP Solution ...... 31
4.2 Advantages of AP over MIMD for ATC ...... 31
4.3 AP Properties ...... 33
5 Emulating the AP on the ClearSpeed CSX600 ...... 37
5.1 Overview of ClearSpeed CSX600 ...... 37
5.2 Emulating the AP on the ClearSpeed CSX600 ...... 39
6 ATC System Design ...... 41
6.1 ATC Data Flow ...... 41
6.2 Static Scheduling ...... 41
6.3 Scaling up from CSX600 to a Realistic Size AP ...... 43
7 AP Solution for ATC Tasks Implemented on CSX600 ...... 46
7.1 Report Correlation and Tracking ...... 46
7.2 Cockpit Display ...... 52
7.3 Controller Display Update ...... 52
7.4 Automatic Voice Advisory ...... 53
7.5 Sporadic Requests ...... 53
7.6 Conflict Detection and Resolution (CD&R) ...... 53
7.6.1 Conflict Detection ...... 53
7.6.2 Conflict Resolution ...... 57
7.7 Terrain Avoidance ...... 58
7.8 Final Approach (Runways) ...... 60
7.9 Comparison of AP and MIMD Solutions for ATC Tasks ...... 61
7.10 Implementation of Issues ...... 63
8 Experimental Results ...... 65
8.1 Experimental Setup ...... 65
8.2 Experiment Results ...... 67
8.2.1 Comparison of Performance on STI, MIMD and CSX600 ...... 67
8.2.2 Comparison of Performance on MIMD, CSX600 and STARAN ...... 68
8.2.3 How close to the AP is our CSX600 emulation of the AP ...... 70
8.2.4 Comparison of Predictability on CSX600 and MIMD ...... 72
8.2.5 Comparison of Missing Deadlines on CSX600 and MIMD ...... 73
8.2.6 Timings for 8 Tasks ...... 73
8.2.7 Video and Actual STARAN Demos ...... 76
9 Observations and Consequences of the Preceding Results ...... 78
10 Conclusions and Future Work ...... 83
BIBLIOGRAPHY ...... 89

LIST OF FIGURES

1 SIMD Model ...... 9
2 AP Architecture ...... 17
3 Fork-Join Task Model ...... 19
4 A Possible AP Architecture for ATC ...... 34
5 CSX600 accelerator board ...... 38
6 MTAP architecture of CSX600 ...... 38
7 Static schedule of ATC tasks implemented in AP ...... 43
8 Track/Report correlation ...... 48
9 Search box size for flight maneuver ...... 50
10 Conflict detection ...... 56
11 Timing of tracking task ...... 68
12 Timing of terrain avoidance task ...... 68
13 Timing of CD&R task ...... 69
14 Comparison of tracking task of MTI, CSX600 and STARAN ...... 69
15 Comparison of terrain avoidance task of MTI, CSX600 and STARAN ...... 70
16 Comparison of CD&R task of MTI, CSX600 and STARAN ...... 70
17 Comparison of display processing task of MTI, CSX600 and STARAN ...... 71
18 Time of tracking for number of aircraft from 10 to 384 ...... 71
19 Time of CD&R for number of aircraft from 10 to 384 ...... 72
20 Comparison of MTI, STARAN and Super-CSX600 with at most one aircraft per PE ...... 72
21 Predictability of execution times for report correlation and tracking task ...... 73
22 Predictability of execution times for terrain avoidance task ...... 73
23 Predictability of execution times for CD&R task ...... 74
24 Number of iterations missing deadlines when scheduling tasks ...... 74
25 Overall ATC System Design ...... 98

LIST OF TABLES

1 Timings of Associative Functions ...... 40
2 Static Schedule for ATC Tasks ...... 42
3 Statically Scheduled Solution Time for Worst Case Environment of ATC ...... 45
4 Data Structures for Radar Reports ...... 48
5 Data Structures for Tracks ...... 49
6 Comparison of Required ATC Operations ...... 62
7 Performance of One Flight/PE ...... 74
8 Performance of Eight Tasks for Ten Flights/PE ...... 75

Acknowledgements

I would like to sincerely thank my advisor, Dr. Johnnie W. Baker, for his constant

support and encouragement throughout my Ph.D. study. He has been the advisor for

my research work throughout these years and also the mentor for my professional ca-

reer development. Without his encouragement and guidance, this dissertation would be

impossible.

My special thanks go to the many people who have helped me at various stages of this work: Will Meilander, Frank Drews, Sue Peti, Marcy Curtiss, and others. In particular,

Professor Will Meilander has contributed valuable information for my dissertation. He has read through all my publications and provided many useful comments.

I would also like to thank my dissertation committee members for their willingness to serve on the committee and their valuable time.

On a personal note, I would like to express my gratitude to my parents Zhen Yuan and Xiaohang Xu for their love and emotional support. Finally, I would like to express my thanks to my girlfriend Sijin Zhang for her moral support and for the incredible amount of patience she had with me so that I could eventually realize my dream.

CHAPTER 1

Introduction

The Air Traffic Control (ATC) system is a typical real-time system that continuously

monitors, examines, and manages space conditions for thousands of flights by processing

large volumes of data that are dynamically changing due to reports by sensors, pilots, and

controllers, and gives the best estimate of position, speed and heading of every aircraft in

the environment at all times. The ATC software consists of multiple real-time tasks that

must be completed in time to meet their individual deadlines. By performing various

tasks, the ATC system keeps timely correct information concerning positions, velocities,

contiguous space conditions, etc., for all flights under control. Since any lateness or

failure of task completion could cause disastrous consequences, time constraints for tasks

are critical. [1–3]

In the past, solutions to this problem have been implemented on multicomputer systems with records for various aircraft stored in the memory accessible to the processors in this system. The dispersed nature of this dynamic ATC system and the necessity of maintaining data integrity while providing rapid access to this data by multiple instruction streams (MIS) increases the difficulty and complexity of handling air traffic control.

It is difficult for these MIMD implementations to satisfy reasonable predictability standards that are critical for meeting the strict certification standards needed to ensure safety for critical software components. Massive efforts have been devoted to finding an efficient MIMD solution to the ATC problems since 1963 but none of them has been

satisfactory yet [4–7]. To a large degree, this is due to the fact that the software used by

MIMDs in solving real-time problems involves solutions to many difficult problems such

as dynamic scheduling, load balancing, etc [6]. It is currently widely accepted that large

real-time problems such as ATC do not have a polynomial time MIMD solution [8, 9].

Even though many researchers have put extensive efforts into producing a reasonably

good (e.g., heuristic) solution to the ATC problem, the current ATC

system has repeatedly failed to meet the requirements of the ATC automation system.

It has frequently been reported that the current ATC system periodically loses radar data

without any warnings and misses tracking of aircraft in the sky. This problem has even

occurred with Air Force One. [3–5,7]

Instead of using the traditional MIMD approach, we implement a prototype of the

ATC system on an enhanced SIMD hardware system called an associative processor (AP) or associative SIMD, where interactions are much simpler and more efficiently controlled, due in large part to avoiding the need to coordinate the interactions of multiple instruction streams. The AP supports some additional constant time operations, e.g., broadcasting, associative searches, AND/OR and maximum/minimum reductions (assuming word length is a constant) [4,10], details of which will be illustrated in Section 2.4. It is shown that an efficient polynomial time solution to a real-time problem for Air Traffic Control can be obtained [3–7]. In addition, associative SIMD computers have been built which support the AP model, can support this polynomial time solution to ATC, and meet the required deadlines. The first AP system was designed by Kenneth Batcher and implemented in the Goodyear Aerospace STARAN computer, which was specifically designed to support ATC [11–13]. A second generation STARAN, called the ASPRO [14],

also an AP designed at Goodyear Aerospace, was used extensively by the Navy for Air-

borne Air Defense Systems applications for many years [15]. In addition, normal SIMDs

cannot handle ATC because they cannot input and output radar sensor data efficiently.

STARAN and ASPRO overcame this I/O limitation with the flip network of the Multidimensional Access (MDA) memory; details are in Section 2.4.

We use ClearSpeed CSX600 to emulate an AP and implement eight key ATC tasks for our ATC prototype, namely report correlation and tracking, cockpit display, controller display update, sporadic requests, automatic voice advisory, terrain avoidance, conflict detection and resolution (CD&R) and final approach (runway optimization). The se- lection of the specific tasks was based on information in the following references: FAA

Grants for Aviation Research Program Solicitation [16], FAA’s NextGen Implementation

Plan [17], and the FAA 1963 ATC Specifications [18]. Reference [18] indicates that the tasks we selected were similar to the ones selected in 1963 for implementation by industries interested in competing for the job of implementing ATC for the FAA. Reference [16] describes the FAA's efforts to improve the aircraft capacity of the airspace while maintaining high safety standards and aircraft safety technology for conflict detection and resolution. These indicate that tasks of the type we have selected are key to ATC implementation. Further, the NextGen plan [17] discusses the new standards for ATC, including developing capabilities in traffic flow management, dynamic airspace configuration, separation assurance, super density operations, and airport surface operations. The purpose is to achieve a safe, efficient and high-capacity airspace system. The following subtopics in [17] address current research that needs improvement: comprehensive analysis of uncertainties in the National Airspace System; air traffic management functional allocation using advanced computing and networking; and guaranteeing air-to-air and air-to-ground safety. The type of activities discussed there will require the use of tasks similar to the ones that we have selected for implementation, which shows that our eight ATC tasks have already captured most of the workload of ATC, and nothing major is missing that may impact ATC performance. In our careful search of the literature on ATC, the tasks that kept recurring in the publications we found were all included in the tasks we chose for our prototype to use for benchmarking.

Several of our previous papers [4–6,19] have used the AP to manage ATC computation for an ATC sector. Similar SIMD parallel approaches have been used for collision avoid- ance algorithms between multiple agents for real-time simulations in S.Guy et. al. [20].

We used ClearSpeed CSX600 to emulate an AP in our previous work [21]. However, the assumed maximum number of aircraft being tracked is 4000 IFR (instrument flight rules) aircraft and 10000 VFR (visual flight rules) aircraft, for a total of 14000 aircraft [21].

Because the emulation tool CSX600 can only handle a small number of aircraft, an ideal AP system would have to be a lot larger than the CSX600. Our paper [21] implemented two ATC tasks, i.e., report correlation and tracking and conflict detection and resolution (CD&R), on the CSX600, and compared the performance of only one task, report correlation and tracking, on both the CSX600 and a MIMD. However, in the comparison tests, only one task was executed repeatedly, without any competition from other tasks. Also, no adjustment was included for the greater combined computational speed of the MIMD. In our most recent papers [22,23], we not only implement the 8 tasks on the CSX600, but also schedule them on both the CSX600 and an 8-core MIMD, with the fastest host-only version implemented using OpenMP [24].

Our results show that some deadlines are regularly missed by these tasks when implemented on a MIMD, while the AP implementation meets all deadlines for the tasks that can be statically scheduled. The results indicate that the AP emulation has much better efficiency and predictability than the MIMD implementation. Moreover, the proposed AP solution will support accurate predictions of worst case execution times and will guarantee that all deadlines are met. In contrast, a MIMD can only support average case timings, not worst case timings, and cannot guarantee that all deadlines are met all the time. Also, the software used by the AP is much simpler and smaller in size than the current corresponding ATC software, as the MIMD software must also support additional activities such as dynamic scheduling, load balancing, etc., which are not needed by the AP. The solution used on the AP is very similar to a sequential solution for this problem, both in style and in code size. As a result, the size of this software solution is only a very small fraction of the size of the MIMD solutions to the ATC problem. This results in a dramatic drop in both the cost of creating and the cost of maintaining this software when compared to the MIMD solutions that have been given to the ATC problem. An important consequence of these features is that the V&V (Validation and Verification) process will be considerably simpler than for current ATC software. Moreover, the AP hardware is considerably cheaper, simpler, and easier to build than the MIMD hardware currently used to support ATC. Furthermore, the AP solution to ATC can be applied to many other real-time problems, dynamic database problems, Unmanned Aircraft Systems (UASs), and many other applications. Thus we should consider SIMD solutions for the many problems for which SIMD or AP solutions can be simpler and more efficient than MIMD solutions. A large portion of our research has been

presented in our most recent journal paper [23].

This dissertation is organized into ten chapters. First, Chapter 2 introduces background information and terminology used throughout this dissertation. Section 2.1 introduces the two most important parallel architectures: MIMD in Section 2.1.1 and SIMD in Section 2.1.2. Section 2.2 introduces the terminology of real-time processing. Section 2.3 describes the air traffic control (ATC) problem, including worst case assumptions, constraints of the system and examples of ATC tasks. Section 2.4 gives an overview of the associative processor (AP) and its properties, such as the constant time associative operations. The basics of OpenMP and techniques for improving the performance of OpenMP programs are introduced in Section 2.5. Second, Chapter 3 gives a survey of previous related research on ATC and a comparison of its pros and cons. Chapter 4 gives an overview of the AP solution to ATC, discusses the advantages of the AP over MIMD, and discusses some issues concerning the properties of the AP. Chapter 5 gives an overview of the CSX600 architecture and programming concepts, and discusses emulating the AP on the CSX600.

In Chapter 6, the overall system design and static scheduling are illustrated. Chapter 7 describes the algorithms for the 8 key ATC real-time tasks implemented on the CSX600, and how they can be scaled up to a realistic size ATC system. Experimental results are presented in Chapter 8: a comparison of worst case times of some of the most important ATC tasks for the 8-core MIMD, the CSX600 and the AP, a comparison of predictability and missed deadlines, and timings for the 8 tasks. The observations and consequences of the preceding results are in Chapter 9.

Chapter 10 gives the final conclusions and discusses some work planned for the future.

CHAPTER 2

Background Information

This chapter introduces and discusses some terminology and background information used for this dissertation. Some basic knowledge of parallel algorithms has been introduced in [25–27].

2.1 Flynn’s Taxonomy and Classification

Flynn's taxonomy is the best-known classification scheme for parallel computers, in which a computer's category depends on the parallelism it exhibits in its instruction stream and data stream. A process is a sequence of instructions (the instruction stream, denoted I) that manipulates a sequence of operands (the data stream, denoted D); the instruction stream (I) and the data stream (D) can each be either single (S) or multiple (M). Hence Flynn's classification results in four categories: SISD, MISD, MIMD and SIMD [26,28]. Because SISD and MISD have nothing to do with this dissertation, they are not introduced.

2.1.1 Multiple Instruction Stream Multiple Data Stream (MIMD)

MIMD is generally considered to be the most important class and includes most computers currently being built. Processors are asynchronous, since they can independently execute different instructions on different data sets simultaneously. Communication is handled either through shared memory (such systems are called multiprocessors) or by use of message passing (such systems are called multicomputers) [25, 26, 29, 30]. MIMDs have been considered by most researchers to include the most important and least restricted computers compared with SIMD computers. However, there are numerous NP-hard problems for MIMDs in real-time applications [6,8,9], including dynamic scheduling, load balancing, race conditions, data dependency, non-determinism, etc. See Garey and Johnson, "Computers and Intractability: a Guide to the Theory of NP-Completeness" [8], and related references [31–34]. MIMD parallel programming is normally optimized for average case performance, and typically the worst case is not considered [8,9].

2.1.2 Single Instruction Stream Multiple Data Stream (SIMD)

Most early parallel computers had a SIMD-style design. Figure 1 shows the abstract model of a SIMD. A SIMD has an instruction stream (IS), also called the control unit, which stores the program to be executed. Each PE contains an arithmetic logic unit (ALU) to perform arithmetic and logical operations, and its local memory keeps only data, not the program. The IS compiles the program and broadcasts the executable commands to the PEs (processing elements). The PEs are very simple and are essentially ALUs. The active PEs execute the IS steps synchronously. All active processors execute the same instructions synchronously, but on different data. [25,26,29,30]

Figure 1: SIMD Model

There are three basic types of parallel systems that have been called SIMD [26, 29]. The first is the traditional SIMD and includes the Goodyear Aerospace MPP, Thinking Machines CM-2, and the MasPar. These parallel computers perform the same operation on multiple data values simultaneously. The second type is called vector machines or vector processing machines. They involve the use of pipelined processors and include the large vector machines. The third and most recent SIMD type are systems that are sometimes collectively called short vector machines. They evolved from desktop computers as these became more powerful and capable of supporting real-time gaming and video programming. Our focus here is strictly on the traditional SIMD type. Of these three types, only the associative SIMDs can process sequences of both data parallel and scalar operations efficiently and switch between them with no conversion or low level overhead.

MIMD computers are generally believed to be more powerful and more performance/cost effective compared with SIMD computers. Therefore, current research in various application areas, including real-time systems, has put a great deal of emphasis on the use of MIMDs [3,9,35–37]. As noted in Section 2.1.1 above, MIMDs have many difficulties in applications. However, use of SIMD computation can avoid essentially all of these problems [3, 6, 7, 12, 30]. Since only one control unit exists in a SIMD, there is no need for synchronization. The major advantage of SIMDs is their simplicity. Their programs are easy to write, debug, and optimize. One of the complaints by programmers about parallel computers is that they are difficult to program. This is not true of SIMDs.

Because of the synchronous nature of SIMDs, the flow control of SIMDs is sequential and

at exactly one place in the program text. There is no need to consider numerous process

interactions. Additionally, there is only one program to write and debug and data are

usually distributed among PEs naturally. This dissertation will use an enhanced SIMD

to solve a large-scale real-time system and show its advantages over MIMD.

2.2 Real-Time Systems

A real-time system is distinguished from a traditional computing system in that its computations must meet specified timing requirements. A real-time system executes tasks to ensure not only their logical correctness but also their temporal correctness.

Otherwise, the system fails no matter how accurate a computation result is. [3,9,34–37]

A real-time task is ”an executable entity of work that, at minimum, is characterized by a worst-case computation time and a time constraint or deadline” [35]. A job is ”an

instance of a task or an invocation of a task” [35]. When a task is invoked, a job is

generated to execute this task with the given conditions. The release time of a task

is the time that the task (or a job of the task) is ready to be executed. A real-time

task can be either periodic, which is activated (released) in a regular interval (period),

or aperiodic, which is activated irregularly in some unknown and possibly unbounded

interval. Aperiodic tasks have either soft or no deadlines, and sporadic tasks have hard

deadlines. Typically, real-time scheduling can be static or dynamic. Static scheduling

refers to the case that the scheduling algorithm has complete knowledge a priori about

all incoming tasks and their constraints such as deadlines, computation times, shared

resource access, and future release times. In contrast, in dynamic scheduling, the system

has knowledge about the currently active tasks but it does not have knowledge about

new task arrivals. Thus the scheduling algorithm has to be designed to change its task

schedule over time.

A system period P is the time during which the set of all tasks must be completed

[9, 35–37]. The system deadline D is the constraint time for a system period. A task deadline d is the time constraint for an individual task. The lateness of a task is the elapsed time between the task's release and its execution. Deadlines can be hard, firm, or soft. A hard deadline is a time constraint whose violation will result in disastrous consequences. A firm deadline is a time constraint whose violation will produce useless results but cause no severe consequences. A soft deadline is a time constraint for which a result produced after the deadline is degraded but may still be useful for a limited time period. Because it is not tolerable to miss deadlines in ATC systems, we consider only hard deadlines in our work. According to [9], a static schedule can be used if the summation of task times for the system period is less than the system deadline D; otherwise, a static schedule is not feasible.
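
To make the feasibility condition above concrete, the following small C sketch checks whether the summed worst-case task times fit within the system deadline D. The task times and deadline are purely illustrative values, not measurements from our ATC system.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical worst-case task times (ms) for one system period. */
        double worst_case_ms[] = {120.0, 35.0, 60.0, 15.0};
        double D_ms = 500.0;            /* system deadline for the period */
        double sum = 0.0;
        for (int i = 0; i < 4; i++)
            sum += worst_case_ms[i];
        /* A static schedule is usable only if the total is below D. */
        if (sum < D_ms)
            printf("static schedule feasible: %.1f ms < %.1f ms\n", sum, D_ms);
        else
            printf("static schedule not feasible: %.1f ms >= %.1f ms\n", sum, D_ms);
        return 0;
    }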

A primary concern for a real-time system is how to schedule these tasks to ensure they meet various constraints, including deadlines. Depending on different criteria, such as minimizing the sum of all task computation times or minimizing the maximum lateness, different scheduling algorithms can be designed [3, 9, 34–37]. An optimal scheduling algorithm is one that may fail to meet a deadline only if no other scheduling algorithm can meet the deadline [34,38,39]. There are three scheduling approaches. The first is the cyclic executive, which uses static analysis and static scheduling [9, 35]. All tasks, times and priorities are given before system startup, and it is time-driven, i.e.,

schedules are computed and hardcoded before system startup. We use this approach for

our AP solution to ATC. The second approach is fixed priority scheduling which is static

analysis and dynamic scheduling. All tasks, times and priorities are given before system

startup, and it is priority-driven, dynamic scheduling, i.e., the schedule is constructed

by the OS scheduler at run time, e.g., Rate Monotonic (RM) Scheduling [38, 39]. The

third one is dynamic priority scheduling, which assigns priorities based on the current

state of the system. The examples are: Least Completion Time (LCT), Earliest Deadline

First (EDF), and Least Laxity (Laxity is deadline minus computation time minus current

time) First (LLF) algorithms [9,35–37].

Although LCT, EDF, LLF and RM scheduling algorithms can be optimal [9,35,38,39], as real-time systems become larger and tasks become more sophisticated, it is impossible for only one processor to execute them and guarantee real-time properties. The current trend is to use MIMD systems to schedule real-time tasks [40]. Unfortunately, it has been proved in long-established theory that almost all real-time scheduling problems using MIMDs are NP-complete [8,9,32,39]. No optimal scheduling algorithm has ever been discovered for almost all cases on MIMD systems. Researchers have been driven to using heuristics to design scheduling algorithms on MIMD systems.

However, these heuristics are usually expensive in that they consume a lot of computation time, and sometimes additional hardware support such as a scheduling chip is needed.

Moreover, use of heuristics will cause systems to be inherently unpredictable [3,8,39].

2.3 Air Traffic Control (ATC)

The Air Traffic Control (ATC) system is a real-time system that continuously moni- tors, examines, and manages space conditions for thousands of flights by processing large volumes of data that are dynamically changing due to reports by sensors, pilots, and controllers, and gives the best estimate of position, speed and heading of every aircraft in the environment at all times [1–3]. The ATC software consists of multiple real-time tasks that must be completed in time to meet their individual deadlines. By performing various tasks, the ATC system keeps timely correct information concerning positions, ve- locities, contiguous space conditions, etc, for all flights under control. Since any lateness or failure of task completion could cause disastrous consequences, time constraints for tasks are critical.

Data inputs come mainly from radar reports and are stored in a real-time database.

In our working prototype of the ATC system, a set of radar reports arrive every 0.5 second. Multiple reports may return redundant data on some aircraft. All tasks must be completed before new report data arrive for the next cycle.

In order to present our work more clearly, we briefly describe our analytical prototype

of the ATC system and ATC tasks in this section. ATC task characteristics and the worst

case ATC environment that is based on real world application are listed first, then ATC

tasks and task examples are explained. Some details have been discussed in [4–6].

2.3.1 Task Characteristics

Besides time constraints, there are other assumptions for ATC tasks in our ATC system. Before specific ATC tasks are described in Section 2.3.3, we give general characteristics of an ATC task [3,4].

• All tasks are periodic. Although each task has its own deadline, all of them must

be completed by the system deadline D, or the system period P.

• All deadlines are known at the task release time.

• Aperiodic jobs are handled by a special task in a particular time slot in every cycle.

• Tasks are independent in that there is no synchronization between them and no shared

resources. The static schedule fixes release times and deadlines for each task.

• Overhead costs for interrupt handling are included in each task cost.

• Task execution is non-preemptive.

• Task deadlines are all hard and critical. None can be preempted or deleted without

possible adverse effects.

2.3.2 The worst-case environment of ATC

The US National Airspace System (NAS) is divided into 20 airspace regions called ATC enroute

centers in the lower (or contiguous) U.S. states plus one each for Hawaii and Alaska.

Each center is divided into sectors. A controller may control one or more sectors in a

center. In the ATC center we consider, there are 600 controllers. The assumed maximum

number of aircraft being tracked by one air traffic control center is 4,000 controlled IFR

(Instrument Flight Rules) aircraft in the current controller center and 10,000 uncontrolled

VFR (Visual Flight Rules) aircraft and adjacent center controlled IFR, for a total of

14,000 aircraft [3–5]. Because of the multiplicity of sensors, we assume there are 12,000 radar reports coming into the system per second.
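
For reference, the worst-case parameters assumed in this section can be collected as compile-time constants. The identifier names below are ours and purely illustrative; the values simply restate the assumptions above (and the 0.5 second radar report period from Section 2.3).

    /* Worst-case ATC environment assumed in this dissertation (Section 2.3.2). */
    #define ATC_CONTROLLERS          600    /* controllers in the center              */
    #define ATC_IFR_AIRCRAFT        4000    /* controlled IFR aircraft                */
    #define ATC_VFR_AIRCRAFT       10000    /* uncontrolled VFR + adjacent-center IFR */
    #define ATC_TOTAL_AIRCRAFT     14000    /* total aircraft tracked                 */
    #define ATC_RADAR_REPORTS_SEC  12000    /* radar reports entering per second      */
    #define ATC_RADAR_PERIOD_SEC     0.5    /* new report set every half second       */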

2.3.3 ATC Tasks

An ATC system has to process a large amount of data coming from radar sensors and provide accurate flight information in highly constrained time. The various processing procedures can be grouped into the following eight major tasks [4, 5]: report correlation and tracking, cockpit display, controller display update, sporadic requests, automatic voice advisory, terrain avoidance, conflict detection and resolution (CD&R) and final approach

(runway optimization). The 8 tasks are used to represent the workload of ATC, as can be seen in the following references: FAA Grants for Aviation Research Program Solicitation [16], FAA's

NextGen Implementation Plan [17], and the FAA 1963 ATC Specifications [18]. Reference [18] indicates that the tasks we selected were similar to the ones selected in 1963 for implementation by industries interested in competing for the job of implementing ATC for the FAA. Reference [16] describes the FAA's effort to improve the capacity of the airspace while maintaining high safety standards and aircraft safety technology for conflict detection and resolution. These indicate that tasks of the type we have selected are key to ATC implementation. Further, the NextGen plan [17] discusses the new standards for ATC, including developing capabilities in traffic flow management, dynamic airspace configuration, separation assurance, super density operations, and airport surface operations. The purpose is to achieve a safe, efficient and high-capacity airspace system. The

following subtopics in [17] address current research that needs innovation: comprehen-

sive analysis of uncertainties in the National Airspace System; air traffic management

functional allocation using advanced computing and networking; guarantee safety of air-

to-air and air-to-ground. The type of activities discussed there will require use of tasks

similar to the ones that we have selected for implementation, which shows that our eight

ATC tasks have already captured most of the workload of ATC, and nothing major is

missing that may impact the ATC performance. In our careful search of literature for

publications on ATC, the ones that kept re-occurring in the ATC literature we found

were all included in the tasks we chose for our benchmark.

2.4 An Associative Processor for ATC

An associative processor (AP) [4,13] is a SIMD system with several additional useful hardware enhancements that simplify supporting the real-time ATC system. The name

”associative” is due to the computer's ability to locate items in the memory of the PEs by content rather than location. Figure 2 shows the architecture of the AP. It has one control unit, also called the instruction stream, and thousands of PEs (cells). Each PE has its own arithmetic logic unit (ALU) and memory. The first AP system was designed by Kenneth Batcher and implemented in the Goodyear Aerospace STARAN computer, which was specifically designed to support ATC [11–13]. A second generation STARAN called the ASPRO [14], also an AP designed at Goodyear Aerospace, was used extensively by the Navy for

Airborne Air Defense Systems applications for over ten years [15].

Figure 2: AP Architecture

The hardware enhancement that is required for an AP is a broadcast/reduction network such as that described in [10]. This network supports the execution of the operations below, often called the associative operations, in constant time. A list of the associative operations follows [10,15,41,42]:

• Global MAX and MIN: finds all instances where a maximal (or minimal) value

is stored by a processor at a fixed memory location. The processors whose value

is maximal (respectively minimal) are active at the end of this operation and are

called responders. The remaining processors participating in this activity are called

non-responders.

• AND and OR: finds the global reduction of Boolean values stored in the same

memory location of each active processor.

• Associative search: finds all instances where a search pattern is matched by the

content stored by a processor at a fixed memory location. The active processors

whose content matches the search pattern are the responders and the processors

which did not match the search pattern are the non-responders.

• Any-Responder: the ANY operation determines if there is at least one responder

after an associative search.

• Pick-One: selects one responder from the set of the responding processors. It is

implemented on ClearSpeed using GET and NEXT operations.

• Broadcast data or instructions from the control unit to PEs.

We introduce a new term that is called jobset for the AP [3,4]. Observe that an AP is a set processor, as each of its instructions operates on a set of data. As a result, the AP can execute a set of jobs involving the same basic operations on different data simultaneously.

This set of jobs is called a jobset. In contrast to the common understanding of a job as an instance of a task on a multiprocessor (MP), on the AP a task is considered to be a sequence of jobsets.
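
To illustrate the semantics of the associative operations listed above, the following sequential C sketch treats each array element as one PE's record. It is our own illustration, not ClearSpeed or AP code: on a real AP each of these steps is a single constant-time operation of the broadcast/reduction network, whereas here a plain loop only models the per-PE effect.

    #include <limits.h>
    #include <stdio.h>

    #define NUM_PE 8   /* illustrative number of PEs */

    /* One record per PE: the AP locates records by content, not by address. */
    struct pe_record { int flight_id; int altitude; int active; };

    int main(void) {
        struct pe_record pe[NUM_PE] = {
            {101, 32000, 1}, {102, 5000, 1}, {103, 5000, 1}, {104, 41000, 1},
            {105,  9000, 0}, {106, 5000, 1}, {107, 28000, 1}, {108, 36000, 1}
        };
        int responder[NUM_PE];

        /* Associative search: mark every active PE whose altitude matches 5000 ft,
         * and form the Any-Responder reduction at the same time. */
        int any = 0;
        for (int i = 0; i < NUM_PE; i++) {
            responder[i] = pe[i].active && pe[i].altitude == 5000;
            any |= responder[i];
        }

        /* Global MIN over all active PEs. */
        int min_alt = INT_MAX;
        for (int i = 0; i < NUM_PE; i++)
            if (pe[i].active && pe[i].altitude < min_alt)
                min_alt = pe[i].altitude;

        /* Pick-One: select a single responder (lowest-indexed here). */
        int picked = -1;
        for (int i = 0; i < NUM_PE && picked < 0; i++)
            if (responder[i]) picked = i;

        printf("any responder = %d, min altitude = %d, picked PE = %d\n",
               any, min_alt, picked);
        return 0;
    }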

2.5 OpenMP

OpenMP is a parallel programming model for shared memory multiprocessors [43–46]. It consists of a set of compiler directives and routines, and it makes it relatively easy to create multi-threaded applications in C, Fortran, and C++. In this dissertation, we focus on the fork-join programming model employed by the OpenMP system. To the best of our knowledge, prior work has not seriously studied the fork-join task model in the context of real-time systems [47]. Fork-join is a popular parallel programming paradigm employed in systems such as Java and OpenMP. The basic fork-join tasks are shown in Figure 3.

Each basic fork-join task begins as a single master thread that executes sequentially

Figure 3: Fork-Join Task Model

until it encounters the first fork construct, where it splits into multiple parallel threads

which execute the parallelizable part of the computation. After the parallel execution

region, a join construct is used to synchronize and terminate the parallel threads, and

resume the master execution thread. This structure of fork and join can be repeated mul-

tiple times within a job execution. In this dissertation, we use the basic fork-join task

model as described above. Many real-time systems, such as radar tracking, autonomous

driving, and video surveillance, exhibit a data parallel nature that lends itself easily to the

fork-join model. As the problem sizes scale and processor speeds saturate, the only way

to meet task deadlines in such systems would be to parallelize the computation. We focus

on preemptive fixed-priority scheduling algorithms, and dynamic priority scheduling for

fork-join tasks is beyond the current scope of this work. We also restrict our attention to

tasks whose number of threads does not exceed the number of cores of our MIMD system. Although the number of threads in a parallel region can be dynamically adjusted by the OMP_NUM_THREADS environment variable in specific implementations of OpenMP, we do not consider such a task model in this work.
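
A minimal OpenMP fork-join sketch in C, following the structure in Figure 3, is shown below. The array and the doubling of its elements are placeholders for the parallelizable part of a task.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    int main(void) {
        double a[N];

        /* Sequential part, executed by the master thread. */
        for (int i = 0; i < N; i++) a[i] = i;

        /* Fork: the master thread spawns a team that shares the loop iterations. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = a[i] * 2.0;      /* parallelizable part of the computation */
        /* Join: implicit barrier here; the master thread resumes sequentially. */

        printf("a[N-1] = %f, threads available = %d\n", a[N - 1], omp_get_max_threads());
        return 0;
    }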

While it is relatively easy to quickly write a correctly functioning program in OpenMP,

it is more difficult to write a program with good performance. There are some techniques to improve the performance [43,44].

2.5.1 Avoid False Sharing

An important efficiency concern is false sharing, which can also severely restrict scalability. It is a side effect of the cache-line granularity of the cache coherence mechanism implemented in shared memory systems. The cache coherency mechanism keeps track of the status of cache lines by appending ”state bits” that indicate whether the data on the cache line is still valid or is outdated. Any time a cache line is modified, the cache coherence mechanism notifies other caches holding a copy of the same line that their line is invalid. If data from that line is needed, a new updated copy must be fetched. A problem is that the state bits do not keep track of which part of the line is outdated, but indicate that the whole line is outdated. As a result, a processor cannot detect which part of the line is still valid and instead requests a new copy of the entire line. Consequently, when two threads update different data elements in the same cache line, they interfere with each other. This effect is known as false sharing.

False sharing is likely to significantly impact performance under the following conditions: shared data is modified by multiple threads; the access pattern is such that multiple threads modify the same cache lines; these modifications occur in rapid succession. An extreme case of false sharing is illustrated by the example shown in Algorithm 1.

Suppose all threads contain a copy of the vector a with 8 elements in it and thread 0 modifies a[0]. Everything in the cache line with a[0] is invalidated, so a[1], ..., a[7] are not accessible until the cache is updated in all threads, even though their data has not changed and is still valid. So threads cannot access a[i] for i > 0 until each of their cache lines containing a[0], ..., a[7] has been updated.

Algorithm 1 Example of false sharing

    #pragma omp parallel for shared(Nthreads, a) schedule(static, 1)
    for (int i = 0; i < Nthreads; i++)
        a[i] += i;

In this case, we can solve the problem by padding the array, dimensioning it as a[n][8] and changing the indexing from a[i] to a[i][0]. Accesses to different elements a[i][0] are now separated by a cache line, and updating an array element a[i][0] does not invalidate a[j][0] for i ≠ j. Although this array padding works well, it is a low-level solution that depends on the size of the cache line and may not be portable. In a more efficient implementation, the variable a is declared and used as a private variable for each thread instead of having thread i access the global array position a[i]. In general, using private data instead of shared data significantly reduces the risk of false sharing. Unlike padding, this is a portable solution.
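
The sketch below contrasts the two remedies just described. It is illustrative only: the padding width of 16 ints assumes a 64-byte cache line (the a[n][8] form in the text corresponds to a smaller line), and the function names are ours.

    #include <omp.h>

    #define NTHREADS 8

    /* Remedy 1: pad the shared array so each thread's element occupies its own
     * cache line (low-level; depends on the cache-line size, as noted above). */
    void padded_update(void) {
        static int a[NTHREADS][16];        /* 16 ints = 64 bytes per row */
        #pragma omp parallel for schedule(static, 1)
        for (int i = 0; i < NTHREADS; i++)
            a[i][0] += i;                  /* threads no longer share a cache line */
    }

    /* Remedy 2 (portable): keep the working value in a private variable and
     * touch shared data only once per thread. */
    void private_update(int *result) {
        #pragma omp parallel
        {
            int local = omp_get_thread_num();  /* private to each thread */
            local += 1;                        /* all intermediate work stays local */
            #pragma omp atomic
            *result += local;                  /* single shared update per thread */
        }
    }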

2.5.2 Optimize Barrier Use

Even efficiently implemented barriers are expensive. The nowait clause can be used to eliminate the barrier that results from several constructs, including the loop construct.

We must do this safely, especially considering the order of reads and writes to the same portion of memory. A recommended strategy is to first ensure that the OpenMP program works correctly and then use the nowait clause where possible, carefully retaining explicit barriers at points in the program where they are needed.
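
The following sketch shows the nowait clause removing the implied barrier between two work-shared loops. The arrays b and c are hypothetical; this is safe only because the second loop does not read anything the first loop writes.

    #include <stddef.h>

    void scale_two_arrays(double *b, double *c, size_t n) {
        #pragma omp parallel
        {
            /* nowait removes the implied barrier after this loop, so threads that
             * finish early may start on the second loop immediately. */
            #pragma omp for nowait
            for (size_t i = 0; i < n; i++)
                b[i] *= 2.0;

            #pragma omp for
            for (size_t i = 0; i < n; i++)
                c[i] += 1.0;
        }   /* barrier at the end of the parallel region still synchronizes everyone */
    }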

2.5.3 Avoid the Ordered Construct

The ordered construct ensures that the corresponding blocks of code within a parallel loop are executed in the order of the loop iterations. This construct is expensive to implement. The runtime system has to keep track of which iterations have finished and possibly keep threads in wait state until their result is needed, which slows program execution. The ordered construct can often be avoided, e.g., it may be better to wait and perform I/O outside of the parallelized loop.
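
A sketch of the alternative mentioned above: compute in parallel into a buffer, then perform the ordered I/O sequentially after the loop instead of using an ordered construct inside it. The compute() function is a stand-in for the real per-iteration work.

    #include <stdio.h>
    #include <stdlib.h>

    static double compute(int i) { return i * 0.5; }   /* placeholder work */

    void process(int n) {
        double *out = malloc(n * sizeof *out);

        /* Parallel loop with no ordered construct: each iteration owns its slot. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            out[i] = compute(i);

        /* The order-sensitive part (I/O) is done after the loop. */
        for (int i = 0; i < n; i++)
            printf("%d: %f\n", i, out[i]);

        free(out);
    }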

2.5.4 Avoid Large Critical Regions

A critical region is used to ensure that no two threads execute a piece of code simultaneously. It can be used when the actual order in which threads perform the computation is not important. However, the more code in the critical region, the greater the time that threads needing this critical region will have to wait. Therefore the programmer should minimize the amount of code enclosed within a critical region. If a critical region is very large, program performance will become poor. An alternative is to rewrite the piece of code and separate, as far as possible, those computations that cannot lead to data races from those operations that do need to be protected.
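
A sketch of shrinking a critical region: the expensive, race-free computation runs outside, and only the genuinely shared update is protected. The function expensive_work() and the shared sum are illustrative. (For a simple sum, a reduction clause would avoid the critical region entirely; the point here is only the placement of the protected code.)

    static double expensive_work(int i) { return i * 1.0e-3; }  /* no shared data */

    double accumulate(int n) {
        double sum = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double v = expensive_work(i);   /* purely local work, outside the region */
            #pragma omp critical
            sum += v;                        /* only the shared update is serialized */
        }
        return sum;
    }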

2.5.5 Maximize Parallel Regions

Indiscriminate use of parallel regions may lead to suboptimal performance. There are significant overheads associated with starting and terminating parallel regions. Large parallel regions offer more opportunities for using data in cache and provide a bigger context for other compiler optimizations. For example, if multiple parallel loops exist, we must choose whether to enclose each loop in an individual parallel region or create

one parallel region encompassing all of them.
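
A sketch of the second choice: one parallel region enclosing several work-shared loops, so the thread team is created once rather than forked and joined per loop. The arrays x and y are hypothetical.

    void update(double *x, double *y, int n) {
        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 0; i < n; i++)
                x[i] = 2.0 * x[i];

            #pragma omp for
            for (int i = 0; i < n; i++)
                y[i] = x[i] + 1.0;   /* safe: the implied barrier after the first
                                        loop guarantees x is fully updated */
        }
    }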

2.5.6 Avoid Parallel Regions in Inner Loops

Another common technique to improve the performance is to move parallel regions out of the innermost loops. Otherwise, we repeatedly incur the overheads of the parallel construct. For example, in the loop nest shown in Algorithm 2, the overheads of the parallel region are incurred n² times. A more efficient solution is indicated in Algorithm 3.

The pragma omp parallel for construct is split into its constituent directives and the pragma omp parallel construct has been moved to enclose the entire loop nest. The pragma omp for construct remains at the inner loop level. Depending on the amount of work performed in the innermost loop, the parallel construct overheads are minimized.

Algorithm 2 Parallel region embedded in a loop nest

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            #pragma omp parallel for
            for (int k = 0; k < n; k++) {
                /* ... */
            }
        }
    }

Algorithm 3 Parallel region moved outside of the loop nest

    #pragma omp parallel
    {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                #pragma omp for
                for (int k = 0; k < n; k++) {
                    /* ... */
                }
            }
        }
    }

2.5.7 Improve Load Balance

In some parallel algorithms, there is a wide variation in the amount of work threads have to do. When this happens, threads wait at the next synchronization point until the slowest thread arrives. One way to avoid this problem is to use the schedule clause with a non-static schedule. The problem is that the dynamic and guided workload distribution schedules have higher overheads than the static scheme. However, if the load imbalance is severe enough, this cost is offset by the more flexible allocation of work to the threads.

It is a good idea to experiment with these schemes, as well as with various values for the chunk size. In general, when possible, a good rule for MIMD parallelism is to overlap computation and communication so that the total time taken is less than the sum of the times to do each of these. OpenMP can use a dynamic schedule to overcome this problem and obtain a significant improvement: the thread performing I/O joins the others once it has finished reading data and shares in any computations that remain at that time, and it will not cause them to wait unless they have performed all of the work by the time the I/O is done. After this, one thread writes out the results and another thread starts reading, while the others can immediately move on to the computation.
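
A sketch of the schedule clause choices discussed above. The function work_for() is hypothetical and stands for per-iteration work whose cost varies strongly with i; the chunk size of 4 is simply a starting point for the experimentation recommended above.

    extern double work_for(int i);   /* hypothetical, highly irregular work */

    double irregular_sum(int n) {
        double sum = 0.0;
        /* dynamic (or guided) hands out chunks as threads become free, trading
         * some scheduling overhead for better load balance. */
        #pragma omp parallel for schedule(dynamic, 4) reduction(+ : sum)
        for (int i = 0; i < n; i++)
            sum += work_for(i);
        return sum;
    }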

2.5.8 Using Compiler Features to Improve Performance

Compilers use a variety of analyses to determine which optimization techniques can be used.

They also apply a variety of techniques to reduce the number of operations performed and

reorder code to better exploit the hardware. Once the correctness of a numerical result is

assured, experiment with compiler options to squeeze out an improved performance. We

used the compiler optimization flags '-mfpmath=387,sse -msse3 -ffast-math -O3' to enable

Intel's Streaming SIMD Extensions (SSE), which is a SIMD instruction set extension to the x86 architecture.

CHAPTER 3

Survey of Literature

3.1 Previous Work on ATC

The current airspace has a rigid, centralized structure, and aircraft are required to

follow predefined airways or instructions given by controllers [1,2,48]. The Federal Avi-

ation Administration (FAA) has put tremendous efforts on finding a predictable and

reliable system to achieve free flight which would allow pilots to choose the best path to

minimize fuel consumption and time delay rather than following pre-selected flight cor-

ridors [1,2,49]. This requires clear and unambiguous methods for maintaining safe sepa-

ration between aircraft. Therefore, conflict detection and resolution (CD&R) emerges as

a critical issue for the implementation of free flight. In the past, solutions to this problem have been implemented on multicomputer systems with records for various aircraft distributed over the memory of the processors in this system. The distributed nature of this dynamic ATC system and the necessity of maintaining data integrity while providing rapid access to this data by multiple instruction streams (IS) increase the difficulty and complexity of handling air traffic control. It is difficult for these MIMD implementations to satisfy reasonable predictability standards that are critical for meeting the strict certification standards needed to ensure safety for critical software components. Massive efforts have been devoted to finding an efficient MIMD solution to the ATC problems for more than forty years [4–6].


The performance of all CD&R algorithms available depends on aircraft state esti- mation according to the comprehensive survey of Kuchar and Yang [50]. The current

ATC system is based on a surveillance system that uses data from radar measurements to track aircraft. In the early versions of this system, human operators manually fol- lowed the blips on the video displays, but the increasing number of aircraft necessitated the development of automated tracking algorithms. The current algorithms in use for

ATC tracking are based on constant gain Kalman filters, known as α − β, α − β − γ

filters [51,52]. The major problem with the single Kalman filter is that it does not pre- dict well when the aircraft makes an unanticipated change of flight mode such as making a maneuver, accelerating, etc [53–56]. Many adaptive state estimation algorithms have been proposed [57–60]. The Interacting Multiple Model (IMM) algorithm [61, 62] runs two or more Kalman filters that are matched to different modes of the system in parallel.

It uses a weighted sum of the estimates from the bank of Kalman filters to compute the state estimate. IMM and its variants have been applied to single and multiple aircraft tracking problems in [57]. However, it becomes inaccurate for tracking multiple aircraft as the number of aircraft increases. Current MIMD implementations of this algorithm are computationally very intensive. Hwang et. al. [54] proposed that the flight mode likelihood function can be used to improve the estimation results of the IMM algorithm.

The likelihood function uses the mean of the residual produced by each Kalman filter.

A heuristic algorithm in [56] that evaluates correlation error values has been shown to provide better results than the Kalman filter [56].

To the best of our knowledge, all existing conflict detection algorithms are based on the continuous state information of the aircraft (see Kuchar and Yang [50] for a comprehensive survey of the CD&R algorithms). Menon et al. [63] formulate conflict resolution as a multi-participant optimal control problem. Using parameter optimization and state constrained trajectory optimization, they compute a conflict resolution trajectory for two different cost functions: deviation from the original trajectory and a linear combination of total flight time and fuel consumption. Their method results in 3D optimal multiple-aircraft conflict resolution. In general, the optimization process is computationally intensive and is difficult to implement in real-time. In [64], Krozel et al. propose one centralized strategy that is controller-oriented and two decentralized strategies that are user-oriented. In the centralized approach, a central agent analyzes the trajectories of the aircraft and determines resolutions. In the two decentralized strategies, each aircraft resolves its own conflicts as they are detected. In [49], Yang and Kuchar propose a conflict alerting logic based on sensor and trajectory uncertainties, with conflict probability based on Monte Carlo simulation. Chiang et al. [65] also propose CD&R algorithms. Paielli and Erzberger [66] and Prandini et al. [67] propose analytic algorithms for computing the probability of conflict. Many of the algorithms consider only two aircraft. For example, Krozel et al. [64] show that neither their centralized nor decentralized CD&R algorithms can guarantee safety for multiple aircraft as the number of aircraft grows. Furthermore, many algorithms propose optimization schemes that are not guaranteed to be completed within real-time deadlines.

flict alerting logic based on sensor and trajectory uncertainties, with conflict probability based on Monte Carlo simulation. Chiang et. al. [65] propose CD&R algorithms from the perspective of . Paielli and Erzberger [66] and Prandini et. al. [67] propose analytic algorithms for computing probability of conflict. Many of the algorithms consider only two aircraft. For example, Krozel et. al. [64] show that neither their centralized nor decentralized CD&R algorithms can guarantee safety for multiple aircraft when the number of aircraft is growing. Furthermore, many algorithms propose optimization schemes that are not guaranteed to be completed within real-time deadlines.

Due to the increasing number of FAA problems, FAA is inviting proposals for new and efficient conflict detection and resolution technologies [68]. 29

3.2 Classical MIMD Real-Time Scheculing Theory

There are some classical multiprocessor real-time scheduling theory [9, 37, 40]. John

Stankovic et.al [9]: ”...complexity results show that most real-time schedul- ing is NP-hard.” Mark Klein et. al.: [38]”One guiding principle in real-time system resource management is predictability. The ability to determine for a given set of tasks whether the system will be able to meet all the timing requirements of those tasks.”;

”...most realistic problems incorporating practical issues... are NP-hard.” Garey, Gra- ham and Johnson: [31]”...all but a few schedule optimization problems are considered insoluble... . For these scheduling problems, no efficient optimization algorithm has been found, and indeed, none is expected.”

John Stankovic et.al [9, 32, 34] and Garey et.al [8, 31] have identified a number of difficulties in scheduling periodic real-time tasks on multiprocessor, some good instances include dynamic scheduling, load balancing, race conditions, data dependencies, shared resource management, sorting, indexing, and cache and memory coherency problems, etc.

Garey et.al [8,31] have shown them to be multiprocessor NP-hard problems. In [8], the definition of multiprocessor is a parallel computer that uses or shared memory, so basically it is a MIMD. To avoid confusion, in this paper, we will treat multiprocessor and MIMD as being identical terms.

3.3 Other Similar Work

There are also some related works excluding ATC. Most of the results of real-time scheduling on MIMD are focused on the sequential programming model, where the prob- lem is to schedule many sequential real-time tasks on multiple processor cores. Parallel 30 programming models introduce a new dimension to this problem, where jobs may be split into parallel execution segments at specific points [47]. S.Guy et. al. [20] had used

SIMD parallel approaches for collision avoidance algorithms between multiple agents for real-time simulations. K. Park et. al. [69] have used CUDA to evaluate performance of image processing algorithms. The GPU has many SIMD PE groups on its chips and their approach has many similar ideas as ours. CHAPTER 4

AP Solution to ATC

4.1 Overview of AP Solution

Instead, we implement the ATC system on associative processor(AP), where interac- tions are much simpler and more efficiently controlled, due in large part to avoiding the need to coordinate the interactions of multiple instruction streams. Our previous papers have detailed the solution [3–6, 19, 21, 70]. Similar SIMD parallel approaches have been used for collision avoidance algorithms between multiple agents for real-time simulations in S.Guy et. al. [20].

All records for each aircraft will be stored in a single processor. Initially, assume each

processor will store the records for at most one aircraft since the memory size and speed

of processors in a large SIMD is typically small, due to cost restrictions. For ATC tasks,

an AP with n processors can execute n instances of the same task in essentially the same

time as it takes to execute one instance of this task. As long as there are no more than

one aircraft per processor, the running time for the AP remains essential the same as the

number of aircraft increase.

4.2 Advantages of AP over MIMD for ATC

Since SIMDs have only one instruction stream (IS) or control unit, control-type com-

munication between processors is completely eliminated. Communication between the

IS and processors occurs in constant time using its broadcast/reduction network. Data 31 32 communication between processors is completely deterministic and a tight upper bound can be calculated for the worst case. As a result, a numerical worst case time required for communications can be accurately calculated. The AP has some important advantages over MIMD including low overhead synchronization, deterministic hardware, much faster communication, predictable worst-case running time, much wider ”memory to processor” , and elimination of the following: race conditions, data dependencies, shared resource management, sorting, indexing, and cache and memory coherency problems, etc.

When solving a problem, MIMD systems often use software to solve additional prob- lems repeatedly which is not needed in a sequential solution of the original problem.

Examples of these types of this ”additional software” are dynamic scheduling, load bal- ancing, shared resource management, memory and cache coherency management, pre- emption, synchronization, priority inversion handling, sorting, indexing, multi-tasking and multi-threading management software, etc [6]. Several of these types of difficulties were identified in scheduling periodic real-time tasks on MIMD in [47]. A number of these are problems that have been shown to be multiprocessor NP-hard problems(e.g., [8]).

In [8], the definition of a multiprocessor is a parallel computer that uses message passing or shared memory, so basically it is a MIMD. To avoid confusion, in this dissertation we will treat multiprocessor and MIMD as interchangeable terms. SIMDs do not ordinarily need to use this additional software. Most sequential solutions to a problem can usually be used to create a similar AP solution with roughly the same number of lines of code. AP solutions are often shorter and simpler than the sequential solutions, as the constant-time associative search property of the AP can be used to eliminate the use of sorting and linked lists to organize and locate items. Also, the additional AP properties often

allow simpler and more efficient solutions for problems than with a SIMD that is not an

AP.

4.3 AP properties

We discuss some issues regarding AP properties in this section. One is whether the methods used in data transfer of the ClearSpeed is similar to data transfer between PEs and control unit memory in AP. The issue really comes down to whether the AP we plan to use can the necessary data transfers in sufficient time, as there are no transfer properties assumed for AP. According to [11, 71–78], the AP model is a general purpose computational model and does not make any assumptions regarding how the interconnection network is used. Often, the use of the interconnection network in an algorithm can be avoided by using the above associative operations, leaving the running time of this algorithm independent of the network used. Also, the AP does not make any assumptions regarding how data is transferred from the AP to outside buffers or between the control unit memory and the parallel memory. Clearly, these transfer times depend on the data size, and this increases as the application size (e.g., the number of aircraft) increases. While it is very efficient to transfer data between the mono and PE memory in ClearSpeed, we do not know if this rate of transfer would scale if a much larger ClearSpeed accelerator is built. We must point out the important fact that most and probably all of the non-AP SIMDs that have been proposed and built could not handle the ATC requirement because of I/O limitations. However, the previously built

AP machines STARAN [11–13] and ASPRO [14] overcame the I/O limitation by the

MDA (multidimensional access) memory [4,11–13,41] and the flip network(see Figure 4). 34

Figure 4: A Possible AP Architecture for ATC

The flip network is a corner turning network that provides access to a slice of memory

in a SIMD PE for I/O purposes [11, 41, 73]. The flip network in STARAN and ASPRO

was the key to providing high speed I/O movement. The placement of the flip network

between processors and their memory allowed very fast movement of data between an

outside buffer and the ASPRO/STARAN PE parallel memory. Corner-turned data and

the assignment of one record per processor allows multiple PEs to work together to

transfer large amounts of data rapidly between PEs and memories. The flip network and

MDA allow efficient movement of a record in one PE to an outside buffer. More about

this can be found in [41]. Moreover, Jerry Potter describes an alternate design in [41]

that allows multiple records to be transferred to an outside buffer. So the transfer of data

speed is not a bottleneck for building an associative processor to accommodate realistic

size ATC problems. See references [4, 11, 71–75, 78, 79] to find more information about

how to implement AP hardware.

Another issue involving AP properties is how the constant time operations can be supported in AP and how can we be sure they actually run in constant time. The ATC problem is essentially a dynamic database problem. Much of the data in this database changes rapidly, with many parts being updated every one-half second. In order to be able 35 to update and process this data, we need to have some functions that allow us to access, update, change, and add or delete records in this dynamic database very rapidly. In order to support these rapid database operations, we require that these operations execute in constant time. These are the additional functions we require a SIMD to possess in order for it to be called an associative processor. Here ”constant time” is defined to be the time that the sequential RAM model requires for a PE to access its memory, the time for

AND or OR operations, the time for addition or multiplication of word length numbers, etc. However, there is an additional issue here that does not arise for sequential constant time operations, namely that these functions may involve reductions of vectors with a component value from each processor. Clearly, if the number of processors is allowed to increase without bound, such constant time reductions are impossible. However, if we limit the number of processors to a practical size (e.g., bound by the number of atoms in the observable universe), it is shown in [10] that all of the constant time functions required for an associative processor are possible. However, these functions cannot use a typical interconnection network, as use of these will require non-constant time. Instead these constant time functions can be supported by use of a binary (or n-ary for n > 2) broadcast-reduction tree with nodes that have very limited computational capabilities. In fact, this is the way these constant time operations are supported in both STARAN and

ASPRO, which use a 4-nary tree for broadcasts and another 4-nary tree for reductions.

The broadcasting/reduction network on the STARAN is constructed using a group of resolver circuits [11,71,72,75]. It is important to keep in mind that all of the above basic operations are implemented in hardware. These features make the AP model a unique parallel computation model that is powerful and feasible. More information about this 36 can be found in [4,5,10,11].

The key to AP hardware design is [76–81]: 1) high memory-to-PE bandwidth, and

2) low level synchronous operation supported by the i) elimination of branches in low level loops, and ii) the elimination of low level barrier synchronism. In the STARAN and the ASPRO, these abilities were supported by corner turning the data and assigning one record per processor, the multi-dimensional array memory (and flip network) and mask register hardware. Specifically, the mask register allows additional parallel searching and processing on subsets of the original search with no branching for special cases. After the search phase of a search, process, retrieve (or SPR) cycle is complete, associative

SIMDs can either continue data parallel processing or select a single record to process sequentially with essentially no overhead. These attributes allow associative SIMDs to process all the records in a file using . More information about this and other issues addressed in this section can be found in the book titled Associative

Computing by Jerry Potter [41]. CHAPTER 5

Emulating the AP on the ClearSpeed CSX600

5.1 Overview of ClearSpeed CSX600

The ClearSpeed accelerator board shown in Figure 5 is a PCI-X card equipped with two CSX600 . The CSX600 board is a multi-core processor with two CSX600 coprocessors, each with 96 processing elements (PEs) connected in the form of a one- dimensional array. At present, we are only using one of the two coprocessors in order to obtain a more SIMD-like environment. This multi-core section is called a multi- threaded array processor (MTAP) core, and the architecture is shown in Figure 6. The programmer only has to provide a single instruction stream, and the instructions and data are dispatched to the execution units that have two parts: one is mono unit that functions as a control unit and processes sequential instructions, and the other is poly unit that has 96 PEs. At each step, all active PEs execute the same command synchronously on their individual data. Each PE has its own local memory of 6 Kbytes, a dual 64-bit

FPU, its own ALU, integer MAC, registers and I/O. The PEs operate at a clock speed of 210 MHz. The aggregate bandwidth of all PEs is specified to be 96 Gbytes/s, which is for on-chip memory. Since the parallel bandwidth between the PEs and their memory is

(number of PEs) × (memory-PE bandwidth of each PE), this bandwidth increases as the number of PEs increase. This provides an extremely wide bandwidth for SIMDs with a large number of PEs. This allows SIMDs to avoid the von Neumann bottleneck, since a large number of PEs can access their memory in the same step without any slowdown due 37 38

Figure 5: CSX600 accelerator board

Figure 6: MTAP architecture of CSX600

to message passing or shared memory access time. Further information on the hardware

architecture can be found in the documentation [82].

The ClearSpeed accelerator provides the Cn language as the programming interface for the CSX600 processors. It is very similar to the standard C programming language.

The main difference is that it introduces two types of variables, namely mono and poly variables. The mono variables are equivalent to common C variables and are used by the control unit. A poly value is a parallel variable, with the ith value stored in the ith processor; moreover, all these values are stored in the same memory location in their respective processor. Further information on the associated software can be found in the documentation [83, 84]. Here we will use three of the library functions for data transfer on the ClearSpeed. The command memcpym2p copies from mono to poly memory and the command memcpyp2m copies from poly to mono memory. The third one, swazzle, 39

exchanges data between adjacent PEs using the swazzle network, which is a ring network

connecting all the processors together. More details of usages of library functions can be

found in the documentation [84,85].

5.2 Emulating the AP on the ClearSpeed CSX600

Because the CSX600 is SIMD, so only the associative functions introduced in Section 2.4 need to be emulated. These associative functions have been implemented mostly by Dr. Kevin Schaffer on the CSX600 efficiently, in both the Cn language in- troduced in Section 5.1 and assembly language supported by CSX600 [19, 21–23]: AND and OR reductions across a Boolean poly variable; Associative search across a poly vari-

able; AnyResponder (following an associative search); MAX and MIN reductions across

a integer poly variable; PickOne (following use of AnyResponder). The source codes for

the ASC library for the ClearSpeed Cn language are posted on [86], and [79–81] have

explained the library more in detail. To evaluate their running time, we store 30 records

in each PE and perform each associative operation once for each of these 30 records. The

timing results in microseconds (µsec or 10−6 second) are shown in Table 1. The NA in the

table means that either they cannot be implemented in assembly or they are assembler

commands. Although they are not constant time, they are very efficient, and establish

that we can efficiently emulate an AP using CSX600. The reason that these functions are

not constant time on ClearSpeed is that they are not supported by a broadcast-reduction

network, but instead with the swazzle (or ring) network. Clearly, the time to use this

network increases linearly as the number of processors increases. 40

Table 1: Timings of Associative Functions Associative functions Timings in Cn(microsecond) Timings in assembly(microsecond) max 5.257 3.654 min 5.257 3.654 AND 7.024 NA OR 7.364 NA associative search 29.2 NA any 0.281 NA nany 15.147 8.245 get 13.116 7.876 next 13.032 7.816 broadcast 100 NA CHAPTER 6

ATC System Design

6.1 ATC Data Flow

The overall system design for an ATC system is shown in Figure 25. The executive box controls the single instruction stream of AP using static scheduling. All control paths are from ATC in the executive box to all of the tasks, e.g., report correlation and tracking, etc. Controller input simulates sporadic requests, e.g., weather change, controller input, etc. Radar reports data are simulated by data from data lines to two modems and transferred from the host to the CSX600 PEs. Tracks are simulated from

flight plans in the PEs. The radar reports and tracks are used for report correlation and tracking task. The outputs of tracking task are used for cockpit display, controller display update, terrain avoidance, conflict detection and resolution (CD&R) and final optimization. The results of terrain avoidance, CD&R and final approach are used for cockpit display and controller display update. The results of terrain avoidance and CD&R are used for automatic voice advisory that transfers results to automatic voice advisory driver in the host and produces voice output. The resolution advisories of CD&R task are sent to controllers.

6.2 Static Scheduling

The eight ATC real-time tasks can be statically scheduled on the CSX600 using the static schedule for the AP discussed in [4, 21–23]. The eight tasks and the periods 41 42

Period 0.5 sec 1 sec 4 sec 8 sec 1 T1 T2,T3,T4 2 T1 T5 3 T1 T2,T3,T4 4 T1 5 T1 T2,T3,T4 6 T1 T6 7 T1 T2,T3,T4 8 T1 T7 9 T1 T2,T3,T4 10 T1 T5 11 T1 T2,T3,T4 12 T1 13 T1 T2,T3,T4 14 T1 T8 15 T1 T2,T3,T4 16 T1 Table 2: Static Schedule for ATC Tasks

used for each are (1) report correlation and tracking is executed every 0.5 second; (2) cockpit display, (3) controller display update and (4) sporadic requests are executed every second; (5) automatic voice advisory is executed every 4 seconds; (6) terrain avoidance,

(7) conflict detection and resolution, and (8) final approach are executed every 8 seconds.

An 8 second period is split into 16 one-half second periods. Task 1 is executed during each half-second period. Tasks 2, 3 and 4 are executed during the 1st, 3rd, 5th, 7th,

9th, 11th, 13th, and 15th half-second periods. Task 5 is executed in the 2nd and 10th

half-second periods. Task 6 is executed in the 6th, task 7 is executed in the 8th and task 8 is executed in the 14th half-second period. Table 2 shows the static schedule more intuitively.

Figure 7 illustrates the static scheduling of ATC tasks. Time slots for individual tasks within a system major period (8 seconds) are carefully tailored. Each of them provides 43

Figure 7: Static schedule of ATC tasks implemented in AP sufficient time for the worst-case execution of the complete set of system tasks in the

ATC system. The periodic tasks are run at their release times and each is completed by its deadline.

6.3 Scaling up from CSX600 to a Realistic Size AP

We use the present ClearSpeed System to show that our ATC solution is feasible.

However, our purpose is to propose building an AP whose size is appropriate for ATC, not a larger CSX600 board or multiple CSX600 boards because some of the disadvantages in MIMD systems mentioned in subsection 4.2 may show up in a system with multiple

CSX600 boards. As described in Chapter 1, our ideal AP has at least 14, 000 PEs with only one aircraft per PE. Moreover, the memory size and speed of the PEs and of the control unit can be chosen to optimize the ability of this large AP to handle the realistic size ATC system. Although we are unable to prove it is possible to build this ideal

AP using our CSX600 simulation, STARAN and ASPRO have already met the goal of supporting 14, 000 aircraft simultaneously [4, 5, 11]. A similar processor, the MPP

[11,71–75], with 16, 384 PEs, was delivered in 1982. While the MPP was not an AP as it 44

did not have the hardware reduction network, this feature could easily have been included.

Even larger SIMD machines have been built. In particular, Thinking Machines CM-2

was delivered in 1987 and had 64K processors. Paracel developed a parallel processor

which was generally believed to be a SIMD that had one million processors.

Figure 25 shows the overall system design and data flow based on AP architecture.

Our solutions to ATC tasks for the CSX600 in Chapter 7 are easy to implement on the AP with 14, 000 PEs. References [4, 5, 11] have timing results of previous AP with

16, 000 PEs. For example, Table 3 in [4] shows that the STARAN AP executed a similar set of ATC procedures in 4.52 seconds, leaving 3.48 unused seconds remaining. That table is simplified to be Table 3 in this paper, which provides worst case execution times for statically scheduled ATC tasks with 14, 000 total aircraft, 12, 000 sensor reports/sec,

6, 000 controllers and 8 second major period. The data transfer of large records in real- time is not their bottleneck because an AP can bring data in and out efficiently (please see details in Section 2.4). Additional reasons that the AP model can solve the ATC problem efficiently are its constant time associative operations, SIMD execution style, and extremely wide memory bandwidth. These have eliminated a number of difficulties that have to be handled in MIMD architectures. So an AP with 14, 000 PEs using modern technology can be built and can easily meet all of the deadlines, as this has already been done in the past with older technology. 45

Table 3: Statically Scheduled Solution Time for Worst Case Environment of ATC

Tasks Period(sec) Proc Time(sec) Report Correlation & Tracking 0.5 1.44 Cockpit Display 1.0 0.72 Controller Display Update 1.0 0.72 Aperiodic Requests(200/sec) 1.0 0.4 Automatic Voice Advisory 4.0 0.36 Terrain Avoidance 8.0 0.32 Conflict Detection & Resolution 8.0 0.36 Final Approach(100 runways) 8.0 0.2 Summation of tasks in a period 4.52 CHAPTER 7

AP Solution for ATC Tasks Implemented on CSX600

This chapter describes the algorithms for each of the eight real-time tasks, report correlation and tracking, cockpit display, controller display update, sporadic requests, automatic voice advisory, terrain avoidance, conflict detection and resolution (CD&R) and final approach (runway optimization). The solutions are implemented on the CSX600 architecture, but it is easy to scale up from 96 PEs to an AP with 14, 000 PEs using similar algorithms. The best algorithm will always depend on the exact architecture of the AP.

For instance, the use of the swazzle network in CSX600 is more efficient here but in a true AP, ordinarily each radar report would be broadcast from the host to all PEs simultaneously, as this can be done in constant time on a true AP.

7.1 Report Correlation and Tracking

Report correlation and tracking is a task that correlates aircraft data about position as reported by radar and the predicted data of established tracks for the aircraft in the system [3–5]. A track is the best estimate of position and velocity of each aircraft under observation. This task is present in many command and control real-time systems and is executed every 0.5 second, so it is a major limitation in ATC performance. If total time consumed is considered, this is easily the ATC task that consumes the most time, as it is performed much more frequently than the other tasks. It is challenging because each report has to be compared with each track, and some aircraft are changing flight mode. 46 47

The main idea of this task is from [3–5, 87]. Since all these data are stored in a

shared relational database, the input data are two database relations: aircraft data from

radar reports(Relation R) and the predicted positional data of established tracks for the aircraft (Relation T ). The correlated radar reports are used to smooth the position and velocity of the tracks to obtain the next estimate of position and velocity. Given an unordered set of tracks, each report record must be evaluated with every track record in the system to assure a match (correlation) is not missed. Multiple matches are treated different from unique matches, and any report records that do not match a track record are entered as new tentative tracks.

It is challenging because each report has to be compared with each track, and some aircraft are changing flight mode. We use the SIMD architecture to do the task. First, we create a box for each report and each track to accommodate uncertainties of report and track. The two database relations R and T contain information about the x and y coordinates of points in 3-D space. The objective of the task is to determine the join of the two relations as the intersection of two boxes, which is a many-to-many join. Each box from R is evaluated for intersection with every box in T until all boxes from R have

been compared with all boxes in T.

As presented in [19, 21], the data structures of R and T are shown in Table 4 and

Table 5, in which positions of next period are predicted by smoothed current positions

and velocities. Let T have columns X, Y , j, X1, Y1, and q, and R have columns X, Y , r

and k. Here (X,Y ) is the position of aircraft on the records of T or R; j and r are used

to give the sizes of boxes developed for aircrafts in T and R, respectively. (X1,Y1) in T

is the reported position for the track record that is based on the current correlated radar 48

Table 4: Data Structures for Radar Reports Attribute Type Comments report id int Report identity r float Report box size X float X position Y float Y position

Hr(tk) int Altitude Match count int Number of correlated tracks Match id int ID of the correlated track k int Correlation flag

Figure 8: Track/Report correlation report. Both q and k are flags set during the correlation procedure. As shown in Figure 8, a box is developed around each point (x, y) with the four corner points (X ± r, Y ± r) in

T. A similar box is developed around each point (x, y) in R with the four corner points

(X ± j, Y ± j). r > 0 is based on uncertainties in the radar report, and j > 0 is based on uncertainties of each track to compensate turning or larger errors in the radar report.

Initially j =0.5 nautical mile for all tracks, and r =0.5 nautical mile for all radar reports.

Each box of the radar report in R is compared with each box of the track record in T. If an intersect is found between one report box and one track box, for example, r2 and t2 in Figure 8, then the report data is entered into X1 and Y1 of the correlated record in T and a correlation flag is set in column k, the match count of this report is incremented, 49

Table 5: Data Structures for Tracks Attribute Type Comments ID int Flight identity q int Track state(seven values) C int Error measures(3 values) j float Track box size report id int ID of correlated report X float Current X position Y float Current Y position

Ht int Current altitude

Vx(tk) float Current X velocity

Vy(tk) float Current Y velocity X1 float X position of correlated report

Y1 float Y position of correlated report

its ID is entered into the correlated track’s report id, and the ID of the track is entered

into the radar report’s match id. We calculate the distance between them, which is the

track’s shortest distance. If two tracks correlate to the same radar report, i.e., the radar

report’s match count > 1, then this radar report is discarded. If a second radar report

correlates with the same track, calculate the distance between the track and radar report

and call it the track’s current distance; if it is less than shortest distance, update the shortest distance to be this distance, and record this report’s position as a candidate report position for this track. Next, another radar report is broadcast to all PEs to compare to all of the track records using the same procedure above. After all report boxes have been compared with all track boxes, set error measure C = 1 for the tracks that have correlated reports. The details of the smoothing process have been explained in [4,5,19,21,87].

When all report boxes have been compared with all track boxes, a flight that is not 50

Figure 9: Search box size for flight maneuver

produced by noise might not correlate to any reports because the flight that it corresponds

to is accelerating, turning, maneuvering or by greater noise in the report. We increase

track box size for any track that has not correlated with a radar report to increase its

probability to intersect a radar report box as shown in Figure 9. First we double the box

sizes of tracks that do not correlate any reports, i.e, j = j × 2. Next, we apply the same algorithm to compare them with uncorrelated reports whose match count are 0. The

error measures C are set to 2 for tracks that correlate reports in this round. The process is repeated for all unmatched radar reports, which have the ”not match flag” of column q set in R. When no intersections are found, the process is repeated with j = j × 3.

If there still remain unmatched radar reports, new tentative tracks are started for the reports in order to detect arrival of any new flights. If reports are due to noise, they usually will be dropped in a few periods (several seconds) based on further evaluation of other radar reports. The details of this task are shown in Algorithm 4. This task is done both accurately and within deadline because of the SIMD features of AP.

If a track created for a potentially new aircraft does not correlate with any aircraft after several passes through Algorithm 4, it is viewed as being due to noise and dropped.

Note that if this algorithm were to be executed on an AP, it would need to be modified so 51

Algorithm 4 Algorithm for Aircraft Tracking 1: Radar reports are transferred from host to mono memory, then distributed from mono to PEs. 2: for i =1 → 96 in parallel do 3: Boxes are created around each radar report and each track in each PE to accom- modate report and track uncertainties. 4: Check intersection of each report box with every track box in each PE. 5: If there is an intersection, the radar report and the track are correlated. The match count of this report is incremented, which indicates that it correlates with one track, and its ID and positions are entered into the correlated track’s record. 6: All radar reports in each PE are transferred to next PE using the ring/swazzle network, and steps 3 to 6 are repeated. 7: end for 8: After the 96 iterations, all reports have been compared with all tracks. A track that is not produced by noise might not correlate to any reports because the aircraft that it corresponds to is maneuvering. 9: Double the box sizes of tracks that have not correlated with any reports to increase their probability of intersecting a report box and repeat steps of above to compare them with uncorrelated reports. 10: Triple the original box sizes of tracks that have not correlated yet, and run the algorithm again. 11: After 3 rounds, if there are still any uncorrelated reports, they are used to start new tracks.

that each radar report would be broadcast in constant time to all PEs and processed prior

to broadcasting the next radar report to the PEs. The modification of the algorithms

in this section to run on an AP will be easy but will depend on the exact architecture

of the AP. It is noted that, in the AP solution, each report record is tested with every

track record in one set operation that requires constant time. That is, the overall AP

time is O(n) where n is the number of reports in this period. On the other hand, the

MIMD process is O(n2) because it has |R|×|T | processes, assuming O(1) processors can access and update needed data from the dynamic data base and work on this problem simultaneously. In the worst case situation we anticipate 12, 000 reports against 14, 000 tracks. For the AP this is 12, 000 operations which is O(n), while in the MIMD it is

R(T − 1)/2 or 8.4 × 107 which is O(n2). It is noted that the number does not include 52

many operations that are needed by MIMD but not by AP: dynamic scheduling, load

balancing, data distribution, processor assignment, of data access, etc.

7.2 Cockpit Display

First the associative operation PickOne is used to select one aircraft. Next, the broadcast operation is used to broadcast the x, y and altitude coordinates of the plane picked in the previous step. For each of its aircraft records, each processor computes the x-distance, y-distance and altitude distance between the location of its aircraft and the location of the aircraft broadcast. Then the processor identifies the aircraft that are approaching this aircraft. This is done by using the conflict detection algorithm covered in Section 7.6.1, find all aircraft that will be within 2 × 2 nm in x and y and within

1000 feet in altitude in 2 minutes. Next, transfer these selected aircraft’s identity, x, y positions, altitude, velocity, heading, conflict information, etc to the display .

Finally, use the conflict resolution algorithm in section 7.6 to obtain a conflict advisory and transfer it to the server. This algorithm uses the SIMD architecture to parallelize the computation and to improve its efficiency.

7.3 Controller Display Update

This task is similar to cockpit display. We transfer the updated flight identity, posi- tions, altitude, speed, heading, etc from PEs to the ClearSpeed server, which plays the role of the controller display in this simulation. It uses the SIMD architecture to speed up the display process. 53

7.4 Automatic Voice Advisory

Automatic Voice Advisory (AVA) automatically advises an uncontrolled flight (VFR) of near term conditions of other aircraft and terrain by voice. This task is simulated by having the ClearSpeed server print advisories of conflict detection and resolution, terrain avoidance tasks, etc. For example, if there is an aircraft that is approaching the aircraft called, the message might be ”aircraft at 4 miles ahead, 4, 500 feet above, in 1 minute”; if the aircraft called is heading for a terrain, the message might be ”terrain, 4 miles, 3, 100

feet ahead”. We use an AP style of computation to do this efficiency.

7.5 Sporadic Requests

Sporadic requests include information requests or changes in data. For example, aircraft have to avoid an area that has bad weather, aircraft make maneuvers to avoid bad weather, or controllers make a request for runway usage, etc. This task is executed once every second. Although the requests are not processed immediately, they are processed very quickly. We simulate this task as follows. We first process the next unprocessed sporadic task (assuming there are more than one). If it is to divert aircraft so they will miss a storm area, all affected aircraft will be processed using the associative operation

PickOne to select one aircraft to redirect at a time.

7.6 Conflict Detection and Resolution (CD&R)

7.6.1 Conflict Detection

This paper considers a conflict to occur when two aircraft are predicted to be within

a distance of three nautical miles in x and y and within 1, 000 feet in altitude. To

assure timely evaluation we let the detection cycle be eight seconds, and we determine 54

the possibility of a future conflict between any pairs of aircraft within a 5 minute ”look

ahead” period (i.e., 300 seconds). [3–5]

All IFR flights(defined in section 2.3.2) are evaluated for their future relative space positions between each other; and each of these IFR flights is also evaluated against all VFR flights(defined in section 2.3.2) for their future space positions. In the worst case situation we anticipate 4, 000 IFR records and 10, 000 VFR records according to section 2.3.2. This means that every 8 seconds we must evaluate each of the 4, 000 IFR

flights with the other 13, 999 IFR and VFR flights in the worst environment. The process is essentially a recursive join on the track relation where the best estimate of each flights future position is projected as an envelope into future time. The envelope has some uncertainty in its future position that is a function of the track state. It is modified by adding 1.5 miles to each x, y edge of the future position (to provide a 3 mile minimum miss distance) and 1, 000 feet in the altitude (to provide a 2, 000 feet minimum miss height). For each future space envelope, its data is broadcast to all remaining PEs first.

Then an intersection of this envelope with every other space envelope in the environment can be checked simultaneously in constant time. First, all future flight envelopes are generated simultaneously, which takes O(1) steps. Then, the first ”trial envelope” of the 4, 000 IFR future flight envelopes is compared for possible conflict with all the other

13, 999 flight envelopes for a look-ahead period of 5 minutes. Thus the equivalence of maximum 13, 999 jobs is completed simultaneously in this jobset. This occurs because each of the 14, 000 records in the track table is simultaneously available to each of the

14, 000 PEs that are active in the AP. The next operation selects the second trial envelope and repeats the conflict tests against the remaining 13, 998 tracks. When a trial envelope 55

has been tested, it is marked ”done”, and future trial envelopes will exclude all prior trial

tracks. When the last of the 4, 000 trial envelopes is tested, the conflict detection will

have finished.

We implement the algorithm on CSX600 to emulate AP solution. The input data is from the track records in the PEs. We copy each track’s ID, 3D position Xt, Yt and

Ht, and velocity Vxt and Vyt at time t to the following variables, respectively, ID, Xc,

Yc, Hc, Vxc and Vyc in the poly structure trial in PEs. Initialize their time till which is

the aircraft’s earliest collision time with other aircraft to 300.00. The intuitive idea is to compare each trial aircraft with all the other ones, but we use the CSX600 architecture to parallelize the computation. The details of the conflict detection algorithm are shown in Algorithm 5 (see Figure 10), known as Batcher’s algorithm.

Algorithm 5 Algorithm for Conflict Detection 1: for i =1 → 96 in parallel do 2: if for each trial and track record in each PE, their flight IDs are different and altitudes are within 1000 feet then 3: Project their positions into 5 minutes ahead, add 1.5 to each x and y coordinate to provide a 3.0 minimum miss distance in each dimension. The x dimension case is shown in Figure 10. 4: Calculate the min x, max x, min y and max y for minimum and maximum intersection times in x and y dimensions, as shown in equations 1, 2, 3 and 4. 5: Find the largest minimum time time min and smallest maximum time time max across the two dimensions using equations 5 and 6. 6: If time min is less than time max, there is a potential conflict between the aircraft whose ID is trial.ID and another aircraft whose ID is track.ID. 7: If time min is less than trial.time till, then trial.time till is updated to time min. 8: end if 9: All trial records in each PE are passed along the swazzle/ring network to the next PE and the above steps are repeated to calculate trial.time till(trial, track). 10: end for 11: After 96 iterations, all trial records have been compared with all track records. The time till of each trial is its soonest collision time with another track. 56

Figure 10: Conflict detection

The following formulas are used in Algorithm 5:

|trial.X − track.X | − 3 min x = c t (1) |trial.Vxc − track.Vxt|

|trial.X − track.X | +3 max x = c t (2) |trial.Vxc − track.Vxt|

|trial.Y − track.Y | − 3 min y = c t (3) |trial.Vyc − track.Vyt|

|trial.Y − track.Y | +3 max y = c t (4) |trial.Vyc − track.Vyt|

time min = max{min x, min y} (5)

time max = min{max x, max y} (6)

In the AP solution, each IFR record is tested with every IFR except itself and every

VFR record in one set operation that requires constant time, according to the descriptions above. So the overall AP process requires only 4, 000 operations at most, which is O(n) steps where n is the number of IFR flights in this period. On the other hand, the MIMD 57

process is O(n2) because it has IF R × (IF R − 1)/2+ IF R × V F R processes, assuming

O(1) processors can access and update needed data from the dynamic data base and work on this problem simultaneously. In the MIMD it is 4, 000 × 3, 999/2+4, 000 × 10, 000 =

40, 799, 800 which is O(n2). The AP will complete the entire set of jobs in at most

4, 000 steps due to the simultaneous execution of all jobs in each jobset. It is noted that the number does not include many operations that are needed by MIMD but not by

AP: dynamic scheduling, load balancing, data distribution, processor assignment, mutual exclusion of data access, etc.

7.6.2 Conflict Resolution

We use the SIMD architecture of CSX600 to do conflict resolution accurately and efficiently. Conflict resolution will occur as quickly as possible after conflict detection has been executed, so that each aircraft will have an updated time till value. In practice,

conflicts should be reasonably rare. First, we find which aircraft have the shortest conflict

time. This is the trial aircraft that will make the initial heading change. Next, the PEs

will evaluate in parallel different trial trajectories for the trial aircraft. These trajectories

will be created by changing the direction of the current trajectory of the trial aircraft both

clockwise and counterclockwise in increments of 5 degrees, up to a maximum increment

change of 30 degrees. The various trajectories are rotated through all processors using the

ring (or swazzle) network, and the conflict detection algorithm is used to determine the

longest time before each trajectory collides with another aircraft. Finally, the maximum

of the collision times of all trajectories is determined and used as the best heading change

that the trial aircraft can make. If more than one of the best heading change are found, 58

the one with the minimal degree is used. If two have the same degree changes (e.g., +10

and −10 degree change), one of them is arbitrarily chosen. Although conflicts cannot be

fully resolved theoretically, it runs very well in our simulation and during our tests, all

conflicts are resolved after several rounds. In an actual implementation, any unresolved

conflicts could be resolved by changing the altitude of a plane that still has a conflict

after Algorithm 6 is executed.

The algorithm for conflict resolution is described in Algorithm 6. The input data are the tracks and trial records in PEs. Each PE can have 1 to 17 tracks and trial records.

Algorithm 6 is based on the CSX600, not an AP. The AP could handle collision avoidance

in a much simpler manner, due to its ability to do a constant-time broadcast [3–6]. For an

AP, the conflict resolution can be included as part of conflict detection. Proceeding one

aircraft at a time, each selected aircraft’s trajectory information is broadcast (in constant

time) to all other aircraft and any potential conflict is immediately corrected as follows.

The heading of the initial aircraft is altered by perhaps an increment of 10 degrees and

immediately rechecked for a conflict. This procedure continues until a conflict free path

with all other aircraft is found with the heading being altered a maximum of 30 degrees.

This method is more efficient on an AP since the broadcast of trajectory information of

one aircraft’s trajectory information can be done in constant time.

7.7 Terrain Avoidance

Terrains are lines that make a box shape that encloses a terrain height, e.g., a TV

tower is a 1.0 by 1.0 nm box with a height equal to 3, 100 feet. All terrains and tracks

are entered in each PE. Terrain avoidance is as challenging as the report correlation and 59

Algorithm 6 Algorithm for Conflict Resolution 1: In parallel, find the minimum time till of all trial records sequentially in each PE. 2: Compute the global minimum time till by taking the minimum of the local time till value in each PE. 3: Use the pick-one command to select a processor whose local time till is equal to the global time till, and transfer the trajectory record of an aircraft in this PE whose time till is minimal to the mono memory. 4: Create array projectedpath[11] in mono memory, copy the best trial aircraft’s ID and positions to each of the records of projectedpath[11], initialize collision time, the soonest collision time with other aircraft to ∞. 5: Each of the projectedpath[i] represents a path where the best aircraft makes a dif- ferent heading change from left to right 5 to 30 degrees in increments of 5 degrees, alternating between left and right (e.g., 5, −5, 10, −10 degrees, etc). This will allow numerous different possible paths for the trial aircraft to be evaluated in parallel. 6: Transfer the projectedpath[0], ··· , projectedpath[11] from mono memory to the PEs in parallel with each PE receiving one projectedpath[i]. 7: for i =1 → 96 do 8: Each projectedpath is compared to all aircraft records in each PE with a different ID (so that the trial aircraft will not compare to itself) unless their altitudes differ by more than 1000 feet. 9: Calculate minimum and maximum intersection times in x and y dimension using the earlier conflict detection equations 1, 2, 3 and 4. 10: Use equations 5 and 6 to get time min and time max. If time min is less than time max, there is a potential conflict between the aircraft whose ID is the trial ID and this path. 11: Check whether time min is less than the collision time of this projectedpath, if so, this projectedpath.collision time is updated to time min. If no conflict is found, the time is not updated and nothing needs to be done. 12: The projectedpath records in each PE are then passed to the next PE along the ring network to compare with the records in the neighbor PE. 13: end for 14: After 96 iterations, all projectedpath records have been compared to all the other aircraft. 15: Find the maximum projectedpath.collision time across all PEs. This path is the best scenario. 16: Change the best trial aircraft’s x and y velocity according to the best scenario path, display the resolution advisory and change the flight plan in the host. 60 tracking task. The terrain avoidance algorithm in Algorithm 7 is similar to the conflict detection algorithm. The challenge is the computational intensiveness and we use the

CSX600 architecture again to speed up this computation.

Algorithm 7 Algorithm for Terrain Avoidance 1: for i =1 → 96 do 2: if for each terrain and track record in each PE, the track record’s height is lower than the terrain record’s then 3: Project the track record’s position to 2 minutes in the future, add 1.5 to each x and y edge of the future position to provide a 3.0 minimum miss distance. The terrain records are 1.0 by 1.0 nm boxes. 4: Calculate the minimum and maximum intersection times in both x and y di- mensions. 5: Set time min to be the larger of the two minimum intersection times in both x and y dimensions in step 4. Similarly, set the record time max to be the larger of the two maximum intersection times in both x and y dimensions in step 4. 6: if time min < time max then 7: There is a potential conflict between the track and the terrain. 8: end if 9: end if 10: All track records in each PE are passed to next PE and steps 2 to 8 are repeated. 11: end for 12: After 96 iterations, all track records have been compared with all terrain records for terrain avoidance.

7.8 Final Approach (Runways)

The final approach task is to optimize runway usage. Each flight has a flight plan that specifies its departure terminal, planned departure time, its destination terminal, and planned arrival time. The runways that occur in the region being managed by an

ATC system could be distributed among the processors. Each processor manages the information for the runways assigned to it. Here, we assume that there are 96 runways in the sector being managed by this ATC system and assign one runway to each processor.

The PE assigned to a runway collects aircraft departure times from this runway and arrival times to this runway and sorts the times. The PEs will instruct the aircraft to 61

increase or decrease their speed to optimize runway usage and also optimize fuel cost.

The last step is currently done by controllers manually.

7.9 Comparison of AP and MIMD solutions for ATC Tasks

We assume that there is a different PE for each flight and that the data for each of the n flights is stored in its PE. We apply static scheduling in our AP solution to the

ATC problem, which is fundamentally different from the heuristic scheduling normally used in MIMD solutions [3–5]. The reasons that static scheduling is possible are: the simultaneous execution of thousands of jobs in each jobset, the wider data access band- width (200 to 500 times increase), and the ability to predict the worst case execution time for the AP, which have been explained in section 2.4. However, like most other real-time systems, the ATC problem cannot be efficiently scheduled using MIMD. One of the main reasons is that an MIMD approach uses dynamic scheduling to schedule all the tasks. The data keeps changing, so the MIMD system cannot determine, a priori, how many records will actually participate in the computation of a task or how much actual computation time a task in a cycle will require. In the AP, since tasks are executed as jobsets, there is no need to differentiate between a jobset with one record and a jobset with thousands of records, since both will require the same amount of time. Thus, the number of records in a jobset is simply a ”don’t care” parameter. All set operation times are based on the worst case assumption. This makes it possible to schedule ATC tasks statically in the AP solution.

Table 6 shows the comparison of the complexities of AP and MIMD solutions for

ATC tasks [4, 5]. It is noted that we do not include the MIMD software for dynamic 62

Table 6: Comparison of Required ATC Operations

Tasks MIMD AP Report Correlation & Tracking O(n2) O(n) Track Smooth and Predict O(n) O(1) Flight Plan Update and Conformance O(n) O(1) Cockpit Display O(n2) O(n) Controller Display Update O(n2) O(n) Aperiodic Requests(200/sec) O(n) O(1) Automatic Voice Advisory O(n) O(1) Terrain Avoidance O(n2) O(n) Conflict Detection O(n2) O(n) Conflict Resolution O(n2) O(n) Final Approach O(n2) O(n) Coordinate Transform O(n) O(1)

scheduling, load balancing, data management overhead, etc, and we assume that the

number of processors in MIMD is much smaller than the number of PEs in AP.

Furthermore, the MIMD solutions include costs of running a dynamic scheduling algorithm, resource sharing due to concurrency, processor assignment, data distribution, concurrency control, mutual exclusion of data access, indexing and sorting, etc [6]. These costs not only complicate ATC software and algorithm design, but also dramatically increase the size, complexity, and processing time of the system. On the other hand, our

AP approach provides a much simpler solution to the ATC problem: the static scheduling assures the predictability of system performance, and the design procedure has been greatly simplified by the table-driven scheduler shown in Table 2 and Figure 7. The static schedule is both reliable and adaptable and can easily be modified to incorporate system changes as needed. 63

7.10 Implementation of Algorithm Issues

One issue that needs attention is the implementation of these algorithms on real machines. Why does conflict detection have 96 iterations? Why are there 96 runways in

final approach? If we replace 96 runways with 100, will our approach still outperform?

Since our ClearSpeed accelerator has 96 processors on a SIMD chip, the execution of

ATC tasks is faster if at most one record of each type is stored on each processor. For example, if there are at most 96 aircraft being tracked at any time, then these can be processed simultaneously in data parallel fashion. However, if there are 100 planes being tracked, then at least 4 processors will have to manage the records for two planes each.

For example, when executing the tracking algorithm, all planes with two records will first check their first plane’s location against a radar location and then the four PEs managing two aircraft records will check their second plane’s location against the radar location.

During the time the 4 processors check their second plane’s location against the radar location, the other 92 processors without a second aircraft record are inactive. The result is that having even one processor contain two aircraft records will about double the time required to process the aircraft records. Likewise, having two runway records stored on one processor will typically double the amount of time to process the runway records. It is more efficient to have at most one record of each type on each processor so that all records of the same type can be simultaneously processed at the same time. Of course, if the processor speeds are fast enough and they have sufficient memory, then it may be reasonable to assign more than one record of each type to processors. More generally, if the maximum number of aircraft records in the processors is k then the time to execute 64 the tracking algorithm will be about k times larger than if there were at most 1 record per processor. As a result, when a new aircraft record is created, this new record should be assigned to a processor with a minimum number of current records of that same type.

This can be easily and efficiently accomplished using the AP associative function MIN.

As each processor can keep the maximum number of each type of records it stores and the constant time functions can be used to quickly locate a processor with a minimum number of records of that type where the new record can be stored. CHAPTER 8

Experimental Results

This chapter describes the results of a set of preliminary experimental results that were conducted to achieve four different goals. First, the experiments provide a proof-of- concept for the proposed ATC system implementation based on the CSX600. Second, the performance and scalability of the proposed approach will be evaluated by performing a comparison between the CSX600-based implementation and the fastest host only version using OpenMP [24] on a state-of-the-art multiprocessor server system featuring a total of

8 system cores. Similar experiments using OpenMP have been used in image processing algorithms on GPUs [69]. Third, these experiments will provide some initial evidence for the claim that the proposed AP-based ATC system implementation exhibits greater efficiency and a huge increase in the degree of predictability when compared to the

MIMD-based solution. Fourth, we show that our model of ATC that consists of eight selected tasks can meet the deadlines for the hard real-time ATC tasks.

8.1 Experimental Setup

We are creating a solution for a prototype of ATC since our implementation cannot manage the number of aircraft in a real-world situation. In order to have information about flights that we can control, we simulate the real-world situation by generating aircraft flights in a three dimensional airspace of 1, 024 by 1, 024 nautical miles(nm) and 1, 000 to 10, 000 feet in altitude. The initial positions and realistic velocities of the 65 66

aircraft are generated randomly and trajectories of aircraft consist of a constant velocity

mode and a coordinated turn mode. A random amount of error is added to the location

of each plane to create the radar reports. Since we control this entire process, we can

generate different numbers of aircraft to test algorithms and test the limits on the number

of aircraft that can be processed fast enough to meet deadlines. Unlike live FAA flight

data, when two aircraft are on a collision course, we can alter the flight path of one

aircraft to eliminate this problem.

The emulation of the AP solution of ATC uses the ClearSpeed CSX600 accelerator board and is implemented using the Cn language as described in Section 5.1. The al- ternative implementation was on an Intel Dual Processor Xeon E5410 Quad Core 2.33

GHz system with 32 GB of main memory and 2 × 6Mb of L2 cache (for each CPU). This system has a total of 8 cores. The implementation was done in C using the gcc compiler version 4.1.3. Both machines’ power are similar because their peak performance in doing some computations [46] are almost the same, e.g., Mandelbrot set [43, 83]. So our comparison is fair. In this section, MIMD will usually be used to refer to this specific machine. Occasionally, it will refer to a generic MIMD, but these instances should be clear from the context. A single threaded implementation on a single core of the multicore is called STI. We denote the multi-threaded implementation on the multicore by MTI. We used the compiler optimization flags ’-mfpmath=387,sse -msse3

-ffast-math -O3’ to enable Intel’s Streaming SIMD Extensions (SSE) for both the single- threaded as well as the multi-threaded version of the code. We have carefully tuned the

OpenMP codes to achieve a good performance, e.g., avoid false sharing, cache ping pong, maximize parallel regions, ensure the efficient use of cache, avoid poor load balance and 67 high contention problems, etc. The multi-threads were specifically designed to avoid locking whenever possible. Additional information about the techniques used to obtain highly efficient OpenMP programs are given in Section 2.5.

8.2 Experiment Results

8.2.1 Comparison of Performance on STI, MIMD and CSX600

We scheduled the tasks of report correlation and tracking, terrain avoidance and

CD&R on both machines. The report correlation and tracking is executed every 0.5 second, the terrain avoidance every 1 second, and the CD&R task every 2 seconds. The deadline for each task occurs at the end of each of its periods. None of the tasks can start their executions before their release times and all of them must complete their executions before their deadlines. Each approach was executed for a varying number of aircraft, ranging from 96 to 1824 in increments of 96. For each number of aircraft in the given range, each approach was executed for 100 times. The major period is 2 seconds and contains one CD&R period, two terrain avoidance periods, and four report correlation and tracking periods. If any of the three tasks misses its deadlines during a major period, we say that the execution has missed its deadline during this period.

We execute the three tasks under all three approaches, CSX600, MTI and STI.

The comparisons of maximum execution times are shown in Figure 11, Figure 12 and

Figure 13. From these results we can see that MTI takes more time than the CSX600 and the additional time required for the MTI implementation increases more rapidly than the CSX600 as the number of aircraft increases. This is because MIMD solution has more overhead, more costs in inconsistency, dynamic scheduling and load balancing etc. 68

 !"#$ !#$'%( &  !"#$ % 



      



  )  )  01 2#342!23"

Figure 11: Timing of tracking task.

  !"#$ % !"#$  

  &  !#$'%(        



  )  )  01 2#342!23"

Figure 12: Timing of terrain avoidance task.

Because there is too much difference between the data of STI and the other two, it is very

hard to distinguish the bottom two curves in these three figures. All of these graphs show

a significant speedup of the performance of the MTI implementation over the sequential

STI implementation so the parallel code for MTI is well-written. Section 8.2.2 shows the

difference between MTI and SIMD more clearly in its three figures.

8.2.2 Comparison of Performance on MIMD, CSX600 and STARAN

Timing results for tracking and correlation, CD&R, terrain avoidance and display processing tasks for the STARAN are given in [4, 5, 11]. It only takes 0.11 second to

execute the tracking task for 2, 000 aircraft for the STARAN. The AP in the following

four figures stands for STARAN specifically. Figure 14 compares timings of the tracking

task for the MIMD, CSX600 and STARAN. It only takes 0.1 second to execute terrain avoidance task for 2, 000 aircraft in the STARAN. Figure 15 compares timings of terrain 69

 !"#$$&  "#'%(   !"#$%   

          012 "3$42) 23!  ) 

Figure 13: Timing of CD&R task.

 !"#$%& " & "#%&'(  !"#$%& " 

    



     )0"1 %2 1#12$

Figure 14: Comparison of tracking task of MTI, CSX600 and STARAN.

avoidance task for the MIMD, CSX600 and STARAN. It only takes 0.28 second to execute

CD&R task for 2, 000 aircraft in the STARAN. Figure 16 compares timings of CD&R task for the MIMD, CSX600 and STARAN. It only takes 0.16 second to execute display processing (controller display update) task for 2, 000 aircraft for the STARAN. Figure 17 compares timings of this task for the MIMD, CSX600 and STARAN. Although we do not have a modern AP system to test, with today’s advanced architecture, we believe that it will execute the ATC tasks more quickly and accurately and at least as predictably, compared to the STARAN.

The comparisons between MTI and CSX600 in Figures 14, 15, 16, and 17 are fair, as both are small systems and are doing the entire job. However, the comparisons between the CSX600 and the STARAN are not fair, as the STARAN has only one aircraft per PE, while the CSX600 has an increasing number of aircraft. The CSX600 curves in the three graphs above make it appear that this system is more like a MIMD than STARAN, due to the fact that the number of records per processor is increasing. The next section 8.2.3 70

 !"#$%& !"#$%& " & "#%&('  !"#$%& "' 

    



     )0"1 %2 1#12$

Figure 15: Comparison of terrain avoidance task of MTI, CSX600 and STARAN.

 %&'!#!'!#$&')("#$  !"#$ %&'!#( 

    

 



       0 1#2!&3!2$23%

Figure 16: Comparison of CD&R task of MTI, CSX600 and STARAN.

8.2.3 How close to the AP is our CSX600 emulation of the AP

This section illustrates how closely the performance of our CSX600 emulation of the AP matches the performance of the AP itself. Figures 18 and 19 show the timings of the tracking and CD&R tasks as the total number of aircraft increases from 10 to 384, i.e., as the number of aircraft per PE on the CSX600 ranges from 1 to 4. We can see that the execution time is basically linear while the maximum number of aircraft per PE does not change, and that there is a gap whenever the maximum number of aircraft per PE increases. Moreover, the slope of each line segment is reasonably close to the slope of the STARAN curve over that interval. These results show that our CSX600 emulation closely tracks the STARAN, and that it emulates the STARAN solution best when the maximum number of aircraft in each PE is 1.
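Since the CSX600 used here has 96 PEs (consistent with the 96-aircraft increments used above), the maximum number of aircraft per PE for a total of N aircraft is, with notation introduced only for this remark,

\[
k(N) = \left\lceil \frac{N}{96} \right\rceil ,
\]

so k(N), and with it the measured execution time, jumps each time N crosses a multiple of 96; this produces the gaps visible in Figures 18 and 19.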

As explained in Section 8.2.2, the comparison between the STARAN and the CSX600 is unfair, so we provide a fairer way to compare them. We use the CSX600 to emulate a Super-CSX600 with enough processors to match the upper bound on the number of planes, so that there is at most one aircraft per PE. The CSX600 still pays a price in emulating the Super-CSX600, because it must perform extra moves between mono memory and parallel memory.

Figure 17: Comparison of display processing task of MTI, CSX600 and STARAN.

Figure 18: Time of tracking for number of aircraft from 10 to 384.

The CD&R execution time for the Super-CSX600 is obtained by adding the non-parallel computation time to the quotient of the parallel execution time divided by the maximum number of aircraft per PE. The results in Figure 20 show that the Super-CSX600 is basically linear and reasonably similar to the STARAN, but its cost increases faster due to greater overhead, such as data transfers between mono memory and the processors, the non-constant cost of the associative functions, and hardware disadvantages relative to the STARAN.
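Restating this estimate symbolically, with symbols introduced only for this note (T_mono is the non-parallel, mono-processor computation time, T_poly is the measured parallel execution time, and k is the maximum number of aircraft per PE):

\[
T_{\mathrm{Super\text{-}CSX600}} \approx T_{\mathrm{mono}} + \frac{T_{\mathrm{poly}}}{k} .
\]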

There is a significant difference between the AP and the Super-CSX600, but it is dwarfed by the difference between their curves and the MTI curves.

Figure 19: Time of CD&R for number of aircraft from 10 to 384.


Figure 20: Comparison of MTI, STARAN and Super-CSX600 with at most one aircraft per PE.

8.2.4 Comparison of Predictability on CSX600 and MIMD

The preceding results demonstrate that the performance of the CSX600 is better than that of STI and MTI. We next focus on the predictability of the execution times, which is an important factor in guaranteeing that all of the ATC processing can be performed within the specified time bounds. We measure the Coefficient of Variation (COV), a common normalized measure of dispersion defined as the ratio of the standard deviation to the mean. The smaller the COV, the lower the variance and hence the higher the predictability. Unlike the standard deviation, the COV is dimensionless, so we consider it a better tool than the standard deviation for comparing predictability. The COV values for the three tasks are shown in Figures 21, 22 and 23; note that the y-axis in these figures uses a logarithmic scale. The results clearly show that the COV values for the CSX600 are several orders of magnitude below those for MTI. In addition, the increased variability of the MIMD implementation is not compensated for by its running time.
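As a concrete illustration of the measure just defined (a minimal sketch of the computation, not the measurement harness actually used in these experiments), the COV of a set of observed execution times can be computed as follows:

    /* Coefficient of Variation (COV) = standard deviation / mean.
     * A smaller COV means less dispersion and hence higher predictability. */
    #include <math.h>
    #include <stdio.h>

    static double cov(const double *t, int n) {
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < n; i++) mean += t[i];
        mean /= n;
        for (int i = 0; i < n; i++) var += (t[i] - mean) * (t[i] - mean);
        var /= n;                        /* population variance over the n runs */
        return sqrt(var) / mean;
    }

    int main(void) {
        /* hypothetical execution times (seconds) from repeated runs of one task */
        double times[] = {0.0055, 0.0056, 0.0055, 0.0057, 0.0054};
        printf("COV = %g\n", cov(times, 5));
        return 0;
    }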


Figure 21: Predictability of execution times for report correlation and tracking task.


Figure 22: Predictability of execution times for terrain avoidance task.

As shown in Figures 14, 15 and 16, the MIMD running time increases much more rapidly than that of the CSX600 and the AP. Again, this is because of the advantages of the AP over MIMD discussed in Section 4.2. These observations hold not only below 8,000 aircraft but also at larger scales, such as 14,000 aircraft.

8.2.5 Comparison of Missing Deadlines on CSX600 and MIMD

Finally, we compare the number of major periods that missed their deadlines for both the CSX600 and MTI. The scheduling was described at the beginning of this section. The results are shown in Figure 24. As the number of aircraft increases, the number of deadlines missed by the MTI execution increases dramatically, while the CSX600 does not miss a single deadline in any major period.

8.2.6 Timings for 8 Tasks

In this section we show that the ClearSpeed system can perform all of the required tasks in our ATC prototype and meet all deadlines in each major period. Table 7 shows the performance with one flight per PE, which is the scenario closest to the AP.


Figure 23: Predictability of execution times for CD&R task.


Figure 24: Number of iterations missing deadlines when scheduling tasks.

The execution time (seconds) is the time spent to execute the task once. The processing time (seconds) is the total time spent on the task during an 8-second major period. We can see that all tasks complete within their deadlines. The total time used is 0.25508 seconds, which is only 3.19% of the available 8 seconds.

Table 8 lists the worst-case timings of the eight tasks for 10 aircraft per PE running on the CSX600, which is the maximum number that can be scheduled because each PE has only 6K of memory.

Table 7: Performance of One Flight/PE

  Task                               Exec Time (seconds)   Proc Time (seconds)
  Report Correlation & Tracking      0.00552               0.08832
  Cockpit Display                    0.00272               0.02177
  Controller Display Update          0.00276               0.02209
  Sporadic Requests                  0.00155               0.01244
  Automatic Voice Advisory           0.00544               0.01088
  Terrain Avoidance                  0.00782               0.0782
  Conflict Detection & Resolution    0.01301               0.01301
  Final Approach (96 runways)        0.00837               0.00837
  Total                                                    0.25508
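The processing-time column appears to be the execution time multiplied by the number of times the task is released within the 8-second major period; this reading of the tables is our interpretation. For report correlation and tracking, which runs every 0.5 second and hence 16 times per major period,

\[
T_{\mathrm{proc}} = \frac{8\ \mathrm{s}}{\mathrm{period}} \times T_{\mathrm{exec}},
\qquad 16 \times 0.00552\ \mathrm{s} = 0.08832\ \mathrm{s},
\]

which matches the entry in Table 7 (and, similarly, 16 × 0.3409 s = 5.4544 s in Table 8 below).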

Table 8: Performance of Eight Tasks for Ten Flights/PE

  Task                               Exec Time (seconds)   Proc Time (seconds)
  Report Correlation & Tracking      0.3409                5.4544
  Cockpit Display                    0.1705                1.364
  Controller Display Update          0.1825                1.46
  Sporadic Requests                  0.0987                0.7896
  Automatic Voice Advisory           0.1586                0.3172
  Terrain Avoidance                  0.2386                0.2386
  Conflict Detection & Resolution    0.5182                0.5182
  Final Approach (96 runways)        0.2598                0.2598
  Total                                                    10.4018

The total processing time is 10.4018 seconds, while the maximum time typically required by ATC is 10 seconds [2, 4, 5, 17, 68]. Since a 10.4-second cycle is only slightly larger than a 10-second cycle, the performance results are relatively good and demonstrate the scalability of this implementation on the CSX600. We can conclude that the CSX600 implementation can meet all deadlines that can be statically scheduled.

The ClearSpeed CSX700 provides a larger SIMD system with 192 processors and should be able to double the number of aircraft supported by the CSX600. With a faster 250

MHz clock speed, the CSX700 should either meet or come close to meeting a 10 second deadline for execution of the major period.

The reason that we use 96 in Table 3 is that 96 is the maximum we can use without increasing the CSX600 running time of this task by a factor of about 2. Also, this larger number demonstrates the advantage of using a SIMD for a more computationally intensive problem and should be enough to include all of the runways in airfields under the control of an ATC center. The bound of 100 runways is used in both [4] and [5].

We have also tested executions with 10 runways, which is more realistic for an airport-sized application (as opposed to an entire ATC sector); the result is almost the same. However, since the CSX600 has low power consumption, small size, and low weight, this is not a big waste of resources.

8.2.7 Video and Actual STARAN Demos

Under a contract with the FAA, Goodyear Aerospace gave a demonstration in 1971 in Knoxville, TN. This unit used a content-addressable memory built with magnetic wire to demonstrate automatic tracking, conflict detection and resolution, terrain avoidance, and automatic voice advisory using only about 4500 instructions. This unit was a predecessor to the STARAN.

As a result of the successful performance of the experimental magnetic SIMD at Knoxville, Goodyear developed a production AP named STARAN. The STARAN system had array modules containing 256 PEs and 1028 bits of storage per PE. The system was programmed by five people in five months and consisted of 4,017 AP assembly instructions and about 1,600 instructions for the control processor [13]. It was delivered for demonstration in 1972 at the International Air Exposition at Dulles Field, and performed as a control center processor. The Edwards AFB radar was used as radar input.

All radar data was tracked automatically.

A video made from the 16 mm film taken in 1972 at the International Air Exposition at Dulles Field is available for viewing at [88]. The video shows automatic primary and beacon radar tracking, conflict detection and resolution, and display processing. The demonstration uses 256 flight plans to develop the primary and beacon radar reports.

Initially, a ten-second major period was used, but later the system clock was sped up to process major periods in one tenth of a second. Speeding up the system clock to 100 times real time is the equivalent of handling 25,600 tracks in real time. Both the real-time run and the 100-times speedup are captured on the video. What was demonstrated by the STARAN in 1972 cannot be done by any ATC system in the world today.

CHAPTER 9

Observations and Consequences of the Preceding Results

This chapter focuses on the theoretical aspects of this work, i.e., the contributions of our AP approach to the bigger picture of high performance and parallel computing. We will show that the idea of using an AP to solve ATC problems is useful for many other data-intensive problems, and that computational methods in many other areas can benefit from it.

A recent paper in IEEE Computer [89] observes that the widely accepted RAM model for sequential computation is responsible for the huge progress that has occurred in sequential computation over the past 60 years, and that a single widely accepted model for parallel computation could result in similar successes in parallel computation.

It describes several features that this parallel model and its target architectures should satisfy, including encouraging parallel solutions to applications that are easy to use, energy efficient, scalable, and highly portable. The pitfalls identified as desirable to avoid in software production are data dependences, race conditions, load balancing, nondeterminism, false sharing, deadlocks, and non-predictability, while paradigms that support the construction of easy-to-use, productive, energy-efficient, scalable, and portable parallel applications are desirable. Unfortunately, most programmers do not have a good understanding of the latest developments in hardware design, which is an essential skill for multicore and particularly many-core programming.


Sequential programming is a deterministic and predictable process that arises intuitively from the way programmers solve problems using algorithms. Parallel programming should be as simple and intuitive as sequential programming. General programmers should not be required to have an in-depth understanding of the latest developments in hardware design in order to write efficient programs. The current complexity of designing high quality parallel programs makes the process of redesigning and rewriting applications both time-consuming and expensive. The result is that most legacy applications are still waiting to be parallelized. These and related issues are discussed further in [89].

AP programming has all of the advantages mentioned above; moreover, it is very easy for programmers because there is only a single instruction stream at any time. It is unnecessary to worry about the extra work required by MIMDs, so AP programming is not as error prone as multicore and particularly many-core programming. Since few SIMDs have been built since the 1980s, most of today's computer professionals have very limited knowledge about how to design software, algorithms, or programs for SIMDs. Essentially none of today's computer professionals even know what an AP is. They do not think in terms of SIMD or AP solutions to problems, and are unaware of how to solve problems using these machines.

The associative computational model ASC introduced in [15] was designed to support algorithm development for the AP and satisfies all of the above desired characteristics.

While the name ASC in this paper originally applied to both APs and multi-APs (which allowed multiple instruction streams), the name ASC was later restricted to the single instruction stream AP parallel computer and the model for multi-APs was called MASC.

Comparisons between the power of these models and other well-known computational

models are given in [42,90,91]. A possible implementation of a multi-threaded associative

processor is also investigated in [79].

Next, we consider whether the AP is primarily useful in solving only ATC-type problems, or whether it can also solve a wide range of other problems. A language (also called ASC) with a primer, compiler, and Windows emulator has been designed for ASC [41, 92]. Many algorithms have been designed for the ASC model, and software projects have been developed for the AP platform (often using the ASC language), in order to demonstrate the versatility of the ASC model and the AP system. These include graph algorithms [15, 41, 93], convex hull [94–98], string matching [99, 100], sequence alignment [101, 102], NP-complete problems [103, 104], database management [105, 106], parallel compilation [107–109], support for functional and logic-based languages [110–112], support for expert system languages [113–115], and computer architecture [76–81]. In [11], Batcher lists fast Fourier transforms, sonar post-processing, string search, file processing, air traffic control, image processing, data management, position locating and reporting, bulk filtering, and radar processing as STARAN applications, and for the first five of these, provides a comparison of the STARAN's performance with several other systems. Additional applications are discussed in [41]; the MPP was specifically designed for and used in image processing. This wide range of AP algorithms and software demonstrates the versatility of the AP in solving a wide range of applications.

The extra problems that MIMDs have to solve as part of their solution to the original problem require a lot of extra work. In addition to the extra problems discussed earlier, the work that MIMDs must do to support the integrity and the ACID properties of a database is substantial. It requires additional software support that typically involves many of the same types of support software discussed

earlier such as synchronization, dynamic scheduling, critical sections of code, sorting,

indexing, etc. Also this extra work may involve a substantial increase in communications.

All of these problems become much more difficult to handle when the database is dynamic,

as records typically have to be added or transferred frequently, and massive updates occur

frequently, e.g., every 0.5 second. In contrast, the AP (and SIMDs in general) stores all information about an object such as an aircraft in the local memory of the processor this object is assigned to. This processor executes essentially all of the computations regarding that object, so updates to the database involving this object only require changes to the local memory of that processor.

The graphs in Chapter 8 of this paper show that the magnitude of this extra work dramatically increases the time required by MIMDs on ATC-type problems. However, there is nothing highly unusual about the solution to ATC problems except the hard deadlines. While the heavy real-time nature of the ATC problem and the extremely dynamic database that must be maintained exacerbate the extra work that MIMDs are required to do, they primarily increase the magnitude of that work. It is still the case that, for most non-trivial applications, MIMDs must use a huge amount of extra software and do a huge amount of extra work when compared to

SIMDs.

The hardware cost of building a large AP should be much lower than that of building a MIMD capable of solving the same problems, due to the simplicity of the AP hardware. While many applications may be reasonable to solve on a MIMD, for important large-scale problems for which a MIMD solution would be expensive and challenging to develop, it may be considerably cheaper and much more satisfactory to build an appropriately sized AP to handle the problem. The history of the ATC problem provides an excellent justification for considering this alternate approach. The ongoing

MIMD ATC solution is widely known for its repeated shortcomings, in spite of the fact that a very large-scale project to redevelop this system has been launched about once every ten years for the past 50 years.

CHAPTER 10

Conclusions and Future Work

In this dissertation, we provided an efficient and scalable solution for Air Traffic

Control (ATC) problems on the associative processor (AP) system, emulated on the ClearSpeed CSX600. We established the feasibility of this solution by showing that it works when mapped onto the ClearSpeed CSX600 emulation of the AP, and hence will work on an AP system. Even without many of the details of the earlier AP solution, we were able to recreate a large portion of that implementation.

The contributions of this paper are summarized as follows. First, the AP implementation has much better scalability and efficiency than the MIMD implementation. This implementation used static scheduling, which is possible due to the deterministic SIMD architecture. The AP ATC software also avoids the inclusion of solutions to the following types of problems (or calls to library functions which solve them): load balancing, shared resource management, synchronization, management, sorting, indexing, cache and memory coherency management, false sharing, priority inversion handling, race conditions, data dependencies, and deadlocks. In particular, this solution avoids the use of solutions or approximate solutions to any of the numerous multiprocessor NP-hard problems of the type given in [8]. The solution used is very similar to a sequential solution for this problem, both in style and in code size. As a result, the size of this software solution is only a very small fraction of the size of the MIMD solutions to the ATC problem. This results in a dramatic drop in both the

cost of creating and the cost of maintaining this software when compared to the MIMD solutions

that have been given to the ATC problem.

Second, the proposed AP solution will support accurate and meaningful predictions of worst case execution times and will guarantee all deadlines are met. In contrast, MIMD systems optimize average case running time and have highly unpredictable worst case running time. These contributions can provide major help in meeting the goals of FAA’s

NextGen Plan: fly more aircraft, more safely, more precisely, more efficiently and use less fuel [17].

Third, the AP hardware easily scales to handle larger problem sizes, either by increasing the maximum number of records stored in each processor (which slows down the run time) or by building a larger AP computer with more processors. Based on the simplicity of the architecture of the STARAN and ASPRO, the only APs that have been built, a larger AP system should be both easier and much cheaper to build than a MIMD system of comparable size. The CM-2 Connection Machine was a SIMD computer built by Thinking Machines in the late 1980s that had over 64K processors, and Paracel developed a parallel processor, generally believed to be a SIMD, that had one million processors. It therefore seems reasonable to expect that an AP with a million processors could be built today. Fourth, Validation and Verification (V&V) of this software is simpler than for current MIMD software.

The ATC problem has requirements similar to most embedded real-time problems with periodic tasks and hard deadlines, such as military command and control, autonomous driving, video surveillance, image processing, simulations of complex physical systems (e.g., weather forecasting, molecular modeling), and massive data processing. The AP's ability to handle the ATC problem easily would also enable it to handle many other real-time problems, both large and small. As a result, this research is relevant not only to the ATC problem but also to numerous other important applications that involve real-time problems with hard deadlines. For example, command and control systems such as an air defense system would be natural candidates, and other embedded real-time systems are candidates as well.

The ATC problem is a large dynamic database problem. The AP excels at handling ATC because it was designed to handle real-time dynamic database activities rapidly.

For example, locating records with a particular property, reading a value from this record, changing a value in this record, and determining whether a record with a certain property exists are actions that can be done in constant time. By use of the flip network, large pieces of records can be moved into parallel memory or copied from parallel memory in constant time, making it possible to enter and to ship these records elsewhere rapidly.

The AP’s capability of handling dynamic databases makes it very useful for numerous other applications.
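As a rough illustration of the record operations just described, the sketch below uses an ordinary C array to stand in for the PEs' local memories; it is not ASC or Cn code. On the AP, the comparison, the update, and the responder count are broadcast once and carried out by all PEs simultaneously, which is why these operations take constant time regardless of the number of records.

    /* Each "PE" holds one aircraft record in its local memory (one flight
     * per PE).  The loop below plays the role of the PEs acting in parallel
     * under a responder mask. */
    #include <stdio.h>

    #define NUM_PE 96                   /* illustrative PE count */

    typedef struct {
        int   id;                       /* aircraft identifier           */
        int   active;                   /* is this record in use?        */
        float altitude;                 /* one field of the track record */
    } Record;

    static Record pe_mem[NUM_PE];       /* stand-in for PE local memories */

    /* "Associative" search-and-update: every PE compares its record with the
     * broadcast key; matching PEs update locally and raise a responder flag. */
    static int update_altitude(int id, float new_alt) {
        int responders = 0;
        for (int pe = 0; pe < NUM_PE; pe++) {
            if (pe_mem[pe].active && pe_mem[pe].id == id) {
                pe_mem[pe].altitude = new_alt;   /* update stays local */
                responders++;
            }
        }
        return responders;   /* "any responders" answers the existence query */
    }

    int main(void) {
        pe_mem[3] = (Record){ .id = 4711, .active = 1, .altitude = 31000.0f };
        printf("responders: %d\n", update_altitude(4711, 33000.0f));
        return 0;
    }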

A natural application area is controlling fleets of Unmanned Aircraft Systems (UASs).

In fact, the ATC system developed here with the ClearSpeed CSX600 could be expanded to manage flight control for a small fleet of 480 to 960 UASs (depending on the number of tasks performed and the size of the records stored, etc.) in applications such as: (1) patrolling areas such as international borders and reporting unusual activities; (2) early identification of forest fires and, during actual fires, maintaining updated information about regions that are burnt, threatened, etc.; and (3) surveillance of agricultural crops and

performing functions like spraying when needed, turning water on at various locations

when plants need water, etc. The ClearSpeed CSX700 provides a larger SIMD system

with 192 processors and a roughly 20% faster processor clock, which should more than

double the capabilities of the CSX600. As ClearSpeed SIMD accelerators are well-known

for their very small power consumption, size, and weight, it should be possible to build

an AP that also has similar characteristics. For example, it should be possible to build

an AP that requires less power than a light bulb and has a sufficiently low weight that

would enable it to be easily deployed in the field and still be able to control 1000 UASs in

the immediate vicinity. Since UASs will soon be permitted to enter airspace controlled

by FAA, many avionics experts expect the total aircraft in the skies to rapidly increase

in numbers as the anticipated civilian use of UASs explodes. Since the demands on our

ATC system are likely to increase at a much more rapid pace than in the past, the FAA

should quickly initiate an investigation into the use of APs for ATC and how to best

integrate the APs into the FAA system.

From Chapters 9 and 10 above, we know that the AP approach offers novelty and contributions for many other applications. However, we now discuss its limitations. The AP is clearly ideal for solutions that contain a great deal of data parallelism, especially at large scale, while MIMD fits distributed applications. For example, as noted at the end of Section 8.2.6, it takes essentially the same time to process 10 runways as 96. A natural way to overcome this limitation is to combine the AP and MIMD in the same system. ClearSpeed hardware has been used extensively as an accelerator for MIMD systems; in several cases, supercomputers increased their ranking in the Top500 by adding multiple ClearSpeed accelerators. MIMD processors could hand off problems that the SIMD accelerator

could compute more efficiently and perform other work while waiting for ClearSpeed to

return the solution. It may be useful to investigate whether the use of both AP and

MIMD hardware in the same system as co-equal partners could also prove to be benefi-

cial. Either system could serve as the main system, but would have the option of handing

off problems to the other system that it could not handle efficiently. Such a system would

need to be able to convert from one mode to the other efficiently or else transfer data

from one system to the other efficiently. Perhaps such combination might be useful in

building a system that could handle a wider range of problems efficiently. It also might

be useful in large systems (e.g., exascale) which would be able to handle a variety of very

large applications.

A possibly important extension to our current research would be to consider an implementation of ATC on NVIDIA GPU hardware using CUDA, which has many SIMD PE groups on its chips. The NVIDIA technology, including the latest Fermi chip, has a lot in common with the MTAP approach of ClearSpeed. Implementing the CSX600 ATC algorithms on this architecture may provide another useful platform for this project and for the types of real-time applications mentioned earlier. Furthermore, using CUDA GPUs, we can investigate whether the shortcomings or bottlenecks of the AP can be improved. For example, GPUs have evolved into highly parallel multicore systems that allow very efficient manipulation of large blocks of data, a design that is more effective than general-purpose CPUs for algorithms in which large blocks of data are processed in parallel. We will investigate whether ATC can run well using CUDA and whether the bottlenecks can be improved. Another potential project is to investigate implementing our prototype on other parallel systems, e.g., Cray systems, IBM processors, and Convex systems with FPGA reconfigurable hardware.

BIBLIOGRAPHY

[1] S. Kahne and I. Frolow, “Air traffic management: Evolution with technology,” IEEE Control Systems Magazine, vol. 16, no. 4, pp. 12–21, November 1996.
[2] M. Nolan, Fundamentals of Air Traffic Control, 3rd ed. Wadsworth: Brooks/Cole, 1998.
[3] M. Jin, “Evaluating the power of the parallel masc model using simulations and real-time applications,” Ph.D. dissertation, Department of Computer Science, Kent State University, August 2004.
[4] W. Meilander, M. Jin, and J. Baker, “Tractable real-time air traffic control automation,” in Proc. of the 14th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), Cambridge, MA, November 2002, pp. 483–488.
[5] W. Meilander, J. Baker, and M. Jin, “Predictable real-time scheduling for air traffic control,” in Fifteenth International Conference on Systems Engineering, August 2002, pp. 533–539.
[6] ——, “Importance of computation reconsidered,” in Proc. of the 17th International Parallel and Distributed Processing Symposium (IEEE Workshop on Processing), Nice, France, April 2003.
[7] W. Meilander, J. Potter, K. Liszka, and J. Baker, “Real-time scheduling in command and control,” in Proc. of the 1999 Midwest Workshop on Parallel Processing, August 1999.
[8] M. Garey and D. Johnson, Computers and Intractability: a Guide to the Theory of NP-completeness. New York: W.H. Freeman, 1979.
[9] M. N. J.A. Stankovic, M. Spuri and G. Buttazzo, “Implications of classical scheduling results for real-time systems,” IEEE Computer, pp. 16–25, June 1995.
[10] M. Jin, J. Baker, and K. Batcher, “Timings for associative operations on the masc model,” in Proc. of the 15th International Parallel and Distributed Processing Symposium (IEEE Workshop on Massively Parallel Processing), San Francisco, CA, April 2001, pp. 193–200.
[11] K. Batcher, “Staran parallel processor system hardware,” in National Computer Conference and Exposition (AFIPS74), New York, NY, May 1974, pp. 405–410.
[12] W. Meilander, “Staran an associative approach to multiprocessing,” in Multiprocessor Systems, Infotech State of the Art Reports, Infotech International, 1976, pp. 347–372.

[13] J. A. Rudolph, “A production implementation of an associative array processor - staran,” in The Fall Joint Computer Conference (FJCC), Los Angeles, CA, December 1972.
[14] W. Meilander, “Aspro-vme hardware/architecture,” June 1992, eR3418-5 LORAL Defense Systems.
[15] J. Potter, J. Baker, S. Scott, A. Bansal, C. Leangsuksun, and C. Asthagiri, “Asc: An associative-computing paradigm,” Computer, vol. 27, no. 11, pp. 19–25, 1994.
[16] “Faa grants for aviation research program solicitation no. faa-06-01,” 2011. [Online]. Available: http://www.tc.faa.gov/logistics/grants/solicitation/97solict.doc
[17] “Faa’s nextgen implementation plan,” 2011. [Online]. Available: http://www.faa.gov/nextgen/media/ng2011\ implementation\ plan.pdf
[18] “Federal aviation agency 1963 atc specifications,” 1963. [Online]. Available: http://www.cs.kent.edu/∼parallel/papers/FAA1963ATCSpecifications.pdf
[19] M. Yuan, J. Baker, F. Drews, and W. Meilander, “Efficient implementation of air traffic control (atc) using the csx620 system,” in Proc. of the 21st IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), Cambridge, MA, November 2009, pp. 353–360.
[20] S. Guy, J. Chhugani, C. Kim, N. Satish, M. Lin, D. Manocha, and P. Dubey, “Clearpath: highly parallel collision avoidance for multi-agent simulation,” in ACM SIGGRAPH/Eurographics Symposium on (SCA). ACM, August 2009, pp. 177–187.
[21] M. Yuan, J. Baker, F. Drews, L. Neiman, and W. Meilander, “An efficient associative processor solution to an air traffic control problem,” in Large Scale Parallel Processing (LSPP) IEEE Workshop at the International Parallel and Distributed Processing Symposium (IPDPS), Atlanta, GA, April 2010.
[22] M. Yuan, J. Baker, W. Meilander, and K. Schaffer, “Scalable and efficient associative processor solution to guarantee real-time requirements for air traffic control systems,” in Large Scale Parallel Processing (LSPP) IEEE Workshop at the International Parallel and Distributed Processing Symposium (IPDPS), Shanghai, China, May 2012, pp. 1682–1689.
[23] M. Yuan, J. Baker, and W. Meilander, “Comparisons of air traffic control implementations on an associative processor with a and consequences for parallel computing,” Journal of Parallel and Distributed Computing (JPDC), 2012, accepted, to appear.
[24] “Openmp website,” 2010. [Online]. Available: http://openmp.org/wp/
[25] S. Akl, Parallel Computing: Models and Methods. New York: Prentice Hall, 1997.
[26] M. Quinn, Parallel Programming in C with MPI and OpenMP. McGraw-Hill, 2004.

[27] T. Cormen, C. Leisterson, and R. Rivest, Introduction to Algorithms, 1st ed. McGraw Hill and MIT Press, 1990, chapter 30 on Parallel Algorithms.
[28] M. Flynn, “Some computer organizations and their effectiveness,” IEEE Transactions on Computers, vol. C, no. 21, pp. 948–960, 1972.
[29] M. Quinn, Parallel Computing: Theory and Practice. McGraw-Hill, 1994.
[30] W. Chantamas, “A multiple associative computing model to support the execution of data parallel branches using the manager-worker paradigm,” Ph.D. dissertation, Department of Computer Science, Kent State University, December 2009.
[31] M. Garey, R. Graham, and D. Johnson, “Performance guarantees for scheduling algorithms,” Operations Research, vol. 26, no. 1, pp. 3–21, Jan-Feb 1978.
[32] J. Stankovic, “Misconceptions about real-time computing,” IEEE Computer, vol. 21, no. 10, pp. 17–25, Oct. 1988.
[33] J. Stankovic, S. Son, and J. Hansson, “Misconceptions about real-time databases,” Computer, June 1999.
[34] J. Stankovic, M. Spuri, K. Ramamritham, and G. Buttazzo, Deadline Scheduling for Real-time Systems. Kluwer Academic Publishers, 1998.
[35] J. W. Liu, Real-Time Systems. Prentice Hall, 2000.
[36] G. Buttazzo, Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications, 2nd ed. New York: Springer Science, 2005.
[37] C. Murthy and G. Manimaran, Resource Management in Real-time Systems and Networks. MIT Press, 2001.
[38] M. Klein, J. Lehoczky, and R. Rakumar, “Rate-monotonic analysis for real-time industrial computing,” IEEE Computer, pp. 24–33, 1994.
[39] C. L. Liu and J. Layland, “Scheduling algorithms for multiprogramming in a hard-real-time environment,” Journal of the ACM, vol. 20, no. 10, 1973.
[40] J. Anderson, J. Calandrino, and U. Devi, “Real-time scheduling on multicore platforms,” in Proc. of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, San Jose, CA, April 2006, pp. 179–190.
[41] J. Potter, Associative Computing: A Programming Paradigm for Massively Parallel Computers. New York: Plenum Press, 1992.
[42] J. Trahan, M. Jin, W. Chantamas, and J. Baker, “Relating the power of the multiple associative computing model (masc) to that of reconfigurable -based models,” Journal of Parallel and Distributed Computing, Elsevier Publishers, vol. 70, pp. 458–466, May 2010.
[43] R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, and J. McDonald, Parallel Programming in OpenMP, 1st ed. Morgan Kaufmann, 2000.

[44] B. Chapman, G. Jost, and R. Pas, Using OpenMP Portable Shared Memory Parallel Programming. MIT Press, 2007.
[45] H. Casanova, A. Legrand, and Y. Roberts, Parallel Algorithms. CRC Press, 2009.
[46] B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall, 1999.
[47] K. Lakshmanan, S. Kato, and R. Rajkumer, “Scheduling parallel real-time tasks on multi-core processors,” in Proc. of the 31st IEEE Real-Time Systems Symposium (RTSS’10), San Diego, CA, December 2010, pp. 259–268.
[48] I. Hwang, “Air traffic surveillance and control using hybrid estimation and protocol-based conflict resolution,” Ph.D. dissertation, Stanford University, 2003.
[49] L. Yang and J. Kuchar, “Prototype conflict alerting logic for free flight,” AIAA Journal of Guidance, Control, and Dynamics, vol. 20, no. 4, pp. 768–773, July-August 1997.
[50] J. Kuchar and L. Yang, “A review of conflict detection and resolution modeling methods,” IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 4, pp. 179–189, 2000.
[51] Y. Bar-Shalom and T. Fortmann, Tracking and Data Association. Academic Press, 1988.
[52] H. Blom, R. Hogendoorn, and B. vanDoorn, “Design of a multisensor tracking system for advanced air traffic control,” in Multitarget-Multisensor Tracking: Application and Advances, Y. Bar-Shalom, Ed., vol. 2. Artech House, 1990, pp. 31–63.
[53] I. Hwang, H. Balakrishnan, K. Roy, and C. Tomlin, “Multiple-target tracking and identity management in clutter for air traffic control,” in Proceedings of the AACC American Control Conference, Boston, MA, June 2004.
[54] I. Hwang, J. Hwang, and C. Tomlin, “Flight-mode-based aircraft conflict detection using a residual-mean interacting multiple model algorithm,” in Proceedings of the AIAA Guidance, Navigation, and Control Conference, Austin, Texas, August 2003.
[55] I. Hwang and C. Tomlin, “Protocol-based conflict resolution for finite information horizon,” in Proceedings of the AACC American Control Conference, Anchorage, Alaska, May 2002.
[56] K. Liu, “Composition of kalman and heuristic tracking algorithms for air traffic control,” Master’s thesis, Department of Computer Science, Kent State University, August 1999.
[57] E. Mazor, A. Averbuch, Y. Bar-Shalom, and J. Dayan, “Interacting multiple model methods in tracking: A survey,” in IEEE Transactions on Aerospace and Electronic Systems, vol. 34(1), 1998, pp. 103–123.

[58] D. Lainiotis, “Partitioning: A unifying framework for adaptive systems i: Estimation,” in Proceedings of the IEEE, vol. 64, August 1976, pp. 1126–1142.
[59] Y. Bar-Shalom and X. Li, Estimation and Tracking: Principles, Techniques and Software. Boston, Massachusetts: Artech House, 1993.
[60] D. Sworder and J. Boyd, “Estimation problems in hybrid systems,” in Cambridge University Press, 1999.
[61] H. Blom and Y. Bar-Sharlom, “The interacting multiple model algorithm for systems with markovian switching coefficients,” IEEE Transactions on Automatic Control, vol. 33, no. 8, pp. 780–783, August 1988.
[62] X. Li and Y. Bar-Shalom, “Design of an interacting multiple model algorithm for air traffic control tracking,” IEEE Transactions on Control Systems Technology, vol. 1, no. 3, pp. 186–194, September 1993.
[63] P. Menon, G. Sweriduk, and B. Sridhar, “Optimal strategies for free flight air traffic conflict resolution,” Journal of Guidance, Control, and Dynamics, vol. 22, no. 2, pp. 202–211, 1999.
[64] J. Krozel, M. Peters, K. Bilimoria, C. Lee, and J. Mitchell, “System performance characteristics of centralized and decentralized air traffic separation strategies,” in Fourth USA/Europe Air Traffic Management Research and Development Seminar, 2001.
[65] Y.-J. Chiang, J. Klosowski, C. Lee, and J. Mitchell, “Geometric algorithms for conflict detection/resolution in air traffic management,” in 36th IEEE Conference on Decision and Control, San Diego, CA, December 1997, pp. 1835–1840.
[66] R. Paielli and H. Erzberger, “Conflict probability estimation for free flight,” Journal of Guidance, Control, and Dynamics, vol. 20, no. 3, pp. 588–596, 1997.
[67] M. Prandini, J. Hu, J. Lygeros, and S. Sastry, “A probabilistic approach to aircraft conflict detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 4, pp. 199–219, 2000.
[68] “Faa grants for aviation research program solicitation,” 2011. [Online]. Available: http://www.tc.faa.gov/logistics/grants/
[69] K. Park, N. Singhal, M. Lee, S. Cho, and C. Kim, “Design and performance evaluation of image processing algorithms on gpus,” IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 91–104, January 2011.
[70] S. Reddaway, W. Meilander, J. Baker, and J. Kidman, “Overview of air traffic control using an simd cots system,” in Proc. of the International Parallel and Distributed Processing Symposium (IPDPS’05), Denver, CO, April 2005.
[71] K. Batcher, “Staran/radcap hardware architecture,” in Sagamore Computer Conf. on Parallel Processing, 1973, pp. 147–152.

[72] ——, “The multi-dimensional access memory in staran,” in Sagamore Computer Conf. on Parallel Processing, 1975, pp. 167–168.
[73] ——, “The flip network in staran,” in International Conf. on Parallel Processing, 1976, pp. 65–71.
[74] ——, “Staran series e,” in International Conf. on Parallel Processing, 1977, pp. 140–143.
[75] ——, “The multi-dimensional access memory in staran,” IEEE Transactions on Computers, vol. C-26, no. 2, pp. 174–177, Feb 1977.
[76] H. Wang and R. Walker, “Implementing a scalable asc processor,” in Proc. of the 17th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing), Nice, France, April 2003, pp. abstract on page 267, full text on CDROM.
[77] ——, “Implementing a multiple-instruction stream associative masc processor,” in Proc. of the 18th International Conference on Parallel and Distributed Computing and Systems (PDCS), Dallas, Texas, November 2006, pp. 460–465.
[78] R. Walker, J. Potter, Y. Wang, and M. Wu, “Implementing associative processing: Rethinking earlier architectural decisions,” in Proceedings of the 15th International Parallel and Distributed Processing Symposium (Workshop on Massively Parallel Processing), San Francisco, CA, April 2001, pp. abstract on p. 195, full text on accompanying CDROM.
[79] K. Schaffer and R. Walker, “A prototype multithreaded associative simd processor,” in Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS)-Workshop on Advances in Parallel and Distributed Computing Models (APDCM), Long Beach, CA, March 2007, p. 228.
[80] ——, “Using hardware multithreading to overcome broadcast/reduction latency in an associative simd processor (extended version),” Parallel Processing Letters, vol. 18, no. 4, pp. 491–509, Dec 2008.
[81] ——, “Using hardware multithreading to overcome broadcast/reduction latency in an associative simd processor,” in Proc. 22nd Int’l Parallel and Distributed Processing Symp. (IPDPS)-Workshop on Large-Scale Parallel Processing (LSPP), Miami, FL, April 2008, p. 289.
[82] “Clearspeed technology plc. clearspeed whitepaper: Csx processor architecture,” 2007. [Online]. Available: http://www.clearspeed.com/docs/resources/
[83] “Clearspeed technology plc. clearspeed whitepaper: Clearspeed software description,” 2007. [Online]. Available: https://support.clearspeed.com/documents/
[84] “Clearspeed technology plc. csx600 runtime software user guide,” 2007. [Online]. Available: https://support.clearspeed.com/documents/

[85] J. Gustafson and B. Greer, “Clearspeed whitepaper: Accelerating the intel math kernel library,” Tech. Rep., 2007. [Online]. Available: http://www.clearspeed.com/docs/resources/ClearSpeedIntelWhitepaperFeb07.pdf
[86] K. Schaffer, “Asc library for clearspeed,” 2012. [Online]. Available: http://www.cs.kent.edu/∼kschaffe/asc/
[87] E. Eddey and W. Meilander, “Application of an associative processor to aircraft tracking,” in Proceedings of the Sagamore Computer Conference on Parallel Processing. Springer-Verlag, Aug 1974, pp. 417–428.
[88] “The video of the performance of the staran at dulles is posted at,” 2012, this is a reproduction of part of the original 16mm film made in the 1980’s. If a complete professional restoration is feasible, it will also be posted here. [Online]. Available: http://www.cs.kent.edu/∼jbaker/ATC/andXXXXatElseviersite.
[89] A. Marowka, “Back to thin-core massively parallel processors,” IEEE Computer Journal, vol. 44, no. 12, pp. 49–54, December 2011.
[90] W. Chantamas, J. Baker, and M. Scherger, “An extension of the asc language compiler to support multiple instruction streams in the masc model using the manager-worker paradigm,” in Proc. of the 2006 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2006), June 2006, pp. 521–527.
[91] W. Chantamas and J. Baker, “A multiple associative model to support branches in data parallel applications using the manager-worker paradigm,” in Proc. of the 19th International Parallel and Distributed Processing Symposium (WMPP Workshop), April 2005, pp. 266–273.
[92] J. Potter, ASC Software, 1992, includes a Primer, Windows Compiler, and Windows Emulator; can be downloaded at: [Online]. Available: http://www.cs.kent.edu/∼parallel/
[93] M. Jin and J. Baker, “Two graph algorithms on an associative computing model,” in International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), Las Vegas, June 2007, p. 7 pages.
[94] M. Atwah and J. Baker, “An associative dynamic convex hull algorithm,” in Proc. of the Tenth IASTED International Conference on Parallel and Distributed Computing and Systems, Las Vegas, NV, October 1998, pp. 250–254.
[95] ——, “An associative implementation of a parallel convex hull algorithm,” in Proc. of the 15th International Parallel and Distributed Processing Symposium (IEEE Workshop on Massively Parallel Processing), San Francisco, CA, April 2001, pp. abstract on page 64, full text on CDROM.
[96] ——, “An associative static and dynamic convex hull algorithm,” in Proc. of the 16th International Parallel and Distributed Processing Symposium (IEEE Workshop on Massively Parallel Processing), Ft. Lauderdale, FL, April 2002, pp. abstract on page 249, full text on CDROM.

[97] M. Atwah, J. Baker, and S. Akl, “An associative implementation of graham’s convex hull algorithm,” in Proc. of the Seventh IASTED International Conference on Parallel and Distributed Computing and Systems, Washington D.C., October 1995, pp. 273–276.
[98] ——, “An associative implementation of classical convex hull algorithm,” in Proc. of the Eighth IASTED International Conference on Parallel and Distributed Computing and Systems, Chicago, IL, October 1996, pp. 435–438.
[99] M. Esenwein, “String matching algorithms for an associative computer,” Master’s thesis, Department of Computer Science, Kent State University, 1995.
[100] M. Esenwein and J. Baker, “Vlcd string matching for associative computing and multiple broadcast mesh,” in Proc. of the IASTED International Conference on Parallel and Distributed Computing and Systems, October 1997, pp. 69–74.
[101] S. Steinfadt and J. Baker, “Swamp: Smith-waterman using associative massive parallelism,” in IEEE Workshop on Parallel and Distributed Scientific and Engineering Computing, 2008 International Parallel and Distributed Processing Symposium (IPDPS), Miami, FL, April 2008.
[102] S. Steinfadt, M. Scherger, and J. Baker, “A local sequence alignment algorithm using an associative model of parallel computation,” in Proc. of IASTED Computational and Systems Biology (CASB 2006), Dallas, TX, Nov 2006, pp. 38–43.
[103] D. Ulm and J. Baker, “Solving a 2d knapsack problem on an associative computer augmented with a linear network,” in Proc. of the International Conference on Parallel and Distributed Processing Techniques and Applications, Sunnyvale, CA, Aug 1996, pp. 29–32.
[104] J. Lee, “Developing parallel simd algorithms for the traveling salesman problem,” Master’s thesis, Department of Computer Science, Kent State University, November 1989.
[105] P. Berra, “Some problems in associative processor applications to database management,” in Proceedings of the National Computer Conference and Exposition, May 1974, pp. 1–5.
[106] P. Berra and E. Oliver, “The role of associative array processors in database machine architecture,” IEEE Transactions on Computers, no. 4, pp. 53–61, 1979.
[107] C. Asthagiri and J. Potter, “Associative parallel lexing,” in Proceedings of the 6th International Parallel Processing Symposium (IPPS), March 1992, pp. 466–469.
[108] ——, “Parallel compilation on associative processors,” in Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques (PACT), North-Holland Publishing Co, The Netherlands, 1994, pp. 315–318.
[109] ——, “Parallel context-sensitive compilation,” Software-Practice and Experience, vol. 24, no. 9, pp. 801–822, Sept 1994.

[110] A. Bansal and J. Potter, “Exploiting data parallelism for efficient execution of logic programs with large knowledge bases,” in Proceedings of the 2nd International IEEE Conference on Tools for Artificial Intelligence, 1990, pp. 674–681.
[111] B. Reed, “An implementation of lisp on a simd parallel processor,” in First annual aerospace applications of AI, Dayton, OH, 1985, pp. 81–90.
[112] G. Steele and W. Hillis, “ lisp: Fine-grained parallel symbolic processing,” in Proceedings of the 1986 ACM conference on LISP and (LFP). New York, NY: ACM, 1986, pp. 279–297.
[113] T. Hasten, “An ops5 implementation on a simd computer,” Master’s thesis, Department of Computer Science, Kent State University, 1987.
[114] J. Potter, M. Rivett, and T. Hasten, “Rule-based systems on simd computers,” in Proceedings of ROBEXS, 1987, pp. 198–204.
[115] B. Reed, “The aspro parallel inference engine (p.i.e.): A real time production rule system,” Tech. Rep. 85-6048, 1985.

Figure 25: Overall ATC System Design.