On-line Control Architecture for Enabling Real-time Traffic System Operations

Srinivas Peeta∗ & Pengcheng Zhang

School of Civil Engineering, Purdue University, West Lafayette, IN 47907, U.S.A.

∗ To whom correspondence should be addressed. E-mail: [email protected].

Abstract: Advances in information technology and inexpensive high-end computational power are motivating a new generation of methodological paradigms for the efficient information-based real-time operation of large-scale traffic systems equipped with sensor technologies. Critical to their effectiveness are the control architectures that provide a blueprint for the efficient transmission and processing of large amounts of real-time data, and consistency-checking and fault tolerance mechanisms to ensure seamless automated functioning. However, the lack of low-cost, high-performance, and easy-to-build computing environments is a key impediment to the widespread deployment of such architectures in the real-time traffic operations domain. This paper proposes an Internet-based on-line control architecture that uses a Beowulf cluster as its computational backbone and provides an automated mechanism for real-time route guidance to drivers. To investigate this concept, the computationally intensive optimization modules are implemented on a low-cost sixteen-processor Beowulf cluster and a commercially available supercomputer, and the performance of these systems on representative computations is measured. The results highlight the effectiveness of the cluster in generating substantial computational speedups, and suggest that its performance is comparable to that of the more expensive supercomputer.

1 INTRODUCTION

The utilization of advanced technologies in intelligent transportation systems (ITS) enables the implementation of closed-loop control architectures to improve the efficiency, accessibility, reliability, and safety of transportation systems. A key ITS technology, advanced traveler information systems (ATIS), envisages the provision of route guidance information to drivers using dynamic traffic assignment (DTA) models for vehicular networks equipped with sensor and information dissemination devices. Hence, such systems have access to real-time traffic flow data from the sensors and can relay useful traffic information to drivers after processing this data. This motivates the use of closed-loop control architectures for deployment. They are preferable here to open-loop architectures due to the presence of several sources of randomness that significantly affect the control outputs, including origin-destination demand, driver behavior, supply conditions, and traffic flow interactions (Peeta and Yang, 2003). The control architecture provides a blueprint for the efficient transmission and processing of time-dependent data and its usage across components, consistency-checking modules and fault tolerance mechanisms to ensure seamless automated functioning, and information supply strategies to enhance system performance. Due to the need to process large amounts of traffic data and generate information supply strategies in real-time, the associated control architectures are computationally intensive, which can represent a key barrier to their deployment.

Previous efforts to address the computational barriers to the deployment of real-time control strategies in transportation networks have targeted both the algorithmic logic and the computing environment. In the context of real-time route guidance, the algorithmic aspects, addressed under the DTA umbrella, have focused on: (i) developing centralized iterative frameworks with truncated horizons (Peeta and Mahmassani, 1995), (ii) decentralizing the solution procedure by decomposing the traffic network into small zones (Hawas and Mahmassani, 1997; Chiu and Mahmassani, 2002), (iii) developing reactive solution strategies (Hawas and Mahmassani, 1997; Pavlis and Papageorgiou, 1999; Peeta and Yang, 2003), (iv) developing computationally more efficient solutions (Mahmassani et al., 1998a), and (v) using hybrid models that combine computationally intensive off-line solutions with efficient real-time strategies (Peeta and Zhou, 2002). Centralized iterative frameworks utilize real-time system measurements along with detailed predictions of future network states for deployment decisions. They typically predict and use the projected (experienced) travel times rather than current (instantaneous) travel times in their algorithmic logic. Hence, they are generally computationally intensive and difficult to implement in real-time. By contrast, decentralized solution strategies typically limit their analysis to a small area by using approximate estimates of the traffic conditions beyond the area under consideration. Although this significantly reduces the computational complexity, it may underestimate the effects of congestion outside that area. Reactive route guidance strategies attempt to address this issue by considering only current measurements rather than future conditions or historical data. Unlike iterative strategies, they typically use instantaneous travel times as approximations to experienced travel times. Consequently, iterative strategies are generally more accurate than reactive strategies, except under certain scenarios (Pavlis and Papageorgiou, 1999) in which the instantaneous travel time approximation is reasonable. Hybrid strategies (Peeta and Zhou, 2002) exploit the advantages of iterative and reactive frameworks by combining the computationally intensive off-line solutions with efficient on-line reactive strategies. They use historical data to generate robust initial solutions off-line that are efficiently updated in real-time based on the unfolding traffic conditions.

There have been several efforts to address the computational aspects related to real-time traffic operations by focusing on the computational environments, both in terms of the hardware and the operating configuration. They include: (i) implementing the models on expensive specialized high performance computing hardware (for example, Chang et al. (1994) and Habbal et al. (1994) on the Connection Machine; Peeta and Mahmassani (1995) and Ziliaskopoulos et al. (1997) on the CRAY), (ii) configuring a cluster of individual workstations into a distributed system (Peeta and Chen, 1999), and (iii) using efficient enabling environments such as CORBA (Mahmassani et al., 1998b; Ziliaskopoulos and Waller, 2000). Although supercomputers and/or other sophisticated computing architectures may address the computational needs of real-time route guidance algorithms, they are usually prohibitively expensive and typically beyond the budget of most local transportation agencies. Also, even if such agencies could access these computing systems, the associated high costs preclude dedicated usage for their specific operations only. Further, these specialized hardware architectures are typically not customizable and may not be optimally configured for use by these algorithms. In addition, they typically entail system-specific skills for monitoring and maintenance, and specialized software to operate. The advances in information technology and the growing popularity of open source software, coupled with low-cost high-end computing power, provide opportunities to explore new paradigms for affordable and efficient high performance computing to enable the deployment of on-line control strategies in a wide range of transportation networks.

This paper presents a real-time information-based traffic system control architecture for route guidance that uses a Beowulf cluster as its central computing unit. Section 2 introduces the control architecture, the control flow logic, and the associated computing and data storage paradigms. Section 3 first demonstrates the performance capabilities of the Beowulf cluster through a series of experiments. It then analyzes the performance of the control architecture in an off-line mode by using different parallel paradigms to execute an important component, the path processing algorithm. Concluding comments are presented in Section 4.

2 ON-LINE TRAFFIC SYSTEM CONTROL ARCHITECTURE

Figure 1 shows the conceptual structure of the proposed information-based real-time traffic control architecture for route guidance. Here, one or more transportation agencies utilize the computing power of a single centrally located unit to generate information dissemination strategies that provide routing information to the network users. While the notion of a remotely located computing unit is not essential for the proposed architecture, there are some advantages to this configuration. First, it enables the system operators (for example, the traffic control centers) within a geographic region to pool resources so as to cut costs. Second, the maintenance of the computing unit can be performed by on-site specialists, leading to greater reliability in terms of the functioning of the hardware. Third, changes in the operational logic and/or traffic control strategies can be seamlessly performed by dedicated on-site developers of the algorithmic logic and software.

As shown in Figure 1, real-time traffic data obtained from the sensors installed in each on-line network is fed through the Internet or other communication media to the control algorithms located on the central computing unit for generating route guidance strategies. While sensor data can be first sent to the traffic control center (TCC) and then channeled to the computing unit, there are two disadvantages to such a paradigm. First, it introduces unnecessary delays without tangible gains. Second, it introduces another layer of reliability concerns in terms of the potential failure of the communication link between the TCC and the computing unit. For this reason, it may be desirable to have redundancy in the communication links. For example, in addition to the Internet, a dedicated telephone line or a satellite link can be used to enhance communication reliability. The control algorithms located at the computing unit generate information dissemination strategies for implementation and send them to the TCCs. The TCCs disseminate this information to the network users using information dissemination media such as variable message signs, radio, and in-vehicle navigation systems. The sensor data and the traffic control strategies are then archived in the corresponding TCC databases. This serves an important fault tolerance role when the communication links between the TCC and the computing unit fail, or when the computing unit is non-functional. In such scenarios, data from the TCC database is used to generate quick, but possibly less robust, fallback information dissemination strategies.

The conceptual structure for the proposed control architecture can be mapped to the classical closed-loop control framework. The on-line traffic network, typically a complex system, is the object of control, and is denoted as the plant. The TCC is the controller in this context, and seeks to guide the plant to desirable states by providing control inputs (information dissemination strategies). The controller utilizes output measurements from the plant (the field traffic data) as a feedback signal to adjust the control inputs.
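This mapping can be summarized programmatically. The following skeleton is a minimal sketch of the closed-loop cycle under the above correspondence; the class and method names are illustrative placeholders, not components of the actual deployed system.

```python
class TrafficNetwork:
    """The plant: the on-line traffic network."""

    def measure(self):
        # Return sensor readings (link speeds, volumes, densities, ...).
        ...

    def apply(self, strategy):
        # Network conditions evolve as drivers respond to disseminated information.
        ...


class ControlCenter:
    """The controller: the TCC supported by the computing unit."""

    def compute_strategy(self, feedback):
        # Map feedback measurements to an information dissemination strategy.
        ...


def closed_loop(plant: TrafficNetwork, controller: ControlCenter, cycles: int):
    for _ in range(cycles):
        feedback = plant.measure()                        # plant output measurements
        strategy = controller.compute_strategy(feedback)  # control inputs
        plant.apply(strategy)                             # plant responds to the control
```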

The conceptual structure illustrated in Figure 1 can be generalized using an on-line architecture to exert control in large-scale traffic systems equipped with advanced sensor and information dissemination systems. The components of the on-line architecture and the associated logic are shown in Figure 2, and are described hereafter.

2.1 Components

Traffic Control Center: The TCC is the controller in the control framework, and is the hub for system operations. It aims at enhancing the network performance by providing routing information to network users. It is a key functional component of the architecture because the effectiveness of the control strategies deployed by it can significantly influence the on-line network conditions. The TCC also monitors the traffic system in real-time and identifies problematic locations. It can also collect and archive field data into databases for future use. A special case of the proposed on-line control architecture is when the data processing to generate deployable control strategies is performed at the TCC. In the current context, the TCC obtains a set of feasible control strategies from the computing unit and determines the information dissemination strategy to be deployed. Also, by monitoring the system continuously, it can detect traffic incidents and trigger the response operations.

On-Line Traffic Network: This is the plant in the control framework. The on-line traffic network conditions are performance indicators and the basis for feedback control. Real-time data from the traffic network represents a key element for enabling the on-line control architecture. It plays the role of feedback in the closed-loop control paradigm adopted here.

Traffic System Control Models: They are used to generate the control inputs for the control framework. These models represent the collection of tools necessary to process the real-time traffic data and generate information dissemination strategies. They typically include DTA algorithms and relevant support modules. As discussed in Section 1, DTA models can be computationally intensive for real-time deployment. This is because they typically entail large-scale systems optimization with an embedded traffic simulation model. A traffic simulation model is necessary to robustly replicate vehicle interactions and flow conditions. Hence, the traffic system control models represent the computationally intensive components of the architecture, and are the focal point for the design of efficient computing paradigms. Section 3 analyzes some computing paradigms to enable real-time functionality for the control models.

Information Dissemination Strategies: They are the control inputs in the control framework. They represent the strategies implemented by the TCC in real-time from among the alternatives suggested by the control models. These strategies are the primary control mechanism for the TCC. However, as the architecture is not mechanism-specific, other control mechanisms such as signal control can be seamlessly incorporated. The accuracy of the information dissemination strategies is critical to the effectiveness of the proposed architecture. This motivates the need for the feedback control loop and the consistency-checking module.

Virtual System Simulator: This component runs in real-time and aims to faithfully replicate actual traffic conditions. It emulates vehicular movements and the behavior of motorists in a traffic network to generate estimated values for the various system performance measures. It has two functions in the context of the on-line control architecture. First, it is used to estimate the traffic system conditions and compare them to the field data for consistency-checking. Second, it provides redundancy for generating traffic flow data if the field sensors malfunction and/or the communication links fail. Hence, it is an essential element for fault tolerance and serves as a replacement of the real traffic network under failure modes.

Calibration and Consistency-Checking: This component aims at bridging the gaps between the actual traffic data measurements and the traffic conditions predicted by the models embedded in the control logic. In addition to modeling limitations, these gaps can arise due to several inherently stochastic on-line factors that can significantly influence the performance of dynamic traffic networks, such as: inaccurate prediction of time-dependent origin-destination (O-D) demand, unpredicted incidents, randomness in user behavior and/or user class characteristics, noise in measurements, and failure of system components. Hence, ensuring consistency using feedback data is essential to developing a robust automated architecture, and on-line re-calibration of the prediction models may be necessary to enhance the prediction accuracy of the evolving traffic conditions. Peeta and Bulusu (1999) propose a generalized singular value decomposition approach to ensure operational consistency of on-line DTA models within a stage-based rolling horizon framework.

Fault Tolerance: Faults arise due to malfunctioning detectors and/or the failure of communication links between the detector sites and the computing unit. In order to ensure the seamless automated functioning of the architecture and the accuracy of the prescribed control strategies, the control architecture requires a capability to automatically identify such faults in real- time and trigger a fall-back strategy. Peeta and Anastassopoulos (2002) develop a fault tolerant mechanism to detect such faults automatically by examining streaming field data.

2.2 The Flow Logic

Figure 2 illustrates the flow logic for the architecture. It can be differentiated based on the normal and failure modes. The normal mode refers to the situation when all components of the on-line architecture are functioning properly. In this mode, the traffic conditions in response to the control strategies implemented by the TCC are measured using the installed sensors. The real-time traffic data and the details of the information dissemination strategies employed are sent via communication links (such as the Internet) to the computing unit, which may be located at a different site. Simultaneously, they are also archived in the TCC’s database for possible future use.

The computing unit is the location where the major computational components such as the virtual system simulator, calibration and consistency-checking modules, fault tolerance modules, and the traffic control models are executed. The field data is filtered through the fault tolerance module to determine if the system is operating in the normal or failure mode. If the normal mode is confirmed, the data is transmitted to the consistency-checking module to update parameters in the simulator and control models. The system state is updated by executing the virtual simulator using the modified parameters. The filtered field data and the simulator output data are transmitted to the traffic control models to generate the information dissemination strategies. These strategies are transmitted to the TCC using the communication links. The TCC deploys specific information dissemination strategies to complete a typical flow cycle under normal circumstances.

The failure mode is triggered by the failure of the communication links, sensors, and/or the computing hardware. When the communication link (Internet) transmitting real-time data to the computing unit from the traffic network fails, data is obtained directly from the TCC database through a dedicated data communication link. Another failure scenario is when sensors malfunction, leading to the transmittal of inaccurate traffic data. Then, the control models use data from the virtual system simulator to determine real-time traffic control strategies. The calibration and consistency-checking modules are bypassed as correct field data is unavailable.
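The mode-dependent choice of data source can be expressed compactly. The sketch below assumes boolean health indicators produced by the fault tolerance module (the predicate and parameter names are hypothetical); the actual fault detection logic follows Peeta and Anastassopoulos (2002).

```python
def select_data_source(link_ok: bool, sensors_ok: bool,
                       field_feed, tcc_database, virtual_simulator):
    """Return the data stream the traffic control models should consume."""
    if link_ok and sensors_ok:
        return field_feed          # normal mode: filtered field data
    if not link_ok:
        return tcc_database        # link failure: archived data via a dedicated link
    return virtual_simulator       # sensor failure: simulator output replaces field data
```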

2.3 Computing Paradigm

Due to the significant computational intensity of the various components executed by the computing unit, and the need for real-time or sub-real-time solutions for the TCC, a computing environment capable of efficient and cost-effective high performance computing is desirable. The Beowulf cluster (Sterling et al., 1999) represents such a computing environment. A Beowulf cluster consists of several nodes that are connected through an Ethernet switch and run system software that facilitates data communication between them. The key advantages of a Beowulf cluster are its low price and greater problem customization flexibility compared to typical commercial supercomputers with specialized communication architectures. The lowering of commodity computing hardware prices and the substantial increase in their capability levels have fostered the ability to generate comparable computing performance at 1/100th to 1/20th the typical supercomputer costs using the Beowulf paradigm. The use of robust open-source operating systems such as Linux further aids in keeping costs low. Another attractive feature of a Beowulf cluster is that existing or even old computing hardware components in a transportation agency can be beneficially configured into the cluster to enhance its computational capabilities. Additionally, the different computing units within its architecture can have different computing speeds and memory capabilities. When load balancing issues are accounted for, this flexibility provides the cluster the capability to customize itself to the specific problem being addressed, and hence fosters efficient utilization. Also, Beowulf clusters typically have a single gateway to the external network, called the master node. This ensures an additional layer of security while precluding unnecessary external data traffic on its internal network. In the context of large-scale traffic systems, a Beowulf cluster can be configured in centralized or decentralized control architectures with equal ease. Thereby, it enables individual traffic operators with smaller operational scope (such as local traffic agencies) to install mini Beowulf clusters at their locations or allows several of them to operate remotely using a centrally located large-scale cluster. For a large transportation agency, the use of a centrally located cluster can enhance operational efficiency and reduce costs by obviating the need for hardware, software, space requirements, and maintenance at each individual location.

The functions of the on-line control architecture that can be concurrently and/or sequentially performed on the Beowulf cluster over the typical time cycle of the computing unit are illustrated in Figure 3. The four layers of rectangles represent specific functions executed on the cluster. All distinct functions in a layer are executed sequentially. However, tasks of two different layers may be executed concurrently if they share the same timeline.

The first layer is data related. It includes data retrieval, formatting, and storage. Since the master node is the cluster’s only gateway to the outside world (for security reasons), it performs the function of continually receiving field traffic data. The data received is formatted for use by the virtual system simulator and the traffic control models. This task may be assigned to one or more Beowulf nodes. In order to maximize the efficiency of the cluster, it is advisable not to pre-assign tasks to individual nodes. Instead, tasks can be distributed dynamically based on the current load on each node.

The second layer of concurrency includes load balancing (task distribution), calibration and consistency-checking, traffic system control models including an embedded traffic simulator, output formatting and visualization, and output transmission. Typically performed by the master node, load balancing focuses on the mechanism of assigning individual tasks to various cluster nodes so that the maximum execution time is minimized. Apart from optimization at the cluster level, tasks can also be optimized at the individual node level. This is possible if a node has multiprocessing capabilities, that is, it contains more than one processor or a single hyperthreading-enabled processor.

Visualization is the animation component of the application and is performed by a specific node using the output data. The other functions in this layer include output formatting and transmission. Simultaneously, the output data containing the dissemination strategies to be deployed on-line is encrypted and transmitted through the Internet back to the TCC.

The traffic system control models are computationally the most intensive components of the architecture. This motivates another level of concurrency within the second layer to enable the parallel execution of these models. Processes can be distributed across cluster nodes based on the algorithmic characteristics and the network structure. Section 3 discusses the results of experiments conducted by distributing the various modules of the control models to the different Beowulf nodes.

The third layer of concurrency is fault tolerance. The last layer, the virtual system simulator, is another function that is performed by the cluster at all times. Unlike the traffic simulator embedded in the control models, this simulator seeks to continuously replicate the actual traffic network conditions. In the normal mode, it supplements the field data for the control models. In the failure mode, it plays a more pro-active and critical role by providing predicted time-dependent data to the control models, while the TCC database can only provide historical data.

2.4 Data Storage Paradigms

Data storage and transmission characteristics are important for the efficient functioning of the on-line control architecture, especially the traffic control models. These models often require multiple input/output (I/O) operations on files at various stages of the execution cycle. In a Beowulf cluster, where some or all of the constituent nodes have their own hard drives, the efficiency of these I/O operations depends significantly on the location and the manner in which files are stored and data is transmitted. There are several potential storage options due to the flexibility of the Beowulf architecture.

Figure 4 depicts three different data storage and transmission schemes. As is often the case with a Beowulf cluster, there is no single most efficient scheme of data storage and transmission. This is indicative of its customizability. The efficiency of a scheme depends on the: (i) size of data files being stored; (ii) variance in the file sizes; (iii) connection speed; (iv) number of cluster nodes; and (v) nature of file operations. For example, in Figure 4(a), separate copies of all files are stored on each node so that each processor in the cluster has access to all data from its local hard drive. Problems with this scheme are the associated increase in data storage requirements and the need for frequent data synchronization across nodes. However, it was observed through simple experiments that this scheme performs well for small applications where file synchronization is relatively simple and file sizes are so small that any parallel implementation of I/O operations provides only marginal benefits. By contrast, in Figure 4(b) a single processor is responsible for data storage, retrieval, and transmission to other nodes. While this scheme is straightforward to implement, several processors are typically idle during the data I/O phase, thereby fostering inefficiencies. In addition, it has extra overheads caused by data transmission among nodes. Therefore, it is efficient only for networks with high data transmission rates. Scheme 3, shown in Figure 4(c), uses each processor to read a portion of the data. The processors then exchange their data using message passing techniques. This scheme is thus a hybrid of the other two. The study experiments discussed in Section 3 use Scheme 3 as the data storage paradigm.

There are some factors that aid in choosing between Schemes 2 and 3. If the files are uniform in size and the cluster is small or medium-sized, Scheme 3 may be more efficient than Scheme 2. Uniformity in file size ensures that the parallelization of I/O operations is simple and does not require elaborate load balancing techniques. Also, the incorporation of only a few nodes in the cluster ensures that the network traffic does not clog the switch due to the large amount of message passing among nodes. If file sizes are significantly heterogeneous and/or the cluster size is large, a hybrid scheme involving subsets of nodes dedicated to data retrieval, storage, and transmission may be more effective.
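As an illustration, Scheme 3 maps naturally onto MPI-style message passing. The following sketch uses mpi4py and assumes the input has been pre-split into one file per processor on each node's local disk (the file naming is hypothetical); every processor reads only its own portion, and an allgather completes the exchange.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each processor reads only its assigned portion from its local hard drive.
with open(f"input_part_{rank}.dat", "rb") as f:
    local_part = f.read()

# Exchange the portions via message passing so that every processor
# ends up holding the complete data set in local memory.
full_data = b"".join(comm.allgather(local_part))
```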

3 EXPERIMENTAL ANALYSIS

Experiments are conducted on a Beowulf cluster constructed by the authors in 1999, called the Supercomputing Cluster for On-line and Real-time Control of Highways using Information Technology (SCORCH-IT), which plays the role of the computing unit in the proposed information-based on-line traffic system control architecture. SCORCH-IT has eight nodes, each of which consists of two 550 MHz Intel P-III processors and 512 MB memory. The number of processors per node depends on many factors such as inter-processor communication needs and budgetary constraints. Typically, more processors per node imply less communication overhead, but at a higher cost. Since computation and I/O are the primary factors for the problem addressed, we choose a dual-processor configuration for each node.


The experiments focus on two primary objectives that analyze the effectiveness of this computing environment for the on-line control architecture. The first set of experiments tests the efficiency of the configuration of SCORCH-IT, and compares its performance to that of a commercially available supercomputer. Discussed in Section 3.2, the parallel performance of SCORCH-IT is analyzed in terms of the computational, I/O, and communication times to provide preliminary insights for a full-scale implementation of the traffic control architecture in an off-line mode. An off-line mode implies that traffic data is obtained from the virtual system simulator as a proxy for the field data. In addition, the calibration/consistency-checking component and the fault tolerance module are bypassed as both of them address on-line functions. The off-line implementation, discussed in Section 3.3, constitutes the second set of experiments. It explores alternative parallel paradigms and provides insights for a full-scale on-line implementation of the control architecture.

The two sets of experiments are performed using a DTA algorithm (Peeta and Mahmassani, 1995), which is the key component of the traffic control models. The first set of experiments is conducted using the shortest path module in the DTA algorithm. The second set of experiments is performed using the entire DTA algorithm. The DTA algorithm and its components are discussed in Section 3.1.

The following performance measures are used in these experiments: (i) Total execution time: This is the total execution time of a program. It includes the time for computation, inter-process communication, and I/O. When averaged over several runs to account for system randomness, it is labeled average total execution time. (ii) Computational time: Computational time is the time spent only on computing. When averaged over several runs, it is labeled average computational time. (iii) Data input time: This is the time spent in reading data files from the hard drive. When averaged over several runs, it is labeled average data input time. (iv) Data output time: This is the time spent in writing output data files onto the hard drive. When averaged over several runs, it is labeled average data output time. (v) Communication time: This is the time spent on message-passing between processors. When averaged over several runs, it is labeled average communication time. (vi) Speedup: The speedup S(P) for a system with P processors indicates how much faster a problem is solved using parallelization compared to a single processor. It is defined as the ratio T(1)/T(P), where T(1) is the average total time on a single processor and T(P) is the average total time on a system with P processors. (vii) Efficiency: The efficiency E(P) is the fraction of time a typical processor is busy. It is measured as the ratio S(P)/P. Ideally, E(P) = 1, implying that on average no processor is idle during execution.
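These last two measures reduce to a pair of one-line formulas; the helper below simply restates them, with the numerical values chosen purely for illustration.

```python
def speedup(t1: float, tp: float) -> float:
    """S(P) = T(1) / T(P)."""
    return t1 / tp

def efficiency(t1: float, tp: float, p: int) -> float:
    """E(P) = S(P) / P; equals 1 when no processor is ever idle."""
    return speedup(t1, tp) / p

# Illustration: a run taking 120 s on one processor and 9.6 s on 16
# processors yields S(16) = 12.5 and E(16) ≈ 0.78.
print(speedup(120.0, 9.6), efficiency(120.0, 9.6, 16))
```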

3.1 Dynamic Traffic Assignment Algorithm

The DTA algorithm used in this study is labeled the Multiple User Classes Time-Dependent Traffic Assignment (MUCTDTA) algorithm (Peeta and Mahmassani, 1995). The MUCTDTA algorithm, conceptually illustrated in Figure 5 as adapted for this study, provides real-time route guidance by determining the time-dependent paths to be provided to users so as to optimize some system-wide and/or individual user class objectives. User classes are defined based on information accessibility, information supply strategy, and user response behavior to the supplied information. In this study, without loss of generality, only two of the potential multiple user classes are considered for the MUCTDTA algorithm. Drivers have access to information in both these classes. The first is the system optimal (SO) class, whose drivers are provided paths based on the system optimum objective. The second is the user equilibrium (UE) class, where drivers follow user optimum paths. The MUCTDTA algorithm uses driver information such as origin, destination, departure time, and user class for the computational period, to determine a path for each driver based on the individual user class objectives and a global system optimal objective.


The MUCTDTA algorithm belongs to the class of simulation-based DTA approaches (Peeta and Mahmassani, 1995). In such approaches, a traffic flow simulator is used to replicate the complex traffic flow dynamics resulting from the interactions of vehicles and the route choice decisions of drivers. Consequently, the simulation component of the MUCTDTA algorithm is computationally very intensive. This is significant because the MUCTDTA algorithm is an iterative procedure in which the simulator is used in every iteration, as illustrated in Figure 5. Hence, the DTA algorithm is a logical candidate to analyze alternative computing paradigms. In this study, the DYNASMART (Jayakrishnan et al., 1994) simulation model is used as the simulator in the MUCTDTA algorithm. Another key component of the MUCTDTA algorithm, from a computational perspective, is the set of shortest path modules. There are two such modules: the marginal shortest path problem for the SO class and the shortest path problem for the UE class. As illustrated in the figure, they constitute sub-problems of the DTA problem. They are computationally significant because they are solved several times within each iteration of the DTA algorithm, based on the number of destinations and the number of discretized time intervals that represent the period of interest. Hence, while each run of the shortest path module (that is, for a single destination and a specific time interval) is computationally insignificant compared to that of the traffic simulator, the total computational time for an iteration involving the shortest path modules can be important. Hence, the first set of experiments focuses on the shortest path algorithm (SPA), used to determine the UE class paths, to generate preliminary insights on the performance of the SCORCH-IT Beowulf cluster. The second set of experiments then addresses the computationally intensive MUCTDTA algorithm in an off-line mode for deployment in the on-line architecture. As illustrated by the algorithmic logic, the SO and UE components of the MUCTDTA algorithm can be executed in parallel. This forms the basis for the parallel paradigms analyzed in Section 3.3.


3.2 Performance Tests for the Computing Unit

The SPA is used to analyze the effectiveness and scalability of SCORCH-IT under factors (number of processors and problem size) that influence its computational performance. Its performance is also benchmarked against that of a commercially available general purpose scalable parallel supercomputer called the IBM SP2. The traffic network used in these experiments consists of 178 nodes, 441 links, and 20 destinations, as shown in Figure 6. It includes a bi-directional freeway through the entire length of the network, and an urban street network consisting of arterials and local roads on both sides of the freeway.

3.2.1 Shortest Path Algorithm. The SPA (Ziliaskopoulos and Mahmassani, 1994) computes the time-dependent shortest paths from every traffic network node to a destination node at discrete time intervals using the time-dependent link travel times obtained from the traffic simulator for the period of interest. This computation is an inherently sequential procedure that needs to be repeated for every destination node. While parallel shortest path algorithms exist (Ziliaskopoulos et al., 1997) that parallelize the algorithmic logic, such parallelization is not the focus of the experiments here. This is partly because the DTA algorithm is mostly sequential, favoring coarse-grained parallelization. Also, such parallelizations entail inherent overhead costs in terms of communication times and load imbalances. Instead, the objective is to exploit the inherently distributed configuration of a Beowulf cluster to enable real-time deployment capabilities. As the SPA is executed for each destination node, the execution of the SPA for groups of destinations is distributed to different processors to evaluate the performance of the Beowulf cluster.

The SPA, illustrated in Figure 7, consists of three distinct components: (i) READ, (ii) COMPUTE, and (iii) OUTPUT. In READ, data on the network topology and signal timings is read from the static input files. In addition, dynamic input data on the discretized time-dependent link travel times for the period of interest is obtained from the traffic simulator. The data is then stored into large arrays locally in the computer node memory to facilitate faster data retrieval and manipulation. I/O is generally not amenable to parallelization in computationally intensive applications. However, in I/O intensive modules, significant savings in the total execution time can be obtained by assigning the available processors to read different parts of an input file and then exchanging the accessed data through message passing. The COMPUTE component determines the time-dependent shortest paths from all traffic nodes to the current destination node. Since the shortest paths are computed by destination, there is no data exchange or update across different destination nodes. Thereby, the SPA procedure for destination nodes is amenable to distributed computing at a coarse level. The shortest paths for each destination are written to separate output files through the OUTPUT component.
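For concreteness, the COMPUTE step can be sketched as a backward, label-correcting recursion in the spirit of Ziliaskopoulos and Mahmassani (1994). The data layout below (a predecessor dictionary, integer time steps, a callable for link travel times, and clamping of arrivals beyond the horizon) is an illustrative assumption, not the study's implementation.

```python
from collections import deque

def td_shortest_paths(pred, tt, dest, T):
    """pred[j]: list of upstream nodes i of node j; tt(i, j, t): traversal time
    (in time steps) of link (i, j) when entered at step t; returns cost[i][t],
    the minimum travel time from node i to dest when departing i at step t."""
    nodes = set(pred) | {i for ps in pred.values() for i in ps} | {dest}
    INF = float("inf")
    cost = {i: [INF] * T for i in nodes}
    cost[dest] = [0.0] * T                      # zero remaining cost at the destination
    scan_list = deque([dest])                   # scan-eligible nodes
    while scan_list:
        j = scan_list.popleft()
        for i in pred.get(j, []):
            improved = False
            for t in range(T):
                arrival = min(t + tt(i, j, t), T - 1)   # clamp beyond the horizon
                cand = tt(i, j, t) + cost[j][arrival]
                if cand < cost[i][t]:
                    cost[i][t] = cand
                    improved = True
            if improved and i not in scan_list:         # re-scan upstream of i
                scan_list.append(i)
    return cost
```

Because the labels for one destination never touch those of another, one such call per destination can be farmed out to different processors, which is precisely the coarse-grained distribution evaluated below.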

The following two factors were used as control variables in the experiments: 1) Period of Interest: the duration of time for which time-dependent shortest paths are desired. It is a proxy for the problem size. The durations considered are 30, 60, and 100 minutes, and the paths are computed every 5 minutes. 2) Number of Processors: between 1 and 16 processors (1, 2, 4, 8, 12, and 16) are used in the study experiments.

3.2.2 Experiment 1. The first objective of experiment 1 is to test the performance of SCORCH-IT in terms of I/O, communication, and computational times for the parallel execution of the SPA over different numbers of processors. The second objective is to benchmark the performance of SCORCH-IT against that of the IBM SP2. The SP2 consists of 64 nodes. Each node has four 375 MHz POWER3-II processors and 4 GB memory. The nodes are interconnected by a 2 Gbps multistage, packet-switched connection for inter-processor communication. Each node contains its own copy of the system software specifically designed for the IBM SP2, allowing the parallel capabilities of the machine to be effectively exploited. In 1999, such commercial supercomputers typically cost $500,000 to $1,000,000, compared to $10,000 for SCORCH-IT. Hence, even if SCORCH-IT is scaled to 64 nodes, the disparity in costs would be significant. While SCORCH-IT is equipped with slightly faster processors, the IBM SP2 has superior I/O and communication capabilities due to its specialized architecture.

Figure 8 shows the distribution logic of the SPA implemented on SCORCH-IT and the IBM SP2. The algorithmic parameters and distribution scheme of the SPA implementation are identical under both computing environments. However, SCORCH-IT has only two processors per node while the IBM SP2 has four processors per node. This can generate an advantage for the IBM SP2 due to the faster intra-node communication rates. Hence, to ensure equitable performance comparisons, only two processors per IBM SP2 node are used in all experiments.

As shown in the figure, the task distribution scheme of the SPA on n processors allocates each process to one processor. Here, a “process” refers to a stream of sequential instructions for a single task (for example, the calculation of the shortest paths for a destination node is a single process). The data storage paradigm Scheme 3 discussed in Section 2.4 is deployed here. First, each processor reads its assigned portion of the input data from the local hard drive of the associated computer node, and broadcasts this data to all the other processors. After completion of data broadcasting, each processor aggregates the data received and stores it into arrays in the local computer node memory. Next, each process is assigned a destination for the computation of time-dependent shortest paths. A variable D is used as a pointer to identify the next available destination node for which the shortest paths need to be computed. This ensures that each destination node is processed only once, and that this procedure terminates when all destination nodes are processed. Once a processor is assigned the next available destination node D, it increases the value of D by 1 and broadcasts this information to all other processors to ensure consistency in concurrency. After the shortest paths for this destination are computed, they are written to an output file on the local hard drive of the associated computer node. The procedure is repeated until the time-dependent shortest path computations for all traffic destination nodes have been completed. This is not necessarily the best parallelization scheme, as some processors may finish faster than others and remain idle during parts of the total program execution time. A better scheme involves the dynamic allocation of tasks to processors when they become idle. This may result in an unequal distribution of tasks across processors, but ensures greater efficiency.
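The dynamic allocation variant is easy to express in a master-worker form. In the mpi4py sketch below, rank 0 hands the next unprocessed destination to whichever worker reports idle, so faster workers simply receive more destinations; the message tags and the placeholder for the shortest path computation are illustrative assumptions.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
NUM_DESTINATIONS = 20        # the test network in Section 3.2 has 20 destinations
DONE = -1

if rank == 0:
    # Master: serve destination indices on demand until all are assigned.
    next_dest, active_workers = 0, comm.Get_size() - 1
    while active_workers > 0:
        status = MPI.Status()
        comm.recv(source=MPI.ANY_SOURCE, tag=0, status=status)   # "I am idle"
        if next_dest < NUM_DESTINATIONS:
            comm.send(next_dest, dest=status.Get_source(), tag=1)
            next_dest += 1
        else:
            comm.send(DONE, dest=status.Get_source(), tag=1)
            active_workers -= 1
else:
    # Worker: request a destination, compute its paths, write locally, repeat.
    while True:
        comm.send(None, dest=0, tag=0)
        d = comm.recv(source=0, tag=1)
        if d == DONE:
            break
        # td_shortest_paths(...) for destination d would run here, with the
        # result written to an output file on the node's local hard drive.
```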

Figure 9 illustrates the computational performances of SCORCH-IT and the SP2. Since the COMPUTE component can be easily distributed at a coarse level for each destination, its execution time is inversely proportional to the number of processors used on both machines. Small deviations from this relationship were observed, and are attributed to the load imbalance of the task distribution. The I/O component of the SPA was parallelized; however, the associated time is not significant except for the case where all 16 processors are used. In the 16-processor scenario, both I/O and communication times are perceptible percentages of the total execution time. The figure also indicates that SCORCH-IT provides substantial speedups and performs comparably to the IBM SP2 when different numbers of processors are utilized. It can also be observed that the marginal total execution time savings decrease with an increase in the number of processors, especially for the IBM SP2. This is because the data communication needs and the load imbalance increase with the number of processors used. The dynamic load balancing mechanism deployed in SCORCH-IT alleviates such overheads to some extent, leading to better performance compared to the IBM SP2 as the number of processors increases.

Figure 9 also indicates that the IBM SP2 spends marginally less time than SCORCH-IT on I/O and communication activities. This is due to the larger memory available to each processor and the faster inter-node switch architecture. By contrast, SCORCH-IT performs better than the IBM SP2 in terms of computational times due to its faster processors. More generally, in problems where the solution implementation logic has substantial data exchange needs, the IBM SP2 may have an advantage in terms of the total execution time. However, this can be offset by using a faster Ethernet switch in the Beowulf cluster to enhance the inter-node communication rate. This issue is examined by transmitting artificial data of different sizes across computer nodes concurrently with the SPA execution. Two switches are used to test the performance of SCORCH-IT in terms of the total execution time. The “slower” switch has a 100 Mbps communication rate, and the “faster” one has a 1 Gbps rate. Figure 10 illustrates the performance of SCORCH-IT and the IBM SP2 under the various communication data sizes. The IBM SP2 exhibits little sensitivity to the data size because of its higher switch speed (2 Gbps). However, the performance of SCORCH-IT with the slower switch is very sensitive to the increased data communication: the benefit of computation representing a large fraction of the total execution time is offset by the increasing communication overheads. This results in the overall execution time for SCORCH-IT under the slower switch being higher than that of the IBM SP2 beyond a threshold data size. That is, the computational time gains for SCORCH-IT due to its faster processors are overshadowed by the communication time efficiencies of the IBM SP2. However, when the faster switch is used in SCORCH-IT, the ten-fold gain in data communication rates along with the faster processor speeds more than compensates for the significantly higher IBM SP2 data switch rate. Thereby, not only is SCORCH-IT with the faster switch less sensitive to the communication data size, it also performs perceptibly better than the IBM SP2, as shown in Figure 10.

3.2.3 Experiment 2. This experiment analyzes the performance of SCORCH-IT in executing the distributed SPA under varying problem sizes. As discussed earlier, the period of interest is used as a proxy for problem size. The three periods of interest (30, 60, and 100 minutes) were considered, and the SPA was executed using all 16 SCORCH-IT processors and the slower switch. Figure 11 indicates that the fraction of the computational time as part of the total execution time increases with the period of interest. This results in more tasks being distributed to the multiple processors. Further, larger problem sizes facilitate better load balancing. Hence, the computational time component displays a sub-linear trend. This is also advantageous from a parallelization standpoint, as speedups increase linearly from 7.53 to 12.53. The I/O time component dominates the communication time component as the period of interest increases.

Another issue associated with the increase in problem size is the influence of memory size on computational performance. Each cluster node of SCORCH-IT is currently equipped with 512 MB of memory, which is sufficient to handle larger problem sizes than those tested for the SPA. However, for more general applications and/or when the cluster is utilized for several different operations simultaneously, memory expansion may be computationally beneficial.

3.3 Off-line Implementation of the Traffic Control Architecture

Experiments are conducted to analyze the integrity of the on-line control architecture by using data transmission to integrate the closed-loop control system. In addition, the performance of two alternative parallel paradigms for the control models, computationally the most intensive component of the architecture, is analyzed to derive insights for the on-line implementation. As discussed earlier, due to the unavailability of field traffic flow data, the virtual traffic simulator is used as a proxy for a real-world traffic system to generate data for executing the control models. Also, since the analysis is performed in the off-line mode, the fault tolerance and consistency-checking components are bypassed, and the focus is on generating insights for the real-time implementation of the DTA algorithm. Two coarse-grained parallel versions of the MUCTDTA algorithm are executed on SCORCH-IT to evaluate and benchmark the computational performance of its components in an off-line mode, and to compare the relative efficacy of the different distribution paradigms.

Under the off-line test mode, a virtual traffic simulator is embedded on a remotely located PC or workstation as a proxy for the real traffic network. The network structure and the real-time O-D demand data for the traffic network are the inputs for the simulator. The simulator generates data such as time-dependent link travel times, vehicle locations, and link traffic volumes. This data is transmitted to SCORCH-IT, which is located remotely from the simulator PC, through an intranet connection link with a 10 Mbps bandwidth. The MUCTDTA algorithm located on SCORCH-IT uses the virtual real-time traffic data as input, and generates traffic control strategies for deployment by the TCC. These strategies are then sent to the virtual traffic simulator through the same intranet connection link to close the control loop. This procedure is repeated using a rolling horizon implementation for the period of interest.
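The rolling horizon cycle just described can be summarized in a few lines. The sketch below is illustrative only: the 5-minute stage length is taken from Section 3.3.1, while the object and method names are hypothetical stand-ins for the simulator, the MUCTDTA solver, and the TCC.

```python
STAGE_MINUTES = 5        # data exchange interval in the rolling horizon procedure
PERIOD_MINUTES = 60      # period of interest in the off-line experiments

def rolling_horizon(simulator, muctdta, tcc):
    for stage_start in range(0, PERIOD_MINUTES, STAGE_MINUTES):
        field_data = simulator.advance(STAGE_MINUTES)        # proxy for sensor data
        strategies = muctdta.solve(field_data, stage_start)  # runs on SCORCH-IT
        tcc.deploy(strategies)                               # information dissemination
        simulator.apply(strategies)                          # closes the control loop
```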

The traffic network used for the experiments in Section 3.2 is used here as well. The period of interest is set to 60 minutes, and the traffic demand is generated over the first 30 minutes. The total number of vehicles generated for the period of interest is 14,153. The SCORCH-IT configuration uses the faster switch for communications between its computer nodes.

3.3.1 Data Transmission Tests. SCORCH-IT and the PC hosting the virtual traffic network simulator are connected through a 10 Mbps intranet. However, the actual data transmission speed is subject to a certain degree of randomness caused by the varying loads on the local-area network. To generate a robust estimate of the performance in terms of data transmission times, a 10 MB data file was transmitted over forty-eight 30-minute intervals for three days between SCORCH-IT and the PC with the virtual traffic system simulator. The results show that the actual transmission speed varies from about 0.5 Mbps to 4 Mbps, depending on the amount of total data traffic on the intranet.

In the off-line implementation of the traffic control architecture discussed in Section 3.3.2, data is exchanged between the virtual traffic simulator and the control model component every 5 minutes as part of the rolling horizon procedure. The data transmitted from the virtual traffic simulator includes the time-dependent traffic flow data such as speeds, densities, volumes, and vehicle types. The data transmitted from the control model component to the virtual traffic simulator is the set of suggested information dissemination strategies. The average size of the data file transmitted for the test traffic network in these experiments is around 2 MB. The experimental results indicate that the transmission times of these files vary from 5 seconds to 30 seconds in each direction for each 5-minute time period. Considering that new data is needed every 5 minutes, this transmission rate can ensure the real-time deployment of the on-line control architecture using the intranet. However, the results cannot be generalized unless dedicated or low-load data transmission links are available in an actual implementation. Also, the data transmission times depend on the problem scale and the communication link bandwidth.
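As a rough consistency check on these figures (treating 2 MB as 16 megabits and ignoring protocol overhead), the observed 0.5-4 Mbps throughput range brackets the reported 5-30 second transmission times:

```python
FILE_MEGABITS = 2 * 8                     # a 2 MB file expressed in megabits
for rate_mbps in (0.5, 4.0):              # observed intranet throughput extremes
    print(f"{rate_mbps} Mbps -> {FILE_MEGABITS / rate_mbps:.0f} s")
# 0.5 Mbps -> 32 s;  4.0 Mbps -> 4 s
```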

3.3.2 Comparison of Alternative Parallel Paradigms for the Off-line Implementation of the Traffic Control Architecture. As discussed in Section 3.1 and illustrated in Figure 5, the MUCTDTA algorithm excluding the traffic simulator can be split into two major components, the SO and the UE user class modules. There is no data dependence between these two components. Hence, they can be executed independently of each other. There are two primary subtasks in the SO component: (i) computing the time-dependent marginal travel times for each link, and (ii) computing the time-dependent marginal shortest paths for the SO class vehicles. They are executed sequentially and account for over 80% of the total execution time for the SO component in the current experiments. The major subtask of the UE component is the computation of the time-dependent shortest paths for the UE class vehicles. As discussed in Section 3.2, the time-dependent shortest path and marginal shortest path algorithms are themselves parallelizable. As indicated by Figure 5, the various subtasks other than these two algorithms inside each user class component must be executed sequentially, while the components themselves can be executed concurrently. Hence, the MUCTDTA algorithm contains both parallelizable and sequential tasks, and so do the SO and UE components.


To ensure logical correctness and the synchronization of the various components, a parallel implementation of a program must effectively coordinate multiple processes. Here, two parallel programming paradigms, SPMD (Single-Program-Multiple-Data) and MIMD (Multiple-Instruction-Multiple-Data), are tested for the MUCTDTA algorithm. Under the SPMD paradigm, different processors execute the same program at any time point, but with different data sets. Under the MIMD paradigm, different processors can simultaneously execute different instructions/programs on different data sets. Each process runs under the control of its own instruction sequence, but is not totally independent of other processes because they may access or modify the same copy of data. Hence, synchronization is enforced by mechanisms such as a lock, which permits only one process to access the data at any instant.

Since the execution of the SO and UE components lacks data dependence, synchronization issues do not exist between these components. However, synchronization needs to be ensured within each component. For the parallel implementation of the MUCTDTA algorithm using the SPMD paradigm, the algorithm is executed in the same order as in its sequential form. The SO and UE components are executed one after the other. The only parallelizable subtasks, the shortest and marginal shortest path modules, are executed using the 16 processors of SCORCH-IT. At any instant, all processors execute the same program, though they may not be synchronized for every instruction in the program. The flow chart of the SPMD paradigm for the MUCTDTA algorithm is illustrated in Figure 12(a).

In the MIMD paradigm implementation, the SO and UE components are executed at the same time but on different sets of processors. After the input data is read, the 16 processors are divided into two groups based on the expected load to ensure less processor idle time and higher efficiency. Figure 12(b) illustrates this implementation, where m processors are used to execute the SO modules and (n-m) processors are used for the UE modules, n being the total number of processors. This is a coarse-grained parallelization and is efficient because the two components are independent of each other. The marginal travel time computation and marginal shortest path modules in the SO component, and the shortest path module in the UE component, are further distributed to the m and (n-m) processors, respectively.
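In MPI terms, this split corresponds to partitioning the world communicator into two groups. The mpi4py sketch below assumes m = 10, matching the study's allocation; the two run_* functions are placeholders for the SO and UE module chains.

```python
from mpi4py import MPI

def run_so_component(comm):      # placeholder: marginal travel times and
    pass                         # marginal shortest paths, distributed over comm

def run_ue_component(comm):      # placeholder: UE shortest paths over comm
    pass

world = MPI.COMM_WORLD
rank = world.Get_rank()
m = 10                           # processors allocated to the heavier SO component

color = 0 if rank < m else 1     # 0 -> SO group, 1 -> UE group
group = world.Split(color, rank) # independent communicator for each component

if color == 0:
    run_so_component(group)
else:
    run_ue_component(group)

world.Barrier()                  # re-synchronize before the next DTA iteration
```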

In the experiments, a loading factor parameter representing the vehicular traffic congestion level is used as a proxy for the problem size to compare the two parallel paradigms. The loading factor is an indicator of the number of vehicles generated in the traffic network during the period of interest compared to the benchmark case of a unit loading factor. A higher loading factor implies that more vehicles are generated between the various O-D pairs for the period of interest, and represents a higher computational burden for the MUCTDTA algorithm. Figure 13 indicates that as the problem size increases, parallelization of the MUCTDTA algorithm leads to an increasing computational advantage compared to the sequential implementation, where only one processor on a single node is used. Hence, the sequential implementation is akin to executing the program on a traditional single-processor PC. This trend is observed for both the SPMD and MIMD implementations. Between the two parallel paradigms, the MIMD implementation has lower execution times. This is primarily due to the different numbers of processors used by a single parallel program in the two paradigms. Under the SPMD paradigm, since all 16 processors are used for a single algorithm, some processors are idle due to the heterogeneity in the assigned tasks and the data communication needs. In the MIMD paradigm, processor idle time is substantially reduced by executing the SO and UE components in parallel. Also, since the SO component requires more computational time due to the additional marginal travel time computation module, it is allocated more processors (m=10). While the MIMD execution time savings over the SPMD implementation are perceptible, greater savings can be achieved if the fraction of parallelizable modules within each component is larger.


4 CONCLUDING COMMENTS

As ITS architectures mature, deployment demands greater attention. System integration, real-time tractability, and performance robustness are key elements of an effective deployment paradigm.

This paper presents an information-based on-line architecture for real-time traffic system control under the classical closed-loop control framework as a blueprint for deploying route guidance strategies in networks equipped with advanced traffic management and information systems. The major components of this control architecture are: (i) traffic control center, (ii) on-line traffic network, (iii) traffic system control models, (iv) information dissemination strategies, (v) virtual traffic system simulator, (vi) consistency-checking and calibration, and (vii) fault tolerance. A Beowulf cluster is proposed as the computing paradigm to address the significant computational needs of this control architecture. Additionally, the Beowulf cluster serves as a viable alternative to expensive commercially available customized computing environments. Besides providing supercomputing power at affordable costs, it also provides the flexibility to design asymmetric architectures optimized to the problem being addressed.

While DTA algorithms, consistency-checking modules, and fault tolerance paradigms have been addressed previously in limited deployment contexts, ensuring computational tractability and system integrity of the complete on-line architecture vis-à-vis deployment is essential. This study partly addresses these deployment issues. The notion of a remotely located computing unit for the architecture is analyzed through off-line experiments. The study results suggest that such an architecture is promising from a deployment standpoint while providing advantages in terms of circumventing redundancy in investment costs and maintenance needs for traffic agencies. They also emphasize the effectiveness of a Beowulf cluster paradigm in providing a flexible and customizable computing environment in this context. Synergistically, the proposed on-line traffic control architecture and its components are amenable to parallelization, leading to significant computational time savings by exploiting alternative distributed paradigms.

ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grant No. CMS-9702612. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

Chang, G.L., Junchaya, T. and Santiago, A.J. (1994), A real-time network traffic simulation model for ATMS applications: Part II – massively parallel model, IVHS Journal, 1(3), 243-259.
Chiu, Y.-C. and Mahmassani, H.S. (2002), A hybrid real-time dynamic traffic assignment approach for robust network performance, Transportation Research Record, 1783, 89-97.
Habbal, M., Koutsopoulos, H.N. and Lerman, S. (1994), A decomposition algorithm for the all-pairs shortest path problem on massively parallel computer architectures, Transportation Science, 28(4), 292-308.
Hawas, Y.E. and Mahmassani, H.S. (1997), Comparative analysis of robustness of centralized and distributed network route control systems in incident situations, Transportation Research Record, 1537, 83-90.
Jayakrishnan, R., Mahmassani, H.S. and Hu, T.-Y. (1994), An evaluation tool for advanced traffic information and management systems in urban networks, Transportation Research-C, 2(3), 129-147.
Mahmassani, H.S., Hawas, Y., Abdelghany, K., Abdelfatah, A., Chiu, Y.C. and Kang, Y. (1998a), DYNASMART-X: Analytical and algorithmic aspects, Technical Report, ST067-085-Volume II.
Mahmassani, H.S., Hawas, Y., Abdelghany, K., Chiu, Y.C., Abdelfatah, A., Huynh, N. and Kang, Y. (1998b), DYNASMART-X: System implementation and software design, Technical Report, ST067-085-Volume III.
Papageorgiou, M. (1990), Dynamic modeling, assignment, and route guidance in traffic networks, Transportation Research-B, 24, 471-495.
Pavlis, Y. and Papageorgiou, M. (1999), Simple decentralized feedback strategies for route guidance in traffic networks, Transportation Science, 33(3), 264-278.
Peeta, S. and Anastassopoulos, I. (2002), Automatic real-time detection and correction of erroneous detector data using Fourier transforms for on-line traffic control architectures, Transportation Research Record, 1811, 1-11.
Peeta, S. and Bulusu, S. (1999), A generalized singular value decomposition approach for consistent on-line dynamic traffic assignment, Transportation Research Record, 1667, 77-87.
Peeta, S. and Chen, S.-C. (1999), A distributed computing environment for dynamic traffic operations, Computer-Aided Civil and Infrastructure Engineering, 14, 239-253.
Peeta, S. and Mahmassani, H.S. (1995), Multiple user classes real-time traffic assignment for online operations: a rolling horizon solution framework, Transportation Research-C, 3(2), 83-98.


Peeta, S. and Yang, T.-H. (2003), Stability issues for dynamic traffic assignment, Automatica, 39(1), 21-34.
Peeta, S. and Zhou, C. (2002), A hybrid deployable dynamic traffic assignment framework for robust online route guidance, Networks and Spatial Economics, 2(3), 269-294.
Snir, M., Otto, S., Huss-Lederman, S., Walker, D. and Dongarra, J. (1998), MPI – The Complete Reference: Volume 1, The MPI Core (2nd edition), MIT Press, Cambridge, Massachusetts.
Sterling, T.L., Salmon, J., Becker, D.J. and Savarese, D.F. (1999), How To Build A Beowulf: A Guide to the Implementation and Application of PC Clusters, MIT Press, Cambridge, Massachusetts.
Ziliaskopoulos, A.K. and Mahmassani, H.S. (1994), A time-dependent shortest path algorithm for real-time intelligent vehicle/highway systems applications, Transportation Research Record, 1408, 94-100.
Ziliaskopoulos, A.K., Kotzinos, D. and Mahmassani, H.S. (1997), Design and implementation of parallel time-dependent least-time path algorithms for intelligent transportation systems applications, Transportation Research-C, 5, 95-107.
Ziliaskopoulos, A.K. and Waller, S.T. (2000), An Internet based geographic information system that integrates data, models, and users for transportation applications, Transportation Research-C, 8, 427-444.


[Figure: two traffic control centers (TCC 1 and TCC 2), each with a database and information dissemination strategies, connected to a central computing unit; solution strategies are sent to the TCCs via the Internet, and traffic data from on-line traffic networks 1 and 2 is fed back via the Internet.]
Fig. 1. Conceptual structure of information-based real-time route guidance.


[Figure: the traffic control center (TCC), with its database and suggested information dissemination strategies, communicates via the Internet with a computing unit comprising the traffic system control models, implemented information dissemination strategies, virtual system simulator, calibration, consistency checking, and fault tolerance (failure mode) modules; data from the on-line network arrives via the Internet and a dedicated communication link.]
Fig. 2. On-line control architecture for real-time route guidance.


[Figure: concurrent activities plotted against time, including the virtual system simulator, fault tolerance, calibration and consistency checking, traffic control models, simulator, visualization, load balancing, functionality (task distribution), task coordination, output formatting and transmission, data retrieval, input formatting, and data storage.]
Fig. 3. Concurrency diagram for the on-line control architecture.


[Figure: three data storage and transmission schemes, (a) Scheme 1, (b) Scheme 2, and (c) Scheme 3, depicting processors (1), files (2), hard drives (3), and message passing (4).]
Fig. 4. Data storage and transmission schemes.


[Figure: flow chart of the MUCTDTA algorithm. The input data initializes the path assignment proportions (iteration I = 0) for the period of interest; the traffic simulator feeds the SO branch (marginal link travel times, marginal shortest paths, other SO class modules, updated SO path assignment proportions) and the UE branch (link travel times, shortest paths, other UE class modules, updated UE path assignment proportions); the iteration counter is incremented until convergence.]
Fig. 5. The MUCTDTA algorithm.


Fig. 6. The traffic network used in the experiments.


[Figure: flow chart of the algorithm. Static data (network structure and signal settings) and dynamic data (time-dependent link travel times obtained from the traffic simulator) are read; for each destination N, the time-dependent shortest paths from all origins are computed for each discretized time interval T = 1, ..., τ, and the resulting shortest paths are output to the SP data file for N.]
Fig. 7. The time-dependent shortest path algorithm.


[Figure: flow chart of the parallel scheme for n processors P1, ..., Pn. Each processor reads its assigned data and the processors exchange data; a destination pointer D (out of M total destinations) is incremented and broadcast, and each processor runs the shortest path algorithm (SPA) for its assigned destination Di and outputs the corresponding shortest paths until all destinations are processed.]
Fig. 8. A parallel implementation scheme for the time-dependent shortest path algorithm.

[Figure: average total execution, computational, communication, and I/O times (seconds, left axis, 0-900) and speedup (right axis, 0-14) versus number of processors (0-16) for SCORCH-IT and the IBM SP2.]
Fig. 9. SPA execution times and speedup on SCORCH-IT and IBM SP2.


[Figure: average total execution time (seconds, 0-250) versus communication data size (16-480 MB) for the IBM SP2, SCORCH-IT with the slower switch, and SCORCH-IT with the faster switch.]
Fig. 10. Performance of SCORCH-IT and IBM SP2 with communication data size.


[Figure: average total execution, computational, communication, and I/O times (seconds, left axis, 0-80) and speedup (right axis, 0-14) versus the period of interest (0-120 minutes).]
Fig. 11. SPA execution times on SCORCH-IT for different periods of interest.


[Figure, two panels. Panel (a): SPMD paradigm; after the input data is read, the parallelizable SO modules are distributed to processors P(1), ..., P(n), followed by the non-parallelizable SO modules, and the UE component then follows the same pattern. Panel (b): MIMD paradigm; the SO and UE components are initialized concurrently, with the parallelizable SO modules on processors P(1), ..., P(m) and the parallelizable UE modules on P(m+1), ..., P(n), each group followed by its non-parallelizable modules.]
Fig. 12(a). SPMD parallel paradigm.
Fig. 12(b). MIMD parallel paradigm.
Fig. 12. Parallel paradigms for MUCTDTA algorithm implementation.


[Figure: total execution times (seconds, 0-900) versus loading factor (5-20) for the sequential, SPMD, and MIMD implementations.]
Fig. 13. Parallel performance for MUCTDTA algorithm using different parallel paradigms.