pd-gem5: Simulation Infrastructure for Parallel/Distributed Computer Systems for Network-Driven Optimization

By: Mohamad Alian {[email protected]}

Advisor: Prof. Nam Sung Kim

Summer 2015

Abstract

In developing and optimizing a parallel/distributed computer system, it is critical to study in detail the impact of the complex interplay amongst processor, node, and network architectures on performance and power efficiency. This necessitates a flexible, detailed, and open-source full-system simulation infrastructure, but our community lacks such an infrastructure. Responding to this need, we present pd-gem5, a gem5-based infrastructure that can model and simulate a parallel/distributed computer system using multiple simulation hosts. We show that pd-gem5 running on 6 simulation hosts speeds up the simulation of a 24-node computer system by up to 3.7× compared to simulating all 24 nodes on a single simulation host. As a use case of pd-gem5, we model a parallel/distributed computer system after enabling pd-gem5 with the default ondemand governor. While running parallel/distributed workloads, we observe that a sudden increase in network activity often immediately leads to and/or is highly correlated with a high utilization of the processors receiving the packets. However, the governor does not instantly increase the frequency of those processors. Such delayed responses often result in violations of service level agreements (SLAs), discouraging service providers from deploying an aggressive power management governor. To achieve both short response time and high energy efficiency, we propose a network-driven power management governor and evaluate its efficacy compared with the default ondemand and performance governors.

1. Introduction

The single-thread performance improvement of processors has been sluggish for the past decade as technology scaling (also known as Dennard scaling) approaches its fundamental physical limits. Thus, the importance of efficiently running applications on a parallel/distributed computer system has continued to increase, and diverse applications based on parallel/distributed computing models such as MapReduce [1] and MPI [2] have thrived.

It is intuitive that the complex interplay amongst processor, node, and network architectures strongly affects the performance and power efficiency of a parallel/distributed computer system, which is confirmed by our recent experiment using a small-scale but physical parallel/distributed computer system. In particular, we observe that all the hardware and software aspects of the network, which encompass the interface technology (e.g., Ethernet [3], RapidIO [4], and InfiniBand [5]), switch/router capability, link bandwidth, topology, traffic patterns, protocols, etc., significantly impact the processor and node activities. Therefore, to maximize the performance and power efficiency of a parallel/distributed computer system, it is critical to develop optimization strategies cutting across processor, node, and network architectures, as well as their software stacks, which necessitates full-system simulation. Doing so today, however, is challenging, as our community lacks a proper research infrastructure to study the interplay of these subsystems; we can evaluate them independently with existing tools, but not together.

Along with MARSSx86 [6], gem5 is one of the most widely used full-system simulators [7]. Since it can boot a given operating system (OS), researchers can evaluate various processor architectures while reflecting the complex interactions between the processor and the OS; such full-system simulation is feasible due to the high performance of today's computers and gem5's support for transitions between fast functional and precise cycle-level simulation modes, which allow efficient evaluation of a program's region of interest (ROI). Lastly, gem5 supports the two most widely used instruction set architectures (i.e., x86 and ARM). The current official release of gem5 supports thread-based parallelism within a single process, but it does not yet support parallelism across multiple simulation hosts. Consequently, it can model and simulate only a limited number of nodes connected through one or more simulated network switches. Moreover, gem5 provides only simple network interface card (NIC) and network switch models, preventing us from precisely capturing various network effects on the overall performance of a parallel/distributed computer system.

In this paper, we first present a gem5-based simulation infrastructure dubbed pd-gem5. To support the simulation of a computer system whose constituent nodes can be connected in any given network topology, we (1) enhance gem5 to support simulating nodes across multiple simulation hosts along with synchronization amongst the simulated nodes; (2) enhance the NIC and network switch models; and (3) validate the models. In particular, we enhance the NIC performance model to precisely capture the non-linear latency effect of handling diverse packet sizes and validate it using a physical computer system comprised of 4 nodes.
Subsequently, to support efficient simulation of a parallel/distributed computer system, we augment various features for (1) simulation precision and speed trade-offs and (2) creation/restoration of check-points. In simulating a computer system of more than 8 nodes, we demonstrate that pd-gem5 with multiple simulation hosts offers lower simulation time than with a single simulation host; pd-gem5 running on 6 simulation hosts speeds up the simulation of a 24-node computer system by up to 3.7× compared to simulating all 24 nodes on a single simulation host.

Second, we investigate a power management technique driven by the network activities of a parallel/distributed computer system as a use case of pd-gem5. More specifically, we (1) enable the Linux default governors in pd-gem5; (2) run applications based on MPI and ApacheBench; and (3) monitor the relationship amongst the network activity, processor utilization, and operating frequency changes. In these experiments, we observe that a sudden increase in network activity almost immediately leads to and/or is highly correlated with a high utilization of the processors receiving the packets. However, the default ondemand governor does not instantly increase the frequency of those processors. This is because the ondemand governor adjusts the operating frequency only periodically (e.g., every 10ms, considering the cost of changing the voltage/frequency), so it has to wait until the next adjustment period. Such a delayed response may result in violating a service level agreement (SLA), discouraging service providers from deploying an aggressive power management governor. To achieve both short response time and high energy efficiency, we propose a network-driven power management governor. Evaluated with a client/server microbenchmark (ApacheBench), it improves the response time of 95th-percentile requests by 43% compared with the default ondemand governor. It also offers almost the same response time as the performance governor, which always runs at the highest frequency, while reducing energy consumption by 8%.

The remainder of this paper is organized as follows. Section 2 describes the overall architecture and implementation details of pd-gem5. Section 3 validates the network model, demonstrates the performance scalability of pd-gem5, and discusses the trade-off between simulation precision and speedup. Section 4 describes our proposed network-driven ondemand governor and demonstrates its efficacy. Section 5 describes related work. Section 6 concludes this paper.

2. pd-gem5

Figure 1 overviews pd-gem5. A simulation host system running pd-gem5 can be comprised of one or more physical nodes connected by one or more physical network switches (Figure 1(a)). Each simulation host node typically provides multiple cores, each of which may run one or more gem5 instances (Figure 1(b)). Each gem5 instance (or process) simulates either a node (Figure 1(c)) or a network switch (Figure 1(d)).

Assume we want to model and simulate an 8-node computer system connected by one network switch in a star network topology. pd-gem5 can simulate such a computer system as follows. First, it runs 8 gem5 instances (i.e., 8 simulated nodes) on two simulation hosts ("host 0" and "host 1" in Figure 1(e)). Second, it runs 1 gem5 instance (i.e., 1 simulated network switch) on another simulation host ("host 2" in Figure 1(e)); it may dedicate a gem5 instance to simulating just the network switch to prevent that process from becoming the simulation bottleneck. In principle, pd-gem5 can model and simulate a parallel/distributed computer system in any given network topology (e.g., star, ring, and mesh topologies).

In the rest of this section we describe three key enhancements of gem5: (1) synchronizing simulated nodes, (2) modeling the network, and (3) assisting fast evaluation. These enhancements allow us to model and efficiently simulate a parallel/distributed computer system in a desired network topology.

2.1 Synchronization amongst Simulated Nodes

Although simulated nodes run on identical simulation hosts, the simulation progress (or simulated wall clock time) of each simulated node may vary considerably for various reasons; for example, because gem5 is an event-driven simulator, the diversity of simulated events across the simulated nodes leads to load imbalance across the simulation hosts. Thus, without proper synchronization amongst simulated nodes, pd-gem5 simulates a modeled parallel/distributed system non-deterministically, and on some occasions the simulation fails. To synchronize simulated nodes in pd-gem5, we implement a barrier synchronization that synchronizes each simulated node at the end of each fixed simulated time quantum (denoted by q), as illustrated in Figure 2.

Figure 1: pd-gem5 overview: (a) simulation hosts; (b) gem5 instances running on a simulation host; a gem5 instance simulating (c) a full system or (d) a network switch; and (e) an example of modeling/simulating a parallel/distributed computer system with a star network topology using 3 simulation hosts.

@ gem5 simulation script:
  while true do
    barrier.receive()          // blocking read; wait for 'start'
    m5.simulate(q)             // simulate q ticks
    barrier.send('finish')
  end while

@ barrier script:
  while true do
    for i in range(num_nodes) do
      node[i].send('start')    // release node i for the next quantum
    end for
    for i in range(num_nodes) do
      node[i].receive()        // blocking read; wait for 'finish'
    end for
  end while

Figure 2: Pseudo-code for implementing synchronization amongst simulated nodes.

Each gem5 process performs simulation for a fixed amount of simulated time (e.g., q = 10 µs) and then sends a "finish" message to the barrier, which is waiting to receive the "finish" messages. After the barrier receives "finish" messages from all gem5 processes, it sends a "start" message back to all gem5 processes, which then resume their simulations for the next time quantum. Such a synchronization technique is well known in parallel simulation; for example, the Wisconsin Wind Tunnel (WWT) uses a similar technique to simulate a multi-processor system using multi-processors [8]. As long as q is equal to or smaller than the minimum simulated communication (or network) latency between simulated nodes, pd-gem5 offers precise simulation; see Section 3.2 for detailed network latency modeling. Furthermore, pd-gem5 supports a relaxed synchronization approach in which q is a simulation parameter that trades simulation precision for simulation speed [9].

Frequent synchronization amongst simulated nodes slows down simulation. This is the case in particular when simulating a multi-core processor in a parallel/distributed manner (e.g., SlackSim), since the minimum simulated communication latency between simulated cores (the shared cache access latency) is very short (e.g., tens of simulated processor cycles). In pd-gem5, nonetheless, we expect the synchronization overhead to be reasonable, because the minimum simulated communication latency in pd-gem5 is determined by the NIC network latency (2-10 µs in simulated wall clock time); in other words, a synchronization is performed only every several thousands of simulated cycles. See Section 3.3 for more details on how pd-gem5's simulation speedup scales with the number of simulated nodes and the quantum size.
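For readers who want to prototype the barrier outside gem5, the following is a minimal Python sketch of the barrier process described above; the socket framing, port number, and message strings are our assumptions, not pd-gem5's actual implementation.

# Minimal sketch of the quantum-barrier process from Figure 2 (illustrative only;
# the transport details are assumptions, not pd-gem5's actual implementation).
import socket

NUM_NODES = 4   # number of simulated gem5 nodes (assumed)
PORT = 7000     # barrier listening port (assumed)

def run_barrier():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", PORT))
    server.listen(NUM_NODES)
    # Each gem5 process connects once at startup and keeps the connection open.
    nodes = [server.accept()[0] for _ in range(NUM_NODES)]
    while True:
        for conn in nodes:
            conn.sendall(b"start")    # release every node for one quantum q
        for conn in nodes:
            msg = conn.recv(16)       # blocking read: wait for "finish"
            if not msg:               # a node closed its connection; stop
                return

if __name__ == "__main__":
    run_barrier()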
2.2 Modeling Network

To model the network amongst simulated nodes, pd-gem5 forwards each packet generated by the simulated NIC of a simulated source node to a simulated network switch port through a TCP socket. The simulated network switch then routes the packet to the simulated NIC of the simulated destination node. Figure 3 depicts how simulated nodes are interfaced with each other through a simulated network switch. EtherTap is a built-in interface of gem5 that connects a simulated NIC (eth0) to the physical link through a TCP socket, while EtherLink connects EtherTap with a simulated NIC.

NIC: The current gem5 NIC model adopts a simple network latency model, latency_NIC(s) = l_NIC + s / b_NIC, where s is the packet size (no larger than the NIC's maximum transmission unit (MTU)), and l_NIC and b_NIC denote the NIC's fixed (inherent) latency and maximum bandwidth, respectively. In pd-gem5 we enhance the NIC model to more precisely capture the non-linear effect of handling diverse packet sizes on network latency; for example, many NICs are optimized to provide much shorter latency for small s than a simple linear model predicts. The latency of outgoing packets in pd-gem5 is instead determined by latency_NIC(s) = l_NIC(s) + s / b_NIC(s), where s is the packet size. We tune the parameters l_NIC(s) and b_NIC(s) based on empirical data measured on physical systems, as the architectural details of modern NICs are not publicly available.
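The enhanced model can be thought of as a piecewise latency function selected by packet size. The sketch below illustrates that shape; the breakpoints and values are placeholders, not the calibrated parameters used in pd-gem5.

# Illustrative piecewise NIC latency model; the table entries are placeholders,
# not the parameters calibrated for pd-gem5.
NIC_TABLE = [
    # (max packet size in bytes, fixed latency l_NIC in us, bandwidth b_NIC in Mb/s)
    (64,   10.0, 1000.0),   # small packets: short fixed latency (assumed)
    (1500, 25.0, 1000.0),   # up to the MTU: larger fixed cost (assumed)
]

def nic_latency_us(size_bytes):
    """Return latency_NIC(s) = l_NIC(s) + s / b_NIC(s) in microseconds."""
    for max_size, l_nic_us, b_nic_mbps in NIC_TABLE:
        if size_bytes <= max_size:
            return l_nic_us + (size_bytes * 8) / b_nic_mbps   # 1 Mb/s = 1 bit/us
    raise ValueError("packet larger than the MTU")

print(nic_latency_us(64), nic_latency_us(1500))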


if rt < packet.time_stamp then            // no packet enqueued at this port
  packet.time_stamp += l_switch + s/b_switch
  rt = packet.time_stamp + s/b_link
else if qs < params.queue_size then
  packet.time_stamp = rt + l_switch + s/b_switch
  rt = rt + s/b_switch
  qs++
  enqueue a "decrement qs" event at packet.time_stamp
else                                      // queue full
  drop packet
end if

Figure 4: Network switch port latency model.
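Restated as ordinary code, the port model of Figure 4 looks as follows; here rt is the release tick of the port and qs is the number of enqueued packets, and the class and parameter names are ours, not pd-gem5's.

from dataclasses import dataclass

@dataclass
class Packet:
    time_stamp: float   # simulated delivery tick accumulated so far

class SwitchPort:
    """Illustrative restatement of the Figure 4 port model (names are ours)."""
    def __init__(self, l_switch, b_switch, b_link, queue_size):
        self.l_switch = l_switch        # fixed switch latency (ticks)
        self.b_switch = b_switch        # switch bandwidth (bytes per tick)
        self.b_link = b_link            # link bandwidth (bytes per tick)
        self.queue_size = queue_size    # maximum number of enqueued packets
        self.rt = 0.0                   # release tick of the port
        self.qs = 0                     # number of enqueued packets

    def accept(self, packet, s):
        """Update packet.time_stamp for a packet of s bytes; False if dropped."""
        if self.rt < packet.time_stamp:                 # no packet enqueued
            packet.time_stamp += self.l_switch + s / self.b_switch
            self.rt = packet.time_stamp + s / self.b_link
        elif self.qs < self.queue_size:                 # queue behind earlier packets
            packet.time_stamp = self.rt + self.l_switch + s / self.b_switch
            self.rt += s / self.b_switch
            self.qs += 1
            # pd-gem5 additionally schedules a "decrement qs" event at packet.time_stamp
        else:                                           # queue full
            return False
        return True

port = SwitchPort(l_switch=4.0, b_switch=12.5, b_link=12.5, queue_size=64)
print(port.accept(Packet(time_stamp=100.0), s=1500))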

Figure 3: Interface with a network switch and other gem5 nodes.

Network Switch: To build a functional model of a network switch, we leverage gem5's EtherTap and EtherLink, a pair of which comprises one port of the network switch, as depicted in Figure 3. Figure 4 depicts the latency model of each port of the network switch, where rt is the release tick of the port and qs is the number of enqueued packets (refer to "Network Subsystem Timing" later in this section for more detail on the packet time stamp). We implement the timing model of a simulated network switch by synchronizing the network switch process with the other gem5 processes and scheduling timing events inside the network switch process. This allows pd-gem5 to model various hierarchical network topologies.

Ethernet Link: We utilize EtherLink to connect different network devices together while applying timing parameters to packets: (1) NIC → EtherTap adds the link latency (l_link) to packet.time_stamp and delivers the packet instantly; (2) EtherTap → NIC adds l_link to packet.time_stamp and delivers the packet at packet.time_stamp; (3) EtherTap → network switch delivers the packet instantly; and (4) network switch → network switch adds l_link to packet.time_stamp and delivers the packet instantly.

Network Subsystem Timing: Consider sending a packet from one simulated node to another in pd-gem5. The gem5 process simulating the source node ("gem5(0)" in Figure 5) releases the packet from its EtherTap to the physical network. After the physical network latency, the gem5 process simulating the destination node ("gem5(1)" in Figure 5) receives the packet.

Figure 5: Simulated packet forwarding between two gem5 nodes.


@ simulated source NIC:
  packet.time_stamp = current_tick() /* sender tick */ + NIC.latency(packet.size)
@ simulated source EtherLink:
  packet.time_stamp += link.latency
@ simulated network switch:
  reorder packets on each port based on time stamp
  packet.time_stamp += switch.latency(packet.size, switch.state)
@ simulated destination EtherLink:
  packet.time_stamp += link.latency
  if packet.time_stamp > current_tick() then
    enqueue a NIC event at packet.time_stamp
  else
    release the NIC event instantly or abort the simulation
  end if

Figure 6: pd-gem5 network timing.

Figure 6 summarizes pd-gem5's network timing model. As soon as a packet arrives at the EtherTap of the destination simulated node, a gem5 event is scheduled such that the packet is released to the simulated destination NIC at the "Expected Delivery Tick" (= "Sender Tick" + "Simulated Network Latency"), as shown in Figure 5. To facilitate such event scheduling, pd-gem5 appends a time stamp to each simulated Ethernet packet. As the simulated packet is propagated through and processed by each simulated network stage (i.e., source NIC, network switch, EtherLink, and destination NIC), the time stamp is increased by simulated latency values that are functions of the size of the packet and the current (congestion or queuing) state of each simulated network stage, as depicted in Figure 4.

In Section 2.1, q is set to be equal to or smaller than the minimum simulated network latency, which is typically orders of magnitude greater than the physical network latency of sending/receiving a TCP packet in physical wall clock time. Thus, a packet typically arrives well before its "Expected Delivery Tick" and can be scheduled to be released after at least one synchronization has been performed. On rare occasions, however, a packet may not be delivered before its "Expected Delivery Tick"; for example, the destination node may advance its simulation very quickly due to a lack of simulation events while the packet experiences an unusually long physical network latency. In that case, pd-gem5 delivers the packet immediately to the simulated destination NIC, as depicted in Figure 6. Such an event is logged as an exception, and the user can decide whether or not pd-gem5 aborts the simulation. Lastly, q could be adapted dynamically to the current state of the physical network to improve simulation speed without impacting simulation precision, but we leave this as future work.

Network Topology Exploration: The network topology describes how nodes are interconnected and is an important component in evaluating a parallel/distributed computer system. pd-gem5 does not provide analytical performance models for various network topologies; instead, users instantiate the network building blocks (network switch and EtherLink) and interconnect them to form a given topology. For example, to model a simple two-level binary tree topology with four nodes, we can instantiate three network switches: one at the root and the remaining two connected to the four leaf nodes. We can adjust the bandwidth parameters of the network switches to model a fat tree instead of a simple binary tree. In a similar fashion, one can create a hierarchy of an appropriate number of network switch instances and configure the network switch parameters to model the topology of interest.
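To make the binary-tree example concrete, a topology along these lines could be assembled as in the following sketch; Switch, Node, and Link are stand-ins for pd-gem5's network switch and EtherLink building blocks, and the class and parameter names are illustrative rather than pd-gem5's actual API.

# Illustrative two-level binary-tree topology built from generic building blocks
# (Switch/Node/Link stand in for pd-gem5's network switch and EtherLink).
class Switch:
    def __init__(self, name, bandwidth_mbps):
        self.name, self.bandwidth_mbps, self.links = name, bandwidth_mbps, []

class Node:
    def __init__(self, name):
        self.name, self.links = name, []

class Link:
    def __init__(self, a, b, latency_us):
        self.ends, self.latency_us = (a, b), latency_us
        a.links.append(self)
        b.links.append(self)

root = Switch("root", bandwidth_mbps=100)
leaves = [Switch("leaf0", bandwidth_mbps=100), Switch("leaf1", bandwidth_mbps=100)]
nodes = [Node("node%d" % i) for i in range(4)]

links = [Link(root, leaf, latency_us=2) for leaf in leaves]                 # root to leaves
links += [Link(leaves[i // 2], nodes[i], latency_us=2) for i in range(4)]   # leaves to nodes
# Raising the bandwidth of the root switch (or of the root-leaf links) turns this
# simple binary tree into a fat tree.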


Table 1: Processor configurations.

  O3 core            Number of cores: 4; core frequency: 3.2 GHz; system bus frequency: 1.2 GHz;
                     superscalar: 4-way; integer/FP ALUs: 3/2; ROB/IQ/LQ/SQ entries: 128/36/48/32;
                     BP: bi-mode
  Memory hierarchy   L1I/L1D/L2 size (KB): 64/32/2048; L1I/L1D/L2 associativity: 2/4/16;
                     DRAM: 4GB DDR3-1600
  Network            NIC: Intel 82574GI Gigabit; link: 100Mbps, 2-10µs latency
  OS                 Linux Ubuntu 14.04

2.3 Assisting Fast Evaluation

We often run detailed full-system simulation only for ROIs. Thus, pd-gem5 preserves all the gem5 features that assist fast performance evaluation, such as sampling and check-pointing. In simulating a parallel/distributed computer system, creating and restoring a check-point presents unique challenges: (1) all the simulated nodes need to write the check-point at the same simulated time; (2) the network state at the check-point must be preserved; (3) all the simulated nodes must be networked together before restoring from a check-point; and (4) when restoring from a check-point, all the simulated nodes must resume the simulation at the same simulated time point.

To tackle these challenges, we leverage the barrier mechanism described in Section 2.1. Whenever one of the gem5 processes encounters an m5-checkpoint pseudo-instruction, it sends a "recv-ckpt" signal to the barrier process; the barrier process then sends a "take-ckpt" signal to all the simulated nodes (including the node that encountered m5-checkpoint) at the end of the current simulation quantum. On receiving the "take-ckpt" signal, the gem5 processes start dumping check-points, so every simulated node dumps its check-point at the same simulated time point. To preserve the simulated network state, we ensure that all in-flight packets have reached and are buffered at an EtherTap before the check-points are dumped.
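Under the same transport assumptions as the barrier sketch in Section 2.1, the check-point handshake can be layered on top of the barrier loop as in the sketch below; only the "recv-ckpt" and "take-ckpt" message names come from the text, and everything else is illustrative.

# Illustrative extension of one barrier quantum with the check-point handshake
# ("recv-ckpt"/"take-ckpt"); the message framing is an assumption.
def barrier_quantum(nodes):
    """Run one synchronization quantum and coordinate check-points."""
    for conn in nodes:
        conn.sendall(b"start")              # release every node for q ticks
    ckpt_requested = False
    for conn in nodes:
        msg = conn.recv(16)                 # "finish", or "recv-ckpt" if the node
        if b"recv-ckpt" in msg:             # hit an m5-checkpoint pseudo-instruction
            ckpt_requested = True
    if ckpt_requested:
        for conn in nodes:
            conn.sendall(b"take-ckpt")      # every node dumps its check-point at
                                            # the same simulated time (quantum end)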

3. Network Model Validation and Simulation Speedup Evaluation

3.1 Methodology

Simulation Host: To run pd-gem5 simulations for both validation and evaluation, we use simulation hosts, each of which consists of a quad-core Intel Xeon E3-1230 processor, two 8GB DDR3-1600 DIMMs, and an Intel 82580 Gigabit Ethernet NIC.

Validation: Since we mainly enhance the network subsystem of gem5, we validate pd-gem5 focusing on the network aspects (i.e., bandwidth and latency). To validate the network subsystem model of pd-gem5, we assemble a 4-node parallel/distributed computer system, where each physical node is comprised of a quad-core AMD A10-5800K, one 8GB DDR3-1600 DIMM, and a Realtek PCIe Gigabit Ethernet NIC. The 4 nodes are connected to an Allied Telesyn AT-FS705L 100Mbps Ethernet network switch to establish a star network topology. Subsequently, we configure pd-gem5 using the parameters tabulated in Table 1 to model the assembled 4-node parallel/distributed computer system. Although the gem5 NIC functional model we choose is based on an Intel 82574GI NIC, we tune the parameters of the pd-gem5 network subsystem models to mimic the Realtek PCIe Gigabit Ethernet NIC and the Allied Telesyn 100Mbps Ethernet switch, using network packet traces measured on the assembled 4-node parallel/distributed computer system. To measure the latency and bandwidth of the network subsystem, we run tcptest [10] and the netperf TCP_STREAM, TCP_MAERTS, and UDP_STREAM microbenchmarks [11].

Simulation Speedup: To evaluate the simulation speedup, we model 2- to 24-node parallel/distributed computer systems with a star network topology using pd-gem5. Then we run the MPI implementation of the NAS parallel benchmark suite comprised of five benchmarks (i.e., integer sort (IS), embarrassingly parallel (EP), conjugate gradient (CG), multi-grid on a sequence of meshes (MG), and discrete fast Fourier transform (FT)) [12], and the Hadoop/MapReduce implementation of the DCBench "basic operation" suite (i.e., grep, sort, and wordcount) with an input dataset size of 32MB × the number of simulated nodes [13]. We set up pd-gem5 such that it can run Hadoop in fully distributed mode, i.e., it runs both the Hadoop Distributed File System (HDFS) and the MapReduce engine, and thus the MapReduce benchmarks. We use the following HDFS and MapReduce configuration parameters: dfs.replication (= 1), dfs.heartbeat.interval (= 100ms), and mapred.max.split.size (= 32MB); we use the default values for the other parameters. Lastly, we design a microbenchmark that balances the load across all simulated nodes by having every node perform memory-intensive operations; specifically, all simulated nodes start to format the HDFS at the same time.

3.2 Network Subsystem Model Validation

In this section, we compare the measured bandwidth and latency of the 4-node AMD A10-5800K computer system with those of pd-gem5 configured to match the 4-node physical computer system.


Figure 7(a) demonstrates that pd-gem5's network subsystem model provides a non-linear latency trend similar to the physical computer system, while the current gem5 network subsystem model yields a linear latency trend. To measure the round-trip latency of TCP packets of various sizes, we run tcptest 100 times; tcptest is a client-server application that measures the round-trip latency by sending and receiving several fixed-size packets between client and server processes. When the packet size is equal to or smaller than 64 bytes, the current gem5 network subsystem model gives 34% higher latency while pd-gem5's network subsystem model provides 6% lower latency on average. When the packet size is larger than 64 bytes, the current gem5 network subsystem model gives 40% lower latency while pd-gem5's network subsystem model provides 5% lower latency on average. Part of this inaccuracy comes from the fact that measuring latency with tcptest involves diverse layers of the network stack; thus, imperfections in the microarchitectural modeling of the underlying system affect both the pd-gem5 and gem5 results.

Figure 7(b) compares the measured bandwidth of the physical system with the simulated bandwidth of pd-gem5 after running microbenchmarks from tcptest and netperf; TCP_S, TCP_M, and UDP_S denote TCP_STREAM, TCP_MAERTS, and UDP_STREAM. In our physical system setup, the network switch is the bandwidth bottleneck; therefore, the measured bandwidth is the same regardless of the testing technique (TCP_S, TCP_M, tcptest) or packet type (UDP_S) used. Overall, pd-gem5 exhibits 9% lower geo-mean bandwidth than the physical system, while the default gem5 model exhibits 11% lower geo-mean bandwidth.

Figure 7(c) plots the latency versus packet rate of the physical computer system and pd-gem5. For both the physical and simulated computer systems, we use hping3 [14] to send Internet Control Message Protocol (ICMP) packets at various rates. To minimize the effect of the various networking layers on the latency measurement in physical hardware, tcpdump [15] is used to measure the round-trip latency of packets as close as possible to the physical layer. Each data point in Figure 7(c) is the average latency of packets sent at a specified rate over a fixed period of time (20ms). The gem5 network subsystem model utilizes EtherLink to regulate the sending rate of each associated network device by keeping the link busy for a certain amount of time (= s/b_NIC). This simple approach is sufficient for modeling the aggregate bandwidth between two nodes (as shown in Figure 7(b)); however, it cannot precisely capture various aspects of network timing, such as congestion and queuing latency. In Figure 7(c) we see that pd-gem5 follows the same trend as the physical computer system, but the gem5 network subsystem model is unable to model the non-linear latency trend caused by congestion in the network switch. For packet rates of 20-70Mbps, pd-gem5 exhibits 8% error in packet latency on average. Near 80Mbps, gem5 gives 80% longer latency than the physical system while pd-gem5 gives only 1%-3% longer latency.

3.3 Simulation Speedup Evaluation

Figure 8 demonstrates the simulation speedup of pd-gem5 running on multiple simulation hosts over pd-gem5 running on a single simulation host. For each benchmark and number of simulated nodes, we run two experiments.
First, we run two gem5 instances per core, since the Intel processors support two threads per core through Hyper-Threading (denoted by "thread"). Second, we run one gem5 instance per core (denoted by "core"). The number of simulation hosts (N_sim_host) is the number of simulated nodes (N_sim_node) divided by the number of cores (or threads) per simulation host. For example, N_sim_host is 2 (1), 4 (2), and 6 (3) for N_sim_node = 8, 16, and 24, since the number of cores (threads) per simulation host is 4 (8); we dedicate one additional simulation host to the network switch. To obtain the speedup, the simulation time when using multiple simulation hosts is normalized to that of a single simulation host running all the simulated nodes.

Figure 7: Comparison of (a) latency vs. packet size, (b) bandwidth, and (c) latency vs. packet rate amongst the physical system, pd-gem5, and gem5.

Figure 8: pd-gem5 speedup over using a single simulation host running all the simulated nodes.

pd-gem5 running the NAS benchmarks on 2 (1), 4 (2), and 6 (3) simulation hosts, i.e., running one gem5 instance (two gem5 instances) per core, offers 1.6× (1.1×), 2.6× (1.8×), and 3.7× (2.5×) higher geo-mean performance for 8, 16, and 24 simulated nodes than running them on a single simulation host. Running one gem5 instance per core offers 42%-51% higher speedup than running two gem5 instances per core.

Note that pd-gem5's simulation speedup also depends on the characteristics of a given program. If a significant fraction of a program is serial, or the program exhibits load imbalance, then the degree of parallelism available when simulating that program is low. For example, DCBench exhibits some load imbalance amongst simulated nodes due to the dead times between task assignments in the Hadoop jobtracker; hence, we observe that some of the gem5 processes are idle while simulating these benchmarks, which leads to lower speedup than expected. To test this hypothesis, we run the memory-intensive microbenchmark described in Section 3.1, which balances the load across all the simulated nodes; we see that the speedup of the distributed simulation of a 16-node computer system is significantly higher than the geo-mean speedup of the NAS benchmark suite. The microbenchmark fails to complete in a reasonable amount of time when a single simulation host attempts to simulate a 24-node computer system because the simulation is extremely slow; thus, we exclude it when computing the geo-mean speedup.

Figure 9 shows the impact of q and of the number of simulated nodes per physical node on simulation speed and precision (in terms of the percentage of straggler packets). pd-gem5, with the link latency set to 2µs, simulates a 16-node computer system running the NAS benchmark suite with 1 and 2 gem5 nodes per core (= 4 and 2 simulation hosts), respectively. With q = 2µs we do not observe any straggler packets regardless of the number of gem5 nodes per core. pd-gem5 does not show a significant improvement in simulation speed when the quantum size is increased, because we simulate a small-scale parallel/distributed computer system: with a quantum size of 2µs, simulating 2s of a program involves only 1M synchronization points, and with an average cost of 200µs per synchronization point in our setup, the synchronization overhead is approximately 200s, which is small in comparison to the total simulation time. Further, in the benchmarks that we run, all simulated nodes operate under similar load, which limits the improvement in simulation time that could come from allowing greater slack (a larger quantum size) [9]. Compared with using 2 gem5 nodes per core with q = 2µs, using 1 and 2 gem5 nodes per core with q = 100µs improves the simulation speed by 1.7× and 1.2× while leading to 22% and 20% straggler packets, respectively.

Figure 9: Speedup and the number of straggler packets for various synchronization time quantum (q) values: 2 simulated gem5 nodes per core (left) and 1 simulated gem5 node per core (right).

Table 2: Impact of q on simulation precision.

  q        Δ sim_sec   Δ sim_inst   Δ sim_pkts   stragglers
  10 µs    0.02%       0.7%         0.3%         0.22%
  20 µs    1.9%        14.0%        2.4%         2.18%
  40 µs    6.3%        12.7%        1.1%         7.99%
  100 µs   2.3%        25.7%        1.6%         21.5%

Lastly, Table 2 summarizes the impact of q on the simulated execution time, instructions, packets received, and straggler packets. We collect these values from the previously simulated 24-node computer system using 1 gem5 node per core. For example, to evaluate the disparity in simulated instructions between q = 10µs and q = 2µs, we aggregate the simulated instructions of all the simulated nodes for q = 10µs and q = 2µs, respectively, and then use these aggregates to obtain the (average) disparity. We see that increasing q has a non-negligible impact on key simulation statistics.

4. Network-driven Power Management

In this section, as a use case of pd-gem5, we investigate a power management technique for parallel/distributed computer systems, hypothesizing that network activity can significantly inform power management decisions. More specifically, we first enable the Linux default ondemand governor in pd-gem5 and run various applications based on MPI and Hadoop. Second, we develop intuitions for a network-driven power management technique after monitoring the relationship amongst the network activity, processor utilization, and operating frequency changes. Third, we propose and implement a network-driven power management technique that augments the current ondemand governor. Lastly, we demonstrate that the proposed power management technique provides power-efficient computing while minimizing the negative impact on response time.

4.1 Methodology

We enable the DVFS capability of each simulated node in pd-gem5 by leveraging previously proposed methodology [16,17,18]. More specifically, we enable the DVFS controller in pd-gem5, which actually changes the operating frequency of the simulated processors. Through communication between the DVFS controller and the modified cpufreq driver of the Linux kernel, the default ondemand governor can control the current P-state of the simulated processors. We model a parallel/distributed computer system comprised of four slaves and one master using pd-gem5. For each node, we model a quad-core processor capable of per-core DVFS with 15 P-states (i.e., frequencies) ranging from 0.8GHz to 3.1GHz. We evaluate ApacheBench [19] and two of the NAS parallel benchmarks (MG and IS) to demonstrate the correlation between network activity and processor utilization under the default ondemand governor in Section 4.2, and to evaluate our proposed network-driven ondemand governor in Section 4.3. Each slave runs ApacheBench, sending network packets to the master. To generate multiple bursts of network traffic, each slave runs ApacheBench for a certain interval after the previous node has finished.
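For intuition about the governor behavior studied in the next subsections, the periodic decision of an ondemand-style governor can be modeled in a few lines. The sketch below uses the 15 P-states (0.8-3.1 GHz) and the 20 ms period from our setup, but the utilization threshold and scaling rule are simplifications of ours, not the Linux cpufreq implementation.

# Toy model of a periodic ondemand-style decision: 15 P-states between 0.8 and
# 3.1 GHz and a 20 ms period. The threshold and scaling rule are simplifications,
# not the Linux cpufreq implementation.
P_STATES_GHZ = [round(0.8 + i * (3.1 - 0.8) / 14, 3) for i in range(15)]
UP_THRESHOLD = 0.80     # jump to the top P-state above this utilization (assumed)
PERIOD_MS = 20

def next_frequency(utilization):
    """Pick the frequency of the next period from the last period's utilization."""
    if utilization >= UP_THRESHOLD:
        return P_STATES_GHZ[-1]                       # boost to the highest P-state
    target = utilization * P_STATES_GHZ[-1] / UP_THRESHOLD
    candidates = [p for p in P_STATES_GHZ if p >= target]
    return candidates[0] if candidates else P_STATES_GHZ[-1]

# An idle stretch followed by a burst: the frequency rises only in the period after
# the utilization does, which is the delayed response discussed in Section 4.2.
for util in [0.05, 0.10, 0.95, 0.90, 0.30]:
    print("next %d ms period: util=%.2f -> %.2f GHz" % (PERIOD_MS, util, next_frequency(util)))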
4.2 DVFS Behavior of a Parallel/Distributed Computer System

To demonstrate the correlation between network activity and core utilization, Figure 10(top) plots the operating frequency determined by the default ondemand governor with the default DVFS period of 20ms, together with the utilization of the integer functional units (FUs) and the network bandwidth (denoted by INT and NICBW, respectively) sampled every 1ms; the other cores show a similar trend. Since the evaluated benchmarks are integer-intensive with negligible floating-point operations, we show the integer FU utilization as a proxy for core utilization; the default ondemand governor decides the frequency of the next 20ms DVFS period by computing the core utilization of the past 20ms DVFS period according to a certain formula. In these experiments, each core has three integer FUs, so 100% integer FU utilization means that all three integer FUs are fully utilized.

In Figure 10(top) we can observe that a burst of network traffic is accompanied by high integer FU utilization. In ApacheBench, a burst of network requests leads to high integer FU utilization in the master node, since the master handles the burst, which is intuitive. In IS and MG, the network traffic is due to communication between parallel threads and implies that the main parallel loop has started to run, which demands a high core operating frequency for high performance.



Figure 10: The operating frequency and integer functional unit utilization of a core, and network bandwidth of ApacheBench, MG, and IS with the default (top) and network-aware (bottom) ondemand governors.

We also observe that the default ondemand governor does not react in a timely manner to the sudden increase in network traffic for ApacheBench, because it is implemented to change the operating frequency only once per period, where the period is long enough to amortize the cost of changing the frequency; a typical DVFS period is 10-20ms because the cost of invoking the ondemand governor and changing the frequency (i.e., tens of µs in commercial processors [20]) needs to be amortized. Although the default ondemand governor does increase the operating frequency of the core to handle the network traffic, it does so only after detecting high core utilization in the previous DVFS period. Thus, if the previous DVFS period exhibits low core utilization, there is a significant delay in increasing the operating frequency (up to the DVFS period set by the governor), hurting the response time of service requests. This in turn discourages service providers from using the default ondemand governor due to potential violations of service level agreements (SLAs). Consequently, service providers typically tend to use the performance governor, which sets the operating frequency to the highest value (the P0 state) regardless of the core utilization, at the cost of high power consumption.

4.3 Network-driven ondemand Governor

Based on the observations in Section 4.2, we propose a network-driven ondemand governor, which maximizes the core frequency when a burst of network traffic is detected. By immediately boosting the core frequency, our governor reacts in a timely manner to a sudden increase in network traffic that demands high core frequency. This boosting mechanism compensates for the shortcoming of the default ondemand governor by avoiding the delayed adjustment of the operating frequency.

To detect a burst of network traffic, we implement a network-traffic monitor in the NIC. When the network-traffic monitor detects that the rate of incoming network traffic exceeds a specified threshold, it generates an interrupt to inform the governor. We modify the NIC interrupt handler to handle the proposed interrupt type, dubbed PDGEM5INT. As soon as the interrupt is received, the network-driven ondemand governor maximizes the core frequency. To avoid unnecessarily frequent PDGEM5INT interrupts, the network-traffic monitor suppresses interrupt generation for a while after an interrupt is triggered.

We jointly employ two policies to trigger the PDGEM5INT interrupt: NaiveP and FineP. Figure 11 describes the pseudo-code for NaiveP and FineP. NaiveP triggers the interrupt when the monitored rate of the incoming network traffic exceeds a specified threshold (i.e., TH_HIGH). Although NaiveP is effective in detecting a sudden increase in incoming network traffic that exceeds the threshold, it fails to detect several small bursts that do not exceed the threshold but nonetheless lead to, or strongly correlate with, high core utilization.

Initialize the threshold and counter values used to detect bursts of network traffic.

check_PDGEM5INT():   /* called every 200µs */
  traffic = get_traffic()
  if traffic >= TH_HIGH then                  /* NaiveP */
    trigger PDGEM5INT and disable PDGEM5INT for 20 ms
  else if traffic >= TH_LOW then              /* FineP */
    counter = counter + 1
    if counter == TH_LOW_COUNT then
      trigger PDGEM5INT and disable PDGEM5INT for 20 ms
      counter = 0
    end if
  end if

get_traffic():
  return the incoming network traffic rate over the last 200µs

Figure 11: Pseudo-code to trigger PDGEM5INT.

Thus, we also propose FineP, which defines another threshold and a counter (TH_LOW and TH_LOW_COUNT) to detect multiple small bursts of incoming network traffic. FineP increments the counter when the monitored rate is greater than TH_LOW and less than TH_HIGH; if the counter reaches TH_LOW_COUNT, FineP triggers PDGEM5INT.

Figure 10(bottom) shows the operating frequency changes made by our proposed network-driven ondemand governor along with the integer FU and network utilizations. While the default ondemand governor takes time to reach the increased frequency, our network-driven governor maximizes the operating frequency as soon as the network-traffic monitor detects the burst of network traffic using the algorithm described in Figure 11. Such fast reaction improves the response/execution time of requests arriving at the early stage of a burst.

Figure 12 compares our network-driven governor with the default ondemand and performance governors in terms of response time for ApacheBench. Figure 12 shows the distribution of response times, reporting request percentile versus response time for each run; for example, (50%, 3ms) means that 50% of the requests are serviced within 3ms. Our network-driven ondemand governor reduces the response time of the 50th, 95th, and 99th percentile requests by 33%, 43%, and 60%, respectively. Furthermore, it provides almost the same response times as the performance governor at most percentile points, while reducing energy consumption by 8% compared with the performance governor. In summary, due to its fast reaction to network requests, the network-driven ondemand governor provides response times as fast as the performance governor, which is critical for satisfying SLAs, while notably reducing energy consumption.
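For completeness, the trigger logic of Figure 11 can be restated as ordinary code. In the sketch below, the threshold values are placeholders, and the caller-supplied traffic rate and trigger_pdgem5int callback stand in for the NIC traffic monitor and the PDGEM5INT interrupt.

# Restatement of the Figure 11 trigger policies (NaiveP and FineP). The threshold
# values are placeholders; the caller supplies the monitored traffic rate and a
# callback that stands in for raising PDGEM5INT.
TH_HIGH = 50.0        # Mb/s over one 200 us window (assumed value)
TH_LOW = 10.0         # Mb/s over one 200 us window (assumed value)
TH_LOW_COUNT = 5      # small bursts required before triggering (assumed value)
HOLD_OFF_MS = 20.0    # suppress further interrupts for this long after a trigger

counter = 0
hold_off_until_ms = 0.0

def check_pdgem5int(now_ms, traffic_mbps, trigger_pdgem5int):
    """Called every 200 us with the incoming traffic rate of the last window."""
    global counter, hold_off_until_ms
    if now_ms < hold_off_until_ms:
        return                                      # PDGEM5INT currently disabled
    if traffic_mbps >= TH_HIGH:                     # NaiveP: one large burst
        trigger_pdgem5int()
        hold_off_until_ms = now_ms + HOLD_OFF_MS
    elif traffic_mbps >= TH_LOW:                    # FineP: several small bursts
        counter += 1
        if counter == TH_LOW_COUNT:
            trigger_pdgem5int()
            hold_off_until_ms = now_ms + HOLD_OFF_MS
            counter = 0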

5. Related Work

SST (Structural Simulation Toolkit) [21] is "an open-source, multi-scale, and parallel architectural simulator aimed for the design and evaluation of HPC systems." SST is an event-driven simulation model built upon MPI to support efficient parallel/distributed simulations. SST is also a component-based simulation model integrating various computer node components. For example, gem5 is integrated into SST as a component, and parallel execution of gem5 instances is enabled. To support large HPC network experiments, a memory-mapped device (translator) in gem5 is designed to interact with detailed HPC NIC and router models (i.e., the Portals NIC [22] and SeaStar router [23]) in SST. We, however, identify several challenges in using SST with gem5. First, as SST has to use a translator to interact with node components in gem5, it suffers from compatibility issues with updated versions of gem5. In contrast, pd-gem5 is built upon the original gem5 and allows users to install enhanced features as patches, automatically resolving such compatibility and future maintenance issues. Furthermore, SST is not validated as a multi-component simulator; it relies on the validation of components in isolation. pd-gem5, on the other hand, is validated against a physical parallel/distributed system.

COTSon is a full-system simulator for multi-core processors and scale-out clusters [24]. COTSon combines individual node simulators to form a cluster simulator. It jointly uses a fast functional emulator (i.e., AMD's SimNow [25]) and timing models to explore the trade-off between simulation precision and speed. COTSon supports (i) a dynamic sampling feature, which tracks SimNow's simulation statistics to identify the phases of the simulated workload and enables/disables the timing models based on these phase changes, and (ii) an adaptive quantum synchronization feature to support the trade-off between simulation precision and speed.

Figure 12: ApacheBench response time distribution.

A key advantage of COTSon is its speed, as it uses a proprietary x86 ISA emulator for fast forwarding. However, this is also a key disadvantage, since it can only support the x86 ISA; further, COTSon is not an open-source tool. In contrast, we build our infrastructure on the publicly available gem5, which supports both the x86 and ARM ISAs; supporting the ARM ISA is critical given the growing interest in using ARM processors for servers.

MARSSx86 is a cycle-level full-system simulator, specifically for multi-core x86-64 architectures [6]. Similar to COTSon, it takes a hybrid simulation methodology, leveraging QEMU for emulation. MARSSx86 supports only a functional NIC model. In contrast, gem5's NIC model is event-based and can be more easily adapted to precisely model the performance aspects of NICs [17].

The Wisconsin Wind Tunnel (WWT) was developed to simulate a multi-processor system using multi-processors [8]. WWT adopts a quantum-based technique for synchronization amongst multiple processors and uses a time quantum value for precise simulation. pd-gem5 uses the same technique to synchronize simulated nodes across multiple cores and nodes. SlackSim implements bounded-slack synchronization amongst cores simulated across multiple physical cores to improve simulation speedup [9]. pd-gem5 allows relaxed synchronization, where the time quantum is a simulation parameter that trades simulation speed against simulation precision. Parallel Mambo is a multi-threaded implementation of IBM's full-system simulator that accelerates the simulation of a PowerPC-based system [26]. Unlike pd-gem5, which supports simulation of multiple full-system nodes using multiple simulation hosts, Parallel Mambo can simulate a system using only a single (multi-core) simulation host.

Graphite is a parallel/distributed simulator for many-core processors [27]. It allows a user to distribute the execution of simulated cores across multiple nodes. A key advantage is that it uses a dynamic binary translator to directly execute the simulated code on the native machines for fast functional simulation. However, this advantage becomes a key disadvantage when users want to evaluate other processor architectures, such as ARM, while they have only x86-based nodes; currently, most affordable nodes used for academic research are based on the x86 ISA. Furthermore, Graphite is not a full-system simulator, so it cannot simulate complex workloads that need OS support. Lastly, it is not intended to be completely cycle-accurate, relying on a scalable synchronization mechanism (LaxP2P) based on periodic, random, point-to-point synchronization between target tiles. ZSim is a fast, parallel microarchitectural simulator for many-core simulation [28]. Similar to Graphite, it uses a binary translation technique to reduce the overhead of conventional cycle-driven core models. Its key advantage over Graphite is that it implements user-level virtualization to support complex workloads without requiring full-system simulation.

Lastly, BigHouse is a simulation infrastructure for datacenter systems [29]. It uses a combination of queuing theory and stochastic analytical models to quickly simulate servers. Instead of application binaries, it uses empirically measured distributions of the arrival and service times of tasks in the system.
It is neither appropriate for studies that require microarchitectural details nor suited for some high-level studies.

6. Conclusion

The importance of efficiently running applications on parallel/distributed computer systems has continued to increase, yet our community lacks an open-source full-system simulation infrastructure to study such systems in detail. Responding to this need, we present pd-gem5, a gem5-based infrastructure that can model and simulate a parallel/distributed computer system using multiple simulation hosts. More specifically, we (1) enhance the NIC performance model; (2) develop network switch functional and performance models; (3) integrate the models with gem5; and (4) validate the models. Running MPI- and Hadoop-based parallel/distributed benchmarks, we show that pd-gem5 running on 6 simulation hosts speeds up the simulation of a 24-node computer system by up to 3.7× compared to simulating all 24 nodes on a single simulation host. Subsequently, as a use case of pd-gem5, we develop a network-driven ondemand governor after observing that a sudden increase in network activity immediately leads to and/or is strongly correlated with high utilization of the processors receiving the packets. We demonstrate that our network-driven ondemand governor reduces the response time of 95th-percentile requests by 43% compared with the default ondemand governor. It also offers almost the same response times as the performance governor while reducing energy consumption by 8%.

References

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008.
[2] R. L. Graham, T. S. Woodall, and J. M. Squyres, "Open MPI: A Flexible High Performance MPI," in International Conference on Parallel Processing and Applied Mathematics (PPAM), 2005.
[3] J. F. Shoch, Y. K. Dalal, D. D. Redell, and R. C. Crane, "Evolution of the Ethernet Local Computer Network," IEEE Computer, vol. 15, no. 8, pp. 10-27, Aug. 1982.
[4] RapidIO Trade Association. (2002, June) RapidIO Interconnect Specification Documentation Overview. [Online]. http://www.rapidio.org/rapidio-specifications
[5] InfiniBand Trade Association. [Online]. http://www.infinibandta.org/
[6] MARSSx86 - Micro-ARchitectural and System Simulator for x86-based Systems. [Online]. http://marss86.org/~marss86/index.php/Home
[7] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 Simulator," ACM SIGARCH Computer Architecture News, 2011.
[8] S. K. Reinhardt, M. D. Hill, J. R. Larus, J. C. Lewis, and D. A. Wood, "The Wisconsin Wind Tunnel: Virtual Prototyping of Parallel Computers," in ACM SIGMETRICS, Oct. 1993, pp. 12-20.
[9] J. Chen, M. Annavaram, and M. Dubois, "SlackSim: A Platform for Parallel Simulations of CMPs on CMPs," ACM SIGARCH Computer Architecture News, vol. 37, no. 2, pp. 20-29, May 2009.
[10] tcptest. [Online]. http://hpcbench.sourceforge.net/tcp.html
[11] Hewlett-Packard Company. Netperf: A Network Performance Benchmark. [Online]. http://www.netperf.org/netperf/NetperfPage.htm
[12] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, "The NAS Parallel Benchmarks," Int. J. High Perform. Comput. Appl., vol. 5, no. 3, pp. 63-73, Sep. 1991.
[13] Z. Jia, L. Wang, J. Zhang, L. Zhan, and C. Luo, "Characterizing Data Analysis Workloads in Data Centers," in IEEE Int. Symp. on Workload Characterization (IISWC), 2013, pp. 66-76.
[14] hping3. [Online]. http://linux.die.net/man/8/hping3
[15] tcpdump. [Online]. http://www.tcpdump.org/
[16] [Online]. http://www.it.uu.se/katalog/vassp447/gem5_dvfs.pdf
[17] [Online]. http://www.linux-arm.org/git?p=linux-linaro-tracking-gem5.git;a=summary
[18] [Online]. http://www.acpi.info/
[19] ab - Apache HTTP Server Benchmarking Tool. [Online]. https://httpd.apache.org/docs/2.2/programs/ab.html
[20] J.-T. Wamhoff, S. Diestelhorst, C. Fetzer, P. Marlier, P. Felber, and D. Dice, "The TURBO Diaries: Application-controlled Frequency Scaling Explained," in USENIX Annual Technical Conference (USENIX ATC), 2014.
[21] M. Hsieh, K. Pedretti, J. Meng, A. Coskun, M. Levenhagen, and A. Rodrigues, "SST + gem5 = a Scalable Simulation Infrastructure for High Performance Computing," in International Conference on Simulation Tools and Techniques (ICST), 2012, pp. 196-201.
[22] R. Brightwell, R. Riesen, B. Lawry, and A. B. Maccabe, "Portals 3.0: Protocol Building Blocks for Low Overhead Communication," in International Parallel and Distributed Processing Symposium (IPDPS), 2002.
[23] R. Brightwell, K. T. Pedretti, K. D. Underwood, and T. Hudson, "SeaStar Interconnect: Balanced Bandwidth for Scalable Performance," IEEE Micro, vol. 26, no. 3, pp. 41-57, May 2006.


[24] F. Ryckbosch, S. Polfliet, and L. Eeckhout, "Fast, Accurate, and Validated Full-System Software Simulation of x86 Hardware," IEEE Micro, vol. 30, no. 6, pp. 45-56, Nov. 2010.
[25] AMD. SimNow Simulator. [Online]. http://developer.amd.com/tools-and-sdks/cpu-development/simnow-simulator/
[26] K. Wang, Y. Zhang, Y. Wang, and X. Shen, "Parallelization of IBM Mambo System Simulator in Functional Modes," ACM SIGOPS Operating Systems Review, vol. 42, no. 1, pp. 71-76, Jan. 2008.
[27] J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal, "Graphite: A Distributed Parallel Simulator for Multicores," in International Symposium on High Performance Computer Architecture (HPCA), 2010.
[28] D. Sanchez and C. Kozyrakis, "ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-core Systems," in International Symposium on Computer Architecture (ISCA), 2013, pp. 475-486.
[29] D. Meisner, J. Wu, and T. F. Wenisch, "BigHouse: A Simulation Infrastructure for Data Center Systems," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2012, pp. 35-45.
[30] B. Rountree, D. K. Lowenthal, B. R. de Supinski, M. Schulz, V. W. Freeh, and T. Bletsch, "Adagio: Making DVS Practical for Complex HPC Applications," in ACM Int. Conf. on Supercomputing (ICS), 2009, pp. 1-10.
[31] C. Bienia and K. Li, "PARSEC 2.0: A New Benchmark Suite for Chip-Multiprocessors," in Workshop on Modeling, Benchmarking and Simulation, 2009.
[32] Intel Corporation. VRM/DVRD 11.1 Design Guide. [Online]. www.intel.com/assets/PDF/designguide/321736.pdf
[33] H. R. Ghasemi, A. A. Sinkar, M. J. Schulte, and N. S. Kim, "Cost-effective Power Delivery to Support Per-core Voltage Domains for Power-constrained Processors," in IEEE Design Automation Conf. (DAC), 2012, pp. 56-61.
[34] McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. [Online]. http://www.hpl.hp.com/research/mcpat
[35] N. Kurd, J. Douglas, P. Mosalikanti, and R. Kumar, "Next Generation Intel Micro-architecture (Nehalem) Clocking Architecture," in IEEE Int. Symp. on VLSI Circuits, 2008, pp. 62-63.
[36] K. Aygun, M. J. Hill, K. Eilert, K. Radhakrishnan, and A. Levin, "Power Delivery for High-performance Microprocessors," Intel Technology J., vol. 9, no. 4, pp. 273-283, Nov. 2005.
[37] Microsoft. [Online]. http://msdn.microsoft.com/en-us/windows/hardware/gg463252.aspx
[38] J. Shun, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, A. Kyrola, H. V. Simhadri, and K. Tangwongsan, "Brief Announcement: The Problem Based Benchmark Suite," in ACM Symp. on Parallelism in Algorithms and Architectures (SPAA), 2012.
[39] [Online]. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2008/10/24594_APM_v3.pdf
[40] Experimenting with DVFS. [Online]. http://www.m5sim.org/Running_gem5#Experimenting_with_DVFS
[41] B. Bochocki, D. Rajan, X. S. Hu, C. Poellabauer, K. Otten, and T. Chantem, "Network-Aware Dynamic Voltage and Frequency Scaling," in IEEE Real-Time and Embedded Technology and Applications Symp., 2007.
