Implementing Replication for Predictability within Apache Thrift

Jianwei Tu
The Ohio State University
[email protected]

ABSTRACT
Interactive applications, such as search, social networking and retail, hosted in cloud data centers generate large quantities of small workloads that require extremely low median and tail latency in order to provide soft real-time performance to users. These small workloads are known as short TCP flows. However, these short TCP flows experience long latencies due in part to large workloads consuming most of the available buffer in the switches. Imperfect routing algorithms such as ECMP make the matter even worse. We propose a transport mechanism that uses replication for predictability to achieve low flow completion time (FCT) for short TCP flows. We implement replication for predictability within the Apache Thrift transport layer: each short TCP flow is replicated, identical packets are sent on both flows, and the first flow that finishes the transfer is used. Our results show that by using replication for predictability for short TCP flows, we can reduce median FCT by 15% and tail FCT by 20%. When integrated with Cassandra, we can also improve the performance of the Read operation with flow replication.

Keywords
Replication for Predictability; Apache Thrift; Interactive Applications; Flow Completion Time; Tail Latency

1. INTRODUCTION
A variety of interactive applications, such as search, social network content composition, and advertisement selection, are hosted in cloud data centers nowadays. In order to provide high availability to users, these interactive applications have very strict requirements on completion time. If these requirements are violated, the services become less available. Less available Internet services cause loss of customers and reduce companies' revenue. According to [9], delays exceeding 100ms decrease total revenue by 1%.

Interactive applications often follow a Partition/Aggregate design pattern [3]. The incoming requests to the higher-layer aggregator of the application are partitioned into several pieces and sent out to the workers in the lower layer. After some computation, the responses of the workers are sent back to the higher-layer aggregator and aggregated into the final result. During Partition/Aggregate processes, a large number of short TCP query and response flows are generated. These short TCP flows need to be completed within the latency boundary to meet strict service level objectives (SLOs).

The network traffic in data centers is a mix of short and long flows, and flow statistics exhibit strong heavy-tail behaviors: most flows (short flows) only have a small number of packets, while a very few flows (long flows) have a large number of packets. A study indicated that about 0.02% of all flows contributed more than 59.3% of the total traffic volume [1]. TCP is the dominant transport protocol used in data centers. However, the performance for short flows in TCP is very poor: although in theory they can be finished in 10-20 microseconds with 1G or 10G interconnects, the actual flow completion time (FCT) is as high as tens of milliseconds [2]. This is due in part to long flows consuming some or all of the available buffers in the switches [3]. Imperfect routing algorithms such as ECMP make the matter even worse. State-of-the-art forwarding in enterprise and data center environments uses ECMP to statically direct flows across available paths using flow hashing. It does not account for either current network utilization or flow size, and may direct many long flows to the same path, causing flash congestion and long-tailed FCT even when the network is lightly loaded [2].

We propose a transport mechanism that uses replication for predictability [2] [9] [10] to achieve low FCT for short flows. For each short TCP flow, we replicate it and send identical packets on both flows by creating two connections to the receiver. The application uses the first flow that finishes the transfer. We observe that the congestion levels of different paths in data center networks are statistically independent. The original flow and the replicated flow are highly likely to traverse different paths, reducing the probability that both paths experience long queuing delay.

We implement flow replication in the Apache Thrift [11] transport layer. Apache Thrift is an RPC framework that provides scalable cross-language services implementation. It can be used as middleware at the application layer, which means there is no need to modify switches or operating systems. We conduct the experiments on our private cloud and in an Amazon EC2 data center. The latest EC2 data centers are known to have multiple equal-cost paths between two virtual machines. Our experiment results show that replication for predictability can reduce the flow completion time of short TCP flows by over 20%. When integrated with Cassandra, we can also improve the performance of the Read operation with flow replication.

2. BACKGROUND
In this section, we introduce the background information of this project.

A. Replication for Predictability
In recent years, replication for predictability has gained increasing attention in both academia and industry. It can be used to provide high throughput, low latency and high availability. In [9], Zoolander scales out using replication for predictability and uses redundant accesses to mask outlier response times. Google uses request replication to achieve rapid response times in their large online services [14].

B. ECMP Routing
Equal-cost multi-path (ECMP) routing is widely used in today's data centers. It is a mechanism that distributes traffic across multiple paths of the same network cost and performs flow-level load balancing. A switch that uses ECMP as its routing mechanism routes all packets of the same flow to the same path, based on the hash value of the five-tuple in each packet header (source/destination IP address, source/destination port number, and transport protocol). Due to this randomness, it is highly possible that short TCP flows and long TCP flows are routed to the same path, which causes heavy tail latency for short TCP flows.
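To make flow-level hashing concrete, the sketch below (our illustration, not taken from any switch implementation) shows how an ECMP-style device might map a flow's five-tuple to one of several equal-cost paths; the hash function and path count are assumptions. Because every packet of a flow carries the same five-tuple, the whole flow is pinned to one path, while two connections that differ only in source port can land on different paths.

    import java.util.Objects;

    // Illustrative flow-level ECMP hashing (assumed hash; real switches differ).
    public class EcmpHashDemo {
        static int pickPath(String srcIp, int srcPort, String dstIp, int dstPort,
                            String proto, int numEqualCostPaths) {
            int h = Objects.hash(srcIp, srcPort, dstIp, dstPort, proto);
            return Math.floorMod(h, numEqualCostPaths); // same five-tuple -> same path
        }

        public static void main(String[] args) {
            // Original and replicated connection differ only in the source port,
            // so they are likely (though not guaranteed) to hash to different paths.
            System.out.println(pickPath("10.0.0.1", 50001, "10.0.0.2", 9090, "TCP", 4));
            System.out.println(pickPath("10.0.0.1", 50002, "10.0.0.2", 9090, "TCP", 4));
        }
    }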
C. Apache Thrift
Apache Thrift [11] is a popular RPC framework developed by Facebook. It provides scalable cross-language services implementation. For example, if a client written in Python needs to communicate with a server written in Java, Apache Thrift can be used to provide the interfaces for the cross-language communication. Apache Thrift provides a full software stack which includes a Server Layer, Processor Layer, Protocol Layer and Transport Layer. It is also used as middleware in Apache Cassandra, Apache HBase and JBoss to provide cross-language communication.

D. Apache Cassandra
Apache Cassandra [15] is a distributed storage system developed by Facebook. It manages very large amounts of structured data spread across many commodity servers and provides highly available service with no single point of failure. Apache Thrift is used in Cassandra to allow portable (across programming languages) access to the Cassandra database.

3. DESIGN
In this section, we first discuss the default design of Apache Thrift and its whole software stack. Then we propose our design for flow replication in Apache Thrift, which takes advantage of the multi-path diversity that exists due to randomized load balancing in data center networks. We discuss the key components of this design and describe the related challenges and our approach to solving them. Because we only replicate short TCP flows (<= 100KB) to achieve better latency, flow replication is a good fit for latency-sensitive applications that generate a large number of short TCP flows. We keep the existing design and add flow replication as a configurable option with a selection parameter, flow_replication_enabled. When this parameter is set to true, the application enables the flow replication feature, and vice versa.

3.1 Existing Design
Apache Thrift [11] provides a software stack that aims at simplifying the development of cross-language applications. Figure 1 [12] shows the software stack of Apache Thrift.

Figure 1. Apache Thrift Software Stack

Transport Layer: The Transport Layer provides the ability to read from and write to the network. It decouples the underlying transport from the other components in the rest of the system. The methods exposed by the Transport interface are open, close, read, write and flush. In addition to the Transport interface, Thrift also uses a ServerTransport interface that is used mainly on the server side to create new Transport objects for incoming connections [17].

Protocol Layer: The Protocol Layer maps in-memory data structures to a wire format. A protocol defines the way data types use the underlying Transport Layer to encode and decode themselves. It is also responsible for serialization and deserialization. Some examples of protocols include JSON, XML, plain text, compact binary, etc. [17].

Processor Layer: A Processor provides the ability to read data from input streams and write data to output streams. It uses Protocol objects to represent the input and output streams. The compiler of Thrift generates service-specific processor implementations [17].

Server: A server is used to create a transport, create input/output protocols for the transport, create a processor based on the input/output protocols, and wait for incoming connections and hand them off to the processor [17].
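For reference in the design discussion that follows, the sketch below mirrors the Transport and ServerTransport surfaces described above. It is a simplification of ours for illustration; the actual org.apache.thrift.transport classes carry additional methods and Thrift-specific exception types.

    import java.io.IOException;

    // Simplified view of the two transport surfaces the design extends.
    interface SimpleTransport {
        void open() throws IOException;                               // establish the connection
        void close();                                                 // tear it down
        int read(byte[] buf, int off, int len) throws IOException;    // bytes from the wire
        void write(byte[] buf, int off, int len) throws IOException;  // bytes to the wire
        void flush() throws IOException;                              // push buffered bytes out
    }

    // Server-side counterpart: hands out a new transport for each incoming connection.
    interface SimpleServerTransport {
        void listen() throws IOException;
        SimpleTransport accept() throws IOException;
        void close();
    }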
3.2 Proposed Design
Flow replication happens in the Transport Layer, which sits on top of TCP/IP, so there is no need for us to modify the transport protocol or the operating system. We can use this as middleware for any application to invoke. Before we build the design, we need to make several design choices. First, which short flows should be replicated? We consider flows that are less than or equal to 100KB as short flows, and these can be replicated to reduce the flow completion time. We take this threshold value from [2]. Second, when should we replicate the short flows? If we replicated a short flow only once it was already suffering long queuing delays, the reaction would come too late given the short duration of these flows, and performance would suffer. So in our design, we replicate each short flow from the very beginning to get the best performance. Because short flows only account for a small fraction of the total bytes in the network, the overhead of replication is negligible.

Figure 2. Proposed Design

Figure 2 shows the main components of our proposed design for applying replication for predictability within Apache Thrift.

The major components we have added on the Server side are:

MultiListener: The Server initiates a MultiListener during its startup. Instead of listening on just one port, MultiListener listens on two different ports on the same IP address. MultiListener waits for incoming connection requests from the Client side, and adds each connection to a pre-established queue.

ConnectionQueue: ConnectionQueue is used to hold all the connections between Server and Client. Once a connection is established, MultiListener adds the connection into this queue.

MultiReceiver: Each MultiReceiver is responsible for two connections: one is the connection for the original flow; the other is the connection for the replicated flow. We allocate an appropriate buffer for each MultiReceiver. For a read operation on the Server side, once there are some bytes available in one connection, MultiReceiver gets the bytes out of that connection. We use a COUNTER variable to track the number of bytes we have read so far. If we find that the bytes we just read are old data, we throw them away silently; otherwise we put these new bytes into the buffer and deliver them to the upper level. In this way, we can make sure that we always use the result that arrives first. For a write operation on the Server side, MultiReceiver writes the same data to both of the connections.
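The sketch below illustrates the server-side accept path (MultiListener feeding a ConnectionQueue) using plain sockets rather than the actual classes inside Thrift's server transport; the two port numbers, the single shared queue, and the omission of the logic that pairs each original connection with its replica are all simplifications of ours.

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // MultiListener sketch: accept on two ports, feed one ConnectionQueue.
    public class MultiListenerSketch {
        private final BlockingQueue<Socket> connectionQueue = new LinkedBlockingQueue<>();

        public void start() throws IOException {
            acceptLoop(new ServerSocket(9090));   // port for original flows (assumed)
            acceptLoop(new ServerSocket(9091));   // port for replicated flows (assumed)
        }

        private void acceptLoop(ServerSocket server) {
            new Thread(() -> {
                while (true) {
                    try {
                        // Each accepted connection goes into the shared queue;
                        // a MultiReceiver later takes its original/replica pair from here.
                        connectionQueue.put(server.accept());
                    } catch (IOException | InterruptedException e) {
                        return;   // shut the loop down on error or interrupt
                    }
                }
            }).start();
        }
    }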
The Client side is extended by adding a MultiConnect component.

MultiConnect: For each Client request to the Server, if the connection is not yet established, MultiConnect creates two connections to the Server on different ports. We allocate an appropriate buffer for each MultiConnect. For a read operation on the Client side, once there are some bytes available in one connection, MultiConnect gets the bytes out of that connection. We also use a COUNTER variable to track the number of bytes we have read so far. If we find that the bytes we just read are old data, we throw them away silently; otherwise we put these new bytes into the buffer and deliver them to the upper level. In this way, we can make sure that we always use the result that arrives first. For a write operation on the Client side, MultiConnect writes the same data to both of the connections. It is very similar in functionality to MultiReceiver.
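The following self-contained sketch captures the two rules shared by MultiReceiver and MultiConnect: every outgoing write is mirrored onto both connections, and the read side keeps a counter of bytes already delivered upward so that data arriving later on the slower connection is dropped silently. The raw stream types and the polling-style read loop are our own simplifications, not the actual transport code.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Sketch of the read/write rules shared by MultiReceiver and MultiConnect.
    // "delivered" plays the role of the COUNTER described above.
    public class ReplicatedChannelSketch {
        private final InputStream[] in;    // in[0]: original flow, in[1]: replicated flow
        private final OutputStream[] out;
        private long delivered = 0;              // bytes already handed to the upper layer
        private final long[] seen = new long[2]; // bytes consumed so far from each connection

        public ReplicatedChannelSketch(InputStream[] in, OutputStream[] out) {
            this.in = in;
            this.out = out;
        }

        // Write path: mirror every outgoing byte onto both connections.
        public void write(byte[] buf, int off, int len) throws IOException {
            for (OutputStream o : out) {
                o.write(buf, off, len);
                o.flush();
            }
        }

        // Read path: take bytes from whichever connection has data first, and only
        // deliver the part that goes beyond what has already been delivered.
        public int read(byte[] buf, int off, int len) throws IOException {
            while (true) {                        // polling loop, simplified
                for (int i = 0; i < 2; i++) {
                    if (in[i].available() <= 0) continue;
                    int n = in[i].read(buf, off, Math.min(len, in[i].available()));
                    if (n <= 0) continue;
                    long start = seen[i];         // stream offset of the first byte just read
                    seen[i] += n;
                    long fresh = seen[i] - Math.max(start, delivered);
                    if (fresh <= 0) continue;     // stale data from the slower flow: drop silently
                    int drop = (int) (n - fresh); // leading bytes we have already delivered
                    System.arraycopy(buf, off + drop, buf, off, (int) fresh);
                    delivered += fresh;
                    return (int) fresh;
                }
            }
        }
    }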

4. PERFORMANCE EVALUATION
A. Evaluation on Private Cloud
First, we evaluate our implementation on our private cloud. For each pair of virtual machines, there is only one path between them, so our design obviously cannot benefit from multi-path diversity in this private cloud. However, we can examine the replication overhead here. Because replication of short TCP flows generates more traffic that goes into the network, we want to know whether the overhead is negligible or not. We use a simple add program to do the evaluation. The Server runs on one virtual node and provides an add function which can be called by a remote Client. The remote Client runs on another virtual node. It performs an RPC operation every second and sends two parameters to the remote Server. The remote Server calculates the result and sends it back to the Client. We measure the latency of this RPC operation.
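For concreteness, the measurement client could look like the sketch below. The AddService interface is a hypothetical stand-in for whatever Thrift IDL the experiment actually used, and the host name and port are placeholders; the part that matters is the Thrift socket/protocol plumbing and the once-per-second timing loop.

    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;

    // Hypothetical latency-measurement client for the add service.
    public class AddLatencyClient {
        public static void main(String[] args) throws Exception {
            TSocket socket = new TSocket("server-host", 9090);   // placeholder address
            socket.open();
            // AddService is assumed to be generated from an IDL such as:
            //   service AddService { i32 add(1: i32 a, 2: i32 b) }
            AddService.Client client = new AddService.Client(new TBinaryProtocol(socket));

            for (int i = 0; i < 3600; i++) {                      // e.g. one hour of samples
                long start = System.nanoTime();
                client.add(i, i + 1);                             // the RPC being measured
                long micros = (System.nanoTime() - start) / 1_000;
                System.out.println("FCT(us) = " + micros);
                Thread.sleep(1000);                               // one operation per second
            }
            socket.close();
        }
    }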
Figure 3. FCT for short TCP flows on private cloud: (a) median FCT, (b) 99th-percentile FCT

Figure 3 shows the results. As we can see in (a) and (b), the FCT for short TCP flows with and without replication is almost the same. This shows that the overhead of flow replication is negligible, which can be explained by the fact that short TCP flows only account for a small fraction of the total bytes in the network.

B. Evaluation on Amazon EC2
To fully exploit the benefits of our implementation, we need to evaluate it in data centers with multiple equal-cost paths between many pairs of virtual machines. We therefore evaluate the performance of our implementation on Amazon EC2. Amazon EC2 is a large-scale production data center environment that allows us to conduct real-world experiments. The latest EC2 data centers (us-east-1c and us-east-1d) [13] are known to have topology structures that provide many alternative equal-cost paths between certain pairs of virtual machines. The typical switches in EC2 data centers run a variant of ECMP routing that performs hash-based flow-level load balancing by hashing flows to equal-cost paths across the network topology.

However, we do not know the exact network topology of the EC2 US East data centers, and we can only infer it. This brings a lot of complexity to our evaluation. Even worse, each instance is a virtual machine that shares physical hardware with other virtual machines, and it is difficult for us to decide whether two virtual machines are on the same physical hardware or not. The background workload traffic levels in the network are also unknown to us, and can change between experiments. In order to utilize the multi-path property of the environment, we ran our experiments on 10 EC2 instances (1 server, 9 clients) for 24 hours. The test program used here is the same as the one we used in our private cloud. We measured the flow completion time between each pair of hosts.
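The raw per-RPC measurements are then reduced to the median and tail percentiles reported below. The helper sketched here is our own illustration of that step (using a nearest-rank percentile definition), not the analysis code used for the paper's figures.

    import java.util.Arrays;

    // Turning raw per-RPC completion times into median and tail statistics.
    public class FctStats {
        static double percentile(double[] samplesMicros, double p) {
            double[] sorted = samplesMicros.clone();
            Arrays.sort(sorted);
            int rank = (int) Math.ceil(p / 100.0 * sorted.length); // nearest-rank definition
            return sorted[Math.max(0, rank - 1)];
        }

        public static void main(String[] args) {
            double[] fct = {310, 295, 480, 12000, 330, 305, 299, 41000, 320, 315};
            System.out.println("median FCT (us): " + percentile(fct, 50));
            System.out.println("99th FCT (us):   " + percentile(fct, 99));
        }
    }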

Figure 4. FCT for short TCP flows on EC2: (a) median FCT, (b) 95th-percentile FCT, (c) 99th-percentile FCT, (d) 99.9th-percentile FCT

Figure 4 shows the results. Figure 4(a) shows that flow replication can reduce median FCT by 15%. For tail FCT, the improvement is over 20%.

Figure 5. Median FCT for short TCP flows on 9 different EC2 instances

Figure 5 shows the median FCT on the different instances (from instance 1 to instance 9). These results all indicate that replication for predictability is a powerful strategy when multi-path diversity exists in the data center network.

C. Integration with Apache Cassandra
Cassandra uses Thrift to allow portable (across programming languages) access to the Cassandra database. We integrate our Thrift libraries into Cassandra; the Thrift server in Cassandra is modified to use our libraries. We conduct our experiments on Amazon EC2. We set up a single-node Cassandra on one EC2 instance and four Thrift clients on four other EC2 instances.

Figure 6. 1KB Read FCT for Cassandra: (a) median FCT, (b) 99th-percentile FCT

Figure 6 shows the results we get for the 1KB Read operation in Cassandra. Figure 6(a) shows that the median FCT is reduced by 8% with flow replication. Figure 6(b) shows that the 99th-percentile tail FCT is improved by 12%.

5. RELATED WORKS
Researchers have proposed many solutions to reduce the FCT of short TCP flows. In [16], redundancy is used to achieve significant mean and tail latency reduction in real-world systems including DNS queries, database servers, and packet forwarding within networks. In [3], Explicit Congestion Notification (ECN) is leveraged in the network to provide multi-bit feedback to the end hosts. The Preemptive Distributed Quick (PDQ) flow scheduling protocol is designed to complete flows quickly and meet flow deadlines [4]. Deadline-aware scheduling is used in [5]. These solutions are effective; however, they need to modify switches and operating systems, which makes them difficult to deploy in a large-scale data center.

Beyond short flows, the general problem of handling performance anomalies has also been well studied. Shen et al. [6] used reference executions to detect and debug problems, but their granularity was too large for TCP. Stewart et al. [7] and Attariyan et al. [8] studied the effects of poor configurations. These approaches detect and fix anomalies online but require running times that are much larger than short flows.

6. CONCLUSION
In this paper, we address the problem of how to use replication for predictability to reduce the flow completion time of short TCP flows. We design and implement replication for predictability within Apache Thrift, a popular RPC framework used by companies such as Facebook. Our design replicates each short TCP flow to exploit the multi-path diversity that exists in modern data centers and needs no modifications to the transport protocol or operating system. We conduct experiments in an Amazon EC2 data center, and the results show that our design achieves better median and tail flow completion time compared to the default single-flow design.

7. ACKNOWLEDGMENTS
I want to thank my advisor Dr. Christopher Stewart for his support and help. He gave me a lot of insightful advice throughout this project.

8. REFERENCES
[1] T. Mori, M. Uchida, R. Kawahara, J. Pan, and S. Goto. Identifying Elephant Flows Through Periodically Sampled Packets. In IMC'04, October 25-27, 2004, Taormina, Sicily, Italy.
[2] H. Xu and B. Li. RepFlow: Minimizing Flow Completion Times with Replicated Flows in Data Centers. In IEEE INFOCOM'14, Toronto, Canada.
[3] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data Center TCP (DCTCP). In SIGCOMM'10, Aug. 30-Sep. 3, 2010, New Delhi, India.
[4] C.-Y. Hong, M. Caesar, and P. B. Godfrey. Finishing Flows Quickly with Preemptive Scheduling. In Proc. ACM SIGCOMM'12.
[5] A. Munir, I. A. Qazi, Z. A. Uzmi, A. Mushtaq, S. N. Ismail, M. S. Iqbal, and B. Khan. Minimizing Flow Completion Times in Data Centers. In Proc. IEEE INFOCOM'13.
[6] K. Shen, C. Stewart, C. Li, and X. Li. Reference-Driven Performance Anomaly Identification. In Proc. of the 11th ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'09), Seattle, WA, June 2009.
[7] C. Stewart, K. Shen, A. Iyengar, and J. Yin. EntomoModel: Understanding and Avoiding Performance Anomaly Manifestations. In Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'10), Miami Beach, FL, 2010.
[8] M. Attariyan and J. Flinn. Automating Configuration Troubleshooting with Dynamic Information Flow Analysis. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Vancouver, BC, Canada, October 2010.
[9] C. Stewart, A. Chakrabarti, and R. Griffith. Zoolander: Efficiently Meeting Very Strict, Low-Latency SLOs. In International Conference on Autonomic Computing (ICAC), San Jose, CA, 2013.
[10] B. Trushkowsky, P. Bodik, A. Fox, M. Franklin, M. Jordan, and D. Patterson. The SCADS Director: Scaling a Distributed Storage System Under Stringent Performance Requirements. In Proceedings of the Conference on File and Storage Technologies (FAST), 2011.
[11] M. Slee, A. Agarwal, and M. Kwiatkowski. Thrift: Scalable Cross-Language Services Implementation. Technical report, Facebook, 2007.
[12] Jones. Simplifying Scalable Cloud Software Development with Apache Thrift. https://www.ibm.com/developerworks/library/os-cloud-apache-thrift/
[13] C. Raiciu, S. Barre, C. Pluntke, A. Greenhalgh, D. Wischik, and M. Handley. Improving Datacenter Performance and Robustness with Multipath TCP. In Proc. ACM SIGCOMM, 2011.
[14] J. Dean. Achieving Rapid Response Times in Large Online Services. Berkeley AMPLab Cloud Seminar, http://research.google.com/people/jeff/latency.html, March 2012.
[15] A. Lakshman and P. Malik. Cassandra: A Decentralized Structured Storage System. In ACM SIGOPS Operating Systems Review, April 2010, New York, USA.
[16] A. Vulimiri, P. Godfrey, R. Mittal, J. Sherry, S. Ratnasamy, and S. Shenker. Low Latency via Redundancy. In CoNEXT'13, December 2013, Santa Barbara, California, USA.
[17] D. Gupta. Thrift: The Missing Guide. http://diwakergupta.github.io/thrift-missing-guide/, July 2013.