
Time Capsule: Tracing Packet Latency across Different Layers in Virtualized Systems

Kun Suo and Jia Rao, The University of Texas at Arlington ([email protected], [email protected])
Luwei Cheng, Facebook ([email protected])
Francis C. M. Lau, The University of Hong Kong ([email protected])

Abstract

Latency monitoring is important for improving user experience and guaranteeing quality-of-service (QoS). Virtualized systems, which have complex I/O stacks spanning multiple layers and often unpredictable performance, pose more challenges in monitoring packet latency and diagnosing performance abnormalities than traditional systems. Existing tools either trace network latency at a coarse granularity, incur considerable overhead, or lack the ability to trace across the different boundaries in virtualized environments. To address these issues, we propose Time Capsule (TC), an in-band profiler that traces packet-level latency in virtualized systems with acceptable overhead. TC timestamps packets at predefined tracepoints and embeds the timing information into packet payloads. TC decomposes and attributes network latency to the various layers in the virtualized network stack, which helps monitor network latency, identify bottlenecks, and locate performance problems.

1. Introduction

As virtualization has become mainstream in data centers, a growing number of enterprises and organizations are moving applications into the cloud, such as Amazon EC2 [1]. It is well known that virtualization introduces significant and often unpredictable overhead to I/O-intensive workloads, especially latency-sensitive network applications. To improve user experience and guarantee QoS in the cloud, it is necessary to efficiently monitor and diagnose the latency of these applications in virtualized environments.

However, compared to traditional systems, it is more challenging to trace I/O workloads and troubleshoot latency problems in virtualized systems. First, virtualization introduces additional software stacks and protection domains, while traditional tracing tools designed for physical machines are unable to trace across the boundaries (e.g., between the hypervisor and the guest OS). Second, workload consolidation inevitably incurs interference between applications, which often leads to unpredictable latencies. Thus, fine-grained tracing of network latency, instead of coarse-grained monitoring, is particularly important for guaranteeing QoS. Third, many applications in the cloud are highly optimized, and the tracing tool should incur negligible performance impact [31]; otherwise, tracing overhead may hide seemingly slight but relatively significant latency changes of the cloud services. Last, as clouds host diverse workloads, it is prohibitively expensive to devise application-specific tracing mechanisms, which calls for a system-level and application-transparent tracing tool.

There exist many studies focusing on system monitoring and diagnosis. Traditional tools, such as SystemTap [12], DTrace [17], and Xentrace [13], and existing instrumentation systems, such as DARC [34] and Fay [21], are limited to use within a certain boundary, e.g., only in the hypervisor or in a virtual machine (VM). None of them can trace activities, e.g., network processing, throughout the entire virtualized I/O stack. Many state-of-the-art works, like the Mystery Machine [20], Draco [24], LogEnhancer [37], and lprof [38], identify performance anomalies among machines based on distributed logs. However, analyzing massive logs not only introduces non-negligible runtime overhead, but also cannot guarantee providing the information users need, e.g., statistics on tail latency. In addition, such out-of-band profiling needs additional effort to stitch the distributed logs [18, 33], and time drift among logs on different machines may also affect accuracy.

In this paper, we propose Time Capsule (TC), a profiler that traces network latency at packet level in virtualized environments. TC timestamps packets at predefined tracepoints and embeds the timing information into the payloads of packets. Due to in-band tracing, TC is able to measure packet latency across different layers and protection boundaries. In addition, based on the position of tracepoints, TC can decompose and attribute network latency to various components of the virtualized network stack to locate the potential bottleneck. Further, TC incurs negligible overhead and requires no changes to the traced applications. We demonstrate that fine-grained tracing and latency decomposition enabled by TC shed light on the root causes of long tail network latency and help identify real performance bugs in Xen.

The rest of the paper is organized as follows. Sections 2 and 3 describe the design, implementation, and optimizations of Time Capsule, respectively. Section 4 presents evaluation results. Sections 5 and 6 present discussions and related work, respectively. Section 7 concludes this paper.

Figure 1. The architecture of Time Capsule.

Figure 2. The relationship of TSCs on different physical machines. The constant_tsc flag allows the TSC to tick at the processor's maximum rate regardless of the actual CPU speed. The line slope represents the maximum CPU frequency.
2. Design and Implementation

The challenges outlined in Section 1 motivate the following design goals of Time Capsule: (1) cross-boundary tracing; (2) fine-grained tracing; (3) low overhead; and (4) application transparency. Figure 1 presents a high-level overview of how TC enables tracing for network send (Tx) and receive (Rx). TC places tracepoints throughout the virtualized network stack and timestamps packets at enabled tracepoints. The timing information is appended to the payload of the packet. For network receive, before the traced packet is copied to user space, TC restores the packet payload to its original size and dumps the tracing data to a kernel buffer, from where the tracing data can be copied to user space for offline analysis. For network send, the trace dump happens before the packet is transmitted by the physical NIC. Compared to packet receive, we preserve the timestamp of the last tracepoint in the payload of a network send packet to support tracing across physical machines. Since tracepoints are placed in either the hypervisor, the guest kernel, or the host kernel, TC is transparent to user applications. Next, we elaborate on the design and implementation of TC in a Xen environment.

Clocksource. To accurately attribute network latency to various processing stages across different protection domains, e.g., Dom0, DomU, and Xen, a reliable and cross-domain clocksource with high resolution is needed. The para-virtualized clocksource xen meets these requirements. In a Xen environment, the hypervisor, Dom0, and the guest OS all use the same clocksource xen for time measurement; therefore, packet timestamping using the xen clocksource avoids time drift across different domains. Further, the clocksource xen is based on the Time Stamp Counter (TSC) of the processor and has nanosecond resolution, which is adequate for latency measurement at the microsecond granularity. A similar clocksource, kvm-clock, is available in KVM.

Furthermore, we enabled constant_tsc and nonstop_tsc on Intel processors, which guarantee that the TSC rate is not only synchronized across all sockets and cores, but also unaffected by power management on individual processors. As such, the TSC ticks at the maximum CPU clock rate regardless of the actual CPU running speed. For cross-machine tracing, the TSCs on different physical nodes may inevitably tick at different rates due to different CPU speeds; therefore, the relative difference between timestamps recorded on separate machines does not reflect the actual passage of time. Figure 2 shows the relationship between the TSC readings on two machines with different CPU speeds. The slopes of the two lines represent the maximum CPU frequency on the respective machines. There exist two challenges in correlating the timestamps on separate machines. First, TSC readings are incremented at different rates (i.e., different slopes). Second, TSC registers are reset at boot time or when resuming from hibernation, so the relative difference between TSC readings on two machines includes the offset between their last TSC resets. For example, as shown in Figure 2, the distance between the TSC resets on two machines is denoted by α = |t_reset^a − t_reset^b|, where t_reset^a and t_reset^b are the last TSC reset times of machine a and machine b, respectively.

Tracepoints. Tracepoints are placed on the critical path of packet processing in the virtualized network stack. When a target packet passes through a predefined tracepoint, a timestamp based on the local clocksource is appended to the packet payload. The time difference between two tracepoints measures how much time the packet spent in a particular processing stage. For example, two tracepoints can be placed at the backend driver in Dom0 and the frontend driver in DomU to measure packet processing time in the hypervisor. As timestamps are taken sequentially at various tracepoints throughout the virtualized network stack, TC does not need to infer the causal relationship of the tracepoints (as Pivot Tracing does [27]); the timestamps in the packet payload have strict happened-before relations.
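To make the timestamping step concrete, the sketch below shows how a tracepoint on the packet path might read the TSC and write an 8-byte entry into the reserved tracing area of an sk_buff. This is an illustrative sketch, not the actual TC source: TC_NR_SLOTS and tc_trace_offset() are hypothetical, and a raw rdtscp read stands in for the clocksource-specific read function used by TC.

    /*
     * Minimal sketch of a TC-style tracepoint (hypothetical names, not the
     * actual Time Capsule code). Assumes the tracing area at the tail of the
     * packet has already been reserved, e.g., with skb_put(skb, SIZE).
     */
    #include <linux/skbuff.h>
    #include <linux/types.h>

    #define TC_NR_SLOTS 8   /* assumed number of 8-byte raw-data slots */

    /* Hypothetical: byte offset of raw-data slot i inside the packet data. */
    static size_t tc_trace_offset(const struct sk_buff *skb, int i)
    {
            return skb->len - (TC_NR_SLOTS - i) * sizeof(u64);
    }

    static inline u64 tc_read_tsc(void)
    {
            u32 lo, hi, aux;

            /* rdtscp returns the TSC in EDX:EAX; one read costs roughly 20 ns */
            asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
            return ((u64)hi << 32) | lo;
    }

    /* Called when a traced packet passes enabled tracepoint i. */
    static void tc_tracepoint(struct sk_buff *skb, int i)
    {
            u64 *slot = (u64 *)(skb->data + tc_trace_offset(skb, i));

            *slot = tc_read_tsc();
    }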

Cross-machine tracing. Cross-machine tracing requires that the varying TSC rates and reset times on different machines be taken into account for accurate latency attribution. Specifically, timestamps recorded on separate machines should be calibrated to determine the latency due to network transmission between machines. We illustrate the TSC calibration process with Figure 2. Assume that a packet is sent at time t1 from machine a (denoted by the dotted blue line), which has a faster CPU and whose TSC starts to tick earlier, and received at time t2 on machine b (denoted by the solid red line), which has a slower CPU. Without TSC calibration, the difference tsc_b − tsc_a can show a negative transmission time. There are two ways to measure the packet transmission time in the network based on the two timestamps tsc_a and tsc_b taken at the sender and receiver machines.

First, the difference of TSC reset times α can be estimated as |tsc_sync^a / cpufreq_a − tsc_sync^b / cpufreq_b|, where tsc_sync^a and tsc_sync^b are the instantaneous TSC readings on the two machines at exactly the same time. This can be achieved through distributed clock synchronization algorithms, which estimate the packet propagation time in a congestion-free network and adjust the two TSC readings. Once α is obtained, the absolute TSC difference β is calculated as β = α × cpufreq_a. The first calibration step is then to derive tsc_a' = tsc_a − β to remove the absolute TSC difference. As shown in Figure 2, tsc_a' is the TSC reading of the packet transmission at the sender if machine a had reset its TSC at the same time as machine b. Further, the equivalent TSC reading at the receiver machine b when the packet starts transmission is tsc_b' = tsc_a' × (cpufreq_b / cpufreq_a). Finally, the packet transmission time is the difference between the timestamps of packet send and receive expressed on the receiver machine b: t2 − t1 = (tsc_b − tsc_b') / cpufreq_b.

The first calibration method only requires the examination of one packet to measure packet transmission time, but it relies on an accurate estimation of α. Since α is constant for all packet transmissions between two particular machines, an alternative is to estimate the network condition based on comparisons of multiple packet transmissions. Similar to [25], which compares packet transmission time with a reference value in a congestion-free environment to estimate network congestion, we can roughly measure the packet transmission time as tsc_a / cpufreq_a − tsc_b / cpufreq_b and use cross-packet comparison to identify abnormally long transmission times. However, this method only identifies relative transmission delays with respect to a reference transmission time, which is difficult to obtain in a production datacenter network and may vary as packets are transmitted through different routes.
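The first calibration method can be summarized by the arithmetic below. The sketch is purely illustrative: the frequencies and TSC readings are made-up example values (chosen so that the true transmission time is 100 µs), and obtaining tsc_sync^a and tsc_sync^b is assumed to be handled by a separate clock-synchronization exchange.

    /* Illustrative cross-machine TSC calibration (first method in the text). */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            /* Example inputs (assumed): maximum TSC rates and raw readings. */
            double cpufreq_a = 2.5e9, cpufreq_b = 2.0e9;  /* ticks per second */
            uint64_t tsc_sync_a = 375000000000ULL;        /* readings taken at the   */
            uint64_t tsc_sync_b = 100000000000ULL;        /* same instant on a and b */
            uint64_t tsc_a = 500000000000ULL;             /* sender reading at t1    */
            uint64_t tsc_b = 200000200000ULL;             /* receiver reading at t2  */

            /* alpha: difference of TSC reset times (s); beta: the same, in sender ticks. */
            double alpha = fabs(tsc_sync_a / cpufreq_a - tsc_sync_b / cpufreq_b);
            double beta  = alpha * cpufreq_a;

            double tsc_a_prime = tsc_a - beta;                            /* reset-aligned sender reading */
            double tsc_b_prime = tsc_a_prime * (cpufreq_b / cpufreq_a);   /* same instant, in b's ticks   */
            double transit_us  = (tsc_b - tsc_b_prime) / cpufreq_b * 1e6; /* t2 - t1 */

            printf("estimated transmission time: %.1f us\n", transit_us); /* prints 100.0 us */
            return 0;
    }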
Figure 3. The structure of a Time Capsule packet.

Tracing payload. To enable latency tracing across physical or virtual boundaries, TC adds an extra payload to a packet to store the timestamps of tracepoints. Upon receiving a packet at the physical NIC, or upon copying a packet from user space to kernel space for sending, TC uses skb_put(skb, SIZE) to allocate additional space in the original packet. The tracing information is removed from the packet payload and dumped to a kernel buffer before the packet is copied to the application buffer in user space or sent out by the physical NIC. Figure 3 shows the structure of a TC-enabled packet. The tracing payload contains two types of data: the tracing raw data and the tracing metadata. The tracing raw data consists of 8-byte entries, each of which stores the timestamp of a tracepoint. Users can place plenty of tracepoints in the virtualized network stack based on their needs and choose which tracepoints to enable for a particular workload. The tracing metadata uses annotation bits to indicate whether the corresponding tracepoint is enabled (1 means enabled). Users define a mask to specify the enabled tracepoints and initialize the tracing metadata. The SIZE of the tracing payload depends on the number of enabled tracepoints.

For latency tracing across VMs on the same physical machine, the packet is transferred between the hypervisor and domains through shared memory, so the packet size is not limited by the maximum transmission unit (MTU). Thus, TC is able to allocate sufficient space in the packet payload for tracing without affecting the number of packets communicated by the application workloads. For tracing across different physical machines, we dump all the timestamps before packets are sent out by the NIC but preserve the last timestamp recorded at the sender side (the sender-side raw data in Figure 3) in the tracing payload. When the packet arrives at the receiver machine, new tracing data is added after the sender side's last timestamp. As such, the tracing data pertaining to the same packet stored on multiple machines can be stitched together by the shared timestamp.
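As an illustration of the payload manipulation, the sketch below reserves the tracing area with skb_put() and initializes the metadata. Only skb_put() and hweight16() are real kernel APIs here; the layout and field widths of tc_trace_payload are assumptions based on the description above, not TC's published format.

    /* Illustrative tracing payload appended to a traced packet. */
    #include <linux/bitops.h>
    #include <linux/skbuff.h>
    #include <linux/types.h>

    struct tc_trace_payload {
            __u16 annotation;       /* bit i set => tracepoint i enabled         */
            __u8  sampled;          /* sampling decision bit                     */
            __u8  nr_entries;       /* number of 8-byte raw-data entries         */
            __u64 raw[];            /* one timestamp slot per enabled tracepoint */
    } __attribute__((packed));

    /* Reserve SIZE bytes of tracing space before the packet enters the traced
     * path; SIZE depends on how many tracepoints the user mask enables. */
    static struct tc_trace_payload *tc_attach(struct sk_buff *skb, __u16 mask)
    {
            unsigned int n = hweight16(mask);
            unsigned int size = sizeof(struct tc_trace_payload) + n * sizeof(__u64);
            struct tc_trace_payload *tp;

            tp = (struct tc_trace_payload *)skb_put(skb, size);
            tp->annotation = mask;
            tp->sampled    = 1;
            tp->nr_entries = n;
            return tp;
    }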

3. Overhead and Optimizations

Despite its benefits, tracing at packet level can introduce considerable overhead to network applications. For highly optimized services in the cloud, such tracing overhead can significantly hurt performance. In this section, we discuss the sources of overhead and the corresponding optimizations in Time Capsule.

Time measurement. It usually incurs various levels of cost to obtain timestamps in kernel space. If not properly designed, fine-grained tracing can significantly degrade network performance, especially by increasing packet tail latency.
Optimization: We compared the cost of the different clocksource read functions available in the various domains using a simple clock test tool [2] and adopted native_read_tscp to read from the xen clocksource at each tracepoint. Compared to other time functions, which incur an overhead ranging from several microseconds to a few milliseconds, native_read_tscp adds only about 20 nanoseconds of latency at a tracepoint. This overhead is negligible compared to the tens to hundreds of microseconds of latency of a typical network request.

Trace collection. The dump of traces to storage is the most expensive operation in TC. If the completion of each packet triggers a disk write, the overhead will be prohibitively high. Further, the overhead grows as the intensity of network traffic increases.
Optimization: We adopt a ring buffer in the guest network stack (receiver side) and in the NIC driver of the driver domain (sender side) to temporarily store the tracing data. Each time the tracing payload is removed from a packet, the tracing data is copied to the buffer. The latest trace sample overwrites the oldest data if the circular buffer is full. We use mmap to map the kernel buffer to user space through the /proc file system, from where the traces can be dumped to persistent storage. We designed a user space trace collector that periodically dumps and clears the collected traces. As trace collection happens infrequently and can be performed offline on another processor, it does not hurt network latency.
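A minimal sketch of the overwrite-on-full trace buffer is shown below. The record layout, sizes, and names are assumptions for illustration; in TC the kernel buffer is additionally exposed to the user space collector by mmap-ing a /proc entry, which is omitted here.

    /* Illustrative circular trace buffer: the newest record overwrites the
     * oldest when the buffer is full, so tracing never blocks on collection. */
    #include <linux/spinlock.h>
    #include <linux/types.h>

    #define TC_RING_ENTRIES 4096            /* assumed size, power of two        */
    #define TC_REC_SLOTS    16              /* assumed max timestamps per packet */

    struct tc_record {
            __u64 ts[TC_REC_SLOTS];
    };

    static struct tc_record tc_ring[TC_RING_ENTRIES];
    static unsigned long tc_head;           /* next slot to write */
    static DEFINE_SPINLOCK(tc_ring_lock);

    /* Called when the tracing payload is stripped from a packet. */
    static void tc_ring_push(const struct tc_record *rec)
    {
            unsigned long flags;

            spin_lock_irqsave(&tc_ring_lock, flags);
            tc_ring[tc_head & (TC_RING_ENTRIES - 1)] = *rec;   /* overwrite oldest */
            tc_head++;
            spin_unlock_irqrestore(&tc_ring_lock, flags);
    }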

Instrumentation cost. Too much instrumentation is a common issue in system monitoring, especially for latency-sensitive applications. As Google's tracing infrastructure Dapper [31] shows, over-sampling increases the average latency of web search by as much as 16.3%, while only inflicting a marginal 1.48% drop in throughput.
Optimization: We devise two optimizations to reduce instrumentation overhead. First, TC selectively traces packets from a targeted application. We use a combination of IP address and port number to identify the packets that should be traced; other network packets skip the tracepoints and are processed normally. Second, as Figure 3 shows, the tracing metadata in each packet allows TC to configure a flexible tracing rate (via the sampling decision bit) and to select a subset of tracepoints (via the annotation bits). The sampling decision bit enables tracing for a portion of the packets in a network flow based on a user-defined sampling rate, and the annotation bits determine which tracepoints should be enabled for a particular packet.
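The per-packet decision of whether to trace can be sketched as follows. The flow-matching fields and the random sampling scheme are illustrative assumptions; TC lets users configure the targeted IP/port and the sampling rate, and records the decision in the sampling decision bit.

    /* Illustrative per-packet tracing decision: trace only the targeted
     * UDP flow (IP address + port) and only a sampled fraction of it. */
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/random.h>
    #include <linux/skbuff.h>
    #include <linux/types.h>
    #include <linux/udp.h>

    struct tc_filter {
            __be32 daddr;        /* targeted destination IP address   */
            __be16 dport;        /* targeted destination port         */
            u32    sample_prob;  /* sampling rate scaled to 0..2^32-1 */
    };

    static bool tc_should_trace(const struct sk_buff *skb, const struct tc_filter *f)
    {
            const struct iphdr *iph = ip_hdr(skb);

            if (iph->protocol != IPPROTO_UDP)
                    return false;
            if (iph->daddr != f->daddr || udp_hdr(skb)->dest != f->dport)
                    return false;                        /* not the traced application */

            return prandom_u32() <= f->sample_prob;      /* sampling decision bit */
    }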
Figure 4. Time Capsule incurs negligible overhead to packet latency with various (a) numbers of tracepoints and (b) sampling rates. The sampling rate is set to 1/4 in (a) and the number of tracepoints is 10 in (b).

4. Evaluation

Experimental Setup. Our experiments were performed on two PowerEdge T420 servers connected with Gigabit Ethernet. Each server was equipped with two 6-core 1.90GHz Intel Xeon E5-2420 CPUs and 32GB of memory. The host ran Xen 4.5 as the hypervisor and Linux 3.18.21 in the driver domain. The VMs had a single virtual CPU (vCPU) and 4GB of memory, and ran Linux 3.18.21 as the guest OS.

4.1 Overhead Analysis

First, we analyze the overhead of Time Capsule on network latency. Figure 4 plots the average, 99th, and 99.9th percentile latency of UDP packets using Sockperf [11]. We varied the number of tracepoints and the sampling rate. As shown in Figure 4, the number of tracepoints does not have much impact on packet latency, with no more than a 1.5% latency increase (ten tracepoints) compared to that without TC (zero tracepoints). Similarly, the performance impact due to different sampling rates is insignificant: a sampling rate of 1 increased packet latency by 2% compared to a sampling rate of 1/16. As discussed in Section 3, TC's overhead mainly comes from timestamping the packet at tracepoints and manipulating the packet payload, which takes tens of nanoseconds at each tracepoint. Given that typical network applications have latency requirements in the range of hundreds of microseconds to a few milliseconds, TC's overhead is negligible even with tens of tracepoints. In real systems, if enabling a large number of tracepoints for fine-grained tracing raises overhead concerns, lowering the packet sampling rate is likely to still provide adequate trace data for network-intensive workloads. In summary, TC adds negligible overhead to network latency, and the overhead can be controlled by varying the number of tracepoints and the sampling rate.

4.2 Per Packet Latency

User-perceived latency is an important QoS metric. However, it is difficult to track individual user experiences in virtualized environments. When application-level per-request logging is not available, system-wide tracing can be effective in identifying performance issues of individual users. Next, we demonstrate that packet level tracing reveals problems that could be hidden in coarse-grained tracing.

We created a scenario in which user-perceived latency suffered a sudden hike due to a short network burst from another application. We chose Sockperf as the application under test and used Netperf [9] to generate the interfering traffic. Figure 5 plots the per-second average latency and the packet level latency of Sockperf. The burst arrived at the 1.5th second and left at the 6.5th second. We have three observations from Figure 5(b): i) packet level latency accurately captured latency fluctuations during the burst; ii) packet level latency measurement timely reflected user-perceived latency immediately after the network spike left (at 6.5s); iii) most importantly, packet level tracing successfully captured a few spikes in latency around the 11th second. These spikes may be due to packets backlogged during the traffic burst.

Figure 5. Compared to coarse-grained tracing, packet level tracing provides more information about user-perceived latency.

In contrast, the per-second average latency (shown in Figure 5(a)) hid the performance fluctuations, was not responsive to load changes, and missed the important latency issue after the bursty traffic left. Microbursts, the sudden and rapid traffic bursts in the network, can cause long tail latency or packet loss and may not be captured by coarse-grained monitoring. TC enables packet level latency tracing and can effectively identify slight performance changes. Next, we show that TC further decomposes packet latency into the time spent in different stages to help locate the root causes of long latency.

Figure 6. Latency decomposition when Sockperf runs alone in a VM.

Figure 7. Latency decomposition when Sockperf is co-located with HPCbench in the same VM.

4.3 Latency Decomposition

A detailed breakdown of packet latency sheds light on which processing stage in the virtualized network stack contributes most to the overall latency and helps identify abnormalities at certain stages. Figure 6 and Figure 7 show the latency of Sockperf and its breakdown under two scenarios: i) the VM hosting Sockperf alone; and ii) the VM hosting Sockperf and HPCbench [8] simultaneously. In scenario ii, two separate clients sent Sockperf UDP requests and HPCbench UDP stream flows, respectively. While it is expected that the co-location of Sockperf and HPCbench in the same VM causes interference to Sockperf, we show that the latency breakdown provides insight on how to mitigate the interference.

Figure 6(a) and (b) show the receiver side latency for 500 packets and the latency breakdown when Sockperf ran alone in the VM. Without interference, the latency stabilized at about 60 µs, and the processing at the driver domain, the hypervisor, and the guest kernel contributed equally to the overall latency. In contrast, Sockperf latency degraded by up to 35x and became wildly unpredictable when co-running with the throughput-intensive HPCbench workload, as shown in Figure 7(a). Note that although interference also exists on physical machines when co-locating latency-sensitive workloads with throughput-intensive workloads, the performance degradation is not as drastic as in virtualized environments. The latency breakdown suggests that most of the degradation came from prolonged processing in Dom0.

To pinpoint the root cause, we added two additional tracepoints to further break down packet processing in Dom0: i) packet processing in Dom0's network stack (denoted as Dom0(a)) and ii) processing in the VM's backend NIC driver in Dom0 (denoted as Dom0(b)). As shown in Figure 7(b), most of the time in Dom0 was spent in the backend NIC driver. After an analysis of Dom0's backend driver code, we found that the excessively long latency was due to batching in the memory copy between the backend and frontend drivers. The backend driver does not copy packets to a VM until the receive queue of the physical NIC is depleted; all received packets are then copied to the VM in a batch to amortize the cost of the memory copy. This explains why workloads with bulk transfers degrade the performance of latency-sensitive applications: a large number of packets from throughput-intensive workloads fill up the receive queue in the backend driver, preventing the packets of the latency-sensitive application from being transferred to the VM. The analysis based on latency decomposition suggests that limiting the batching size in the backend driver would alleviate the interference.
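Offline, the decomposition itself is a series of differences between consecutive timestamps, each attributed to the layer between the two tracepoints. The sketch below illustrates this for one packet; the stage names and tracepoint order are examples that mirror the placement described above, and tsc_hz is the calibrated TSC rate.

    /* Illustrative offline breakdown of one packet's trace. */
    #include <stdint.h>
    #include <stdio.h>

    #define NR_TP 4   /* e.g., NIC arrival, backend driver, frontend driver, guest socket */

    static const char *stage[NR_TP - 1] = { "Dom0", "Xen", "DomU" };

    static void decompose(const uint64_t ts[NR_TP], double tsc_hz)
    {
            for (int i = 0; i < NR_TP - 1; i++) {
                    double us = (ts[i + 1] - ts[i]) / tsc_hz * 1e6;
                    printf("%-6s %8.1f us\n", stage[i], us);
            }
            printf("%-6s %8.1f us\n", "Total", (ts[NR_TP - 1] - ts[0]) / tsc_hz * 1e6);
    }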
4.4 Case Studies

With packet level tracing and latency decomposition, TC helps associate excessively long latency with specific processing stages in the virtualized network stack. In this section, we describe our discovery of multiple bugs in Xen's credit scheduler with the help of TC. These scheduler bugs cause long tail latency for I/O workloads. We reported the bugs to the Xen community, and they were confirmed by engineers from Citrix [3-5]. Next, we explain how TC helped locate these bugs.

Figure 8. Bug-1 [3]: Xen mistakenly boosts the priority of CPU-intensive VMs, which invalidates the prioritization of I/O-bound VMs.

Figure 9. Bug-2 [4]: Xen does not timely activate I/O VMs that are deactivated due to long idling. Bug-3 [5]: I/O VMs' BOOST priority can be prematurely demoted.

Xen's credit scheduler is designed to prioritize I/O-bound VMs without compromising overall system utilization and fairness. Thus, if an I/O-bound VM consumes less than its fair CPU share, it always has a higher priority than a CPU-bound VM, and I/O performance should not be affected by co-located CPU-bound workloads. Figure 8 shows the performance of Sockperf when its host VM shared the same physical CPU with another, CPU-bound VM. We observed that Sockperf latency degraded significantly due to interference from the CPU-bound VM. Approximately every 250 packets, latency grew to as high as 30 ms and then descended until the next spike. The latency decomposition in Figure 8(b) shows that the latency spikes always started with long delays in Xen, which dominated the overall latency. This indicates that a latency spike started with packets being blocked in Xen, after which the delay propagated to Dom0. The delay in Xen was close to 30 ms, which matches the length of the default time slice in the credit scheduler. These observations gave a hint that the long latency is correlated with the VM scheduler.

The analysis led to the discovery of the first bug in the credit scheduler: Xen mistakenly boosts the priority of a CPU-bound VM, thereby preventing an I/O-bound VM from being prioritized. If this happens, the I/O-bound VM needs to wait for a complete time slice (i.e., 30 ms) before the CPU-bound VM is descheduled by Xen due to the expiration of its time slice. The latency breakdown in Figure 8(b) further shows that the first wave of packets in a latency spike were copied to the grant table in Xen but could not be processed by the I/O-bound VM because the VM was not scheduled to run; thus, the delay is attributed to wait time in Xen. After the grant table was full but the I/O VM was still not scheduled, newly arriving packets stayed in the backend driver of Dom0, which explains the propagation of the delay to Dom0. After fixing this bug, both the average and the tail latency were greatly improved.

Using the same methodology, we discovered another two bugs in the credit scheduler that also contributed to the long tail latency issue. The reasons behind excessively long tail latency are usually complicated. As packets with long latency are ephemeral and difficult to capture with coarse-grained monitoring, finding the causes is even harder. Figure 9 shows the occurrence of abnormal latency in twenty thousand packets. Only 5% of the packets suffered long latency, and their occurrence had no clear pattern. As the magnitude and occurrence of the abnormal latency were unpredictable, using TC traces alone was not enough to identify the bugs. Since Figure 9(b) shows similar patterns in the long latency, namely long delays in Xen followed by delays in Dom0, we separately instrumented Xen to report its scheduling events and performed side-by-side comparisons of the Xen trace and the TC packet trace. We used event and packet timestamps to correlate the Xen scheduling events with abnormal packets. Considerable manual effort was needed to identify the two additional bugs in the credit scheduler. Through the above examples, we have demonstrated that TC is quite useful for identifying, localizing, and analyzing latency issues, especially long tail latency.

5. Discussions and Future Work

Packet size. TC relies on additional space in the packet payload to store tracing information. For small packets and for packets transferred between VMs on the same host, TC's tracing payload does not increase the number of packets needed by the original application. In the case of cross-machine tracing, an 8-byte space is needed for the last timestamp from the sender. It is possible that this will increase the number of MTUs transferred by the original application; when the network is congested, the additional MTUs could be a source of overhead. For packets larger than the MTU, a NIC that supports the GSO/GRO features will split the large packets into separate MTUs at the sender side, and TC only needs to append an 8-byte timestamp to the last MTU.

Dynamic instrumentation. Currently, we manually add tracepoints in TC, and the Dom0 and DomU kernels need to be recompiled to use new tracepoints. Dynamically instrumenting the Linux kernel is possible by using the extended Berkeley Packet Filter (eBPF). However, dynamic instrumentation cannot manipulate packet payloads and is thereby unable to cross the boundaries between Dom0, DomU, and the hypervisor.

Automated analysis. Although TC provides detailed information about packet processing in virtualized systems, considerable manual effort is still needed to identify the causes of unsatisfactory performance. TC can only correlate the stages on the critical path of I/O processing with the overall network performance. For performance issues due to other resource management schemes, such as VM scheduling and memory allocation, a causal analysis of the TC trace and other system traces is necessary. Statistical approaches that find correlations between the traces are a promising direction towards automated identification of root causes.

Extending TC to disk I/O. TC leverages the commonly shared data structure skb to associate tracing information with individual packets. Virtualized disk I/O is as complex as virtualized network I/O and often suffers poor and unpredictable performance. However, it is more challenging to trace disk I/O across the multiple block I/O stacks in virtualized systems. The challenges are the lack of a shared data structure between the protection domains to pass the tracing information, and the aggressive optimizations of disk I/O at different layers of the virtualized system.

6. Related Work

Monitoring and tracing schemes. There are in general two ways to monitor performance and trace program execution in complex systems. Non-intrusive tracing systems [14, 15, 20, 28-30, 35] leverage existing application logs and performance counters to diagnose performance issues and detect bugs. Annotation-based monitoring [16, 19, 22, 31] offers users more flexibility to selectively trace certain components in the monitored systems. However, both approaches still face the challenge of balancing the tradeoff between the effectiveness of tracing and its overhead. Dynamic instrumentation allows flexible logging and tracing to be installed at runtime. Pivot Tracing [27] leverages aspect-oriented programming to export variables for dynamic tracing and designs a query language to selectively invoke user-defined tracepoints. Similarly, Time Capsule requires tracepoints to be manually inserted by the administrator, but it supports a flexible selection of tracepoints and varying sampling rates. Different from Pivot Tracing, which targets distributed systems, TC is a low-level profiler that focuses on event tracing on the critical path of network I/O processing in virtualized systems.

Tracing across boundaries. Tools are widely used in complex systems to diagnose performance problems, but many of them are limited to certain boundaries. For example, gprof [7] is an application-level tool to analyze UNIX program performance, while SystemTap [12], DTrace [17], and Perf [10] are used to trace source code or collect performance data inside the Linux kernel. Similarly, Xentrace and Xenalyze [6] can only be used to trace events in the Xen hypervisor. To address these limitations, many efforts have been proposed to trace beyond boundaries. Stardust [33] uses breadcrumb records associated with application requests to correlate events in separate components or computers. Whodunit [18] adopts synopses, a compact representation of a transaction context, to profile transactions across distributed machines. Pivot Tracing [27] uses a per-request container, baggage, to correlate logging information with a particular request context. There are also other works that infer the relationship between events in distributed environments [16, 18, 19, 22, 31, 33]. Inspired by [26], TC embeds the tracing information into the payload of network packets and dumps the traces before packets are received by applications. This design offers two advantages. First, events are naturally ordered in the payload, avoiding the effort to causally correlate logs from distributed machines. Second, in virtualized environments, it is not always possible to access tracing events in the privileged domain (e.g., the driver domain) from unprivileged domains (e.g., user VMs) due to security concerns. TC passes tracing information along the I/O path, so analysis only needs to be performed in user VMs.

Latency measurement and analysis. Much effort has been dedicated to analyzing the factors that affect network latency. DARC [34] proposes runtime latency analysis to find the main latency contributors. Li et al. [26] explore the potential causes of tail latency on multi-core machines. Similarly, in virtualized environments, Soroban [32] studies the latency in VMs or containers by using machine learning, and Xu et al. [36] introduce a host-centric solution for improving latency in the cloud. As Software-Defined Networking (SDN) and Network Function Virtualization (NFV) have become popular in recent years, latency measurement and analysis play increasingly important roles in modern networks. For example, relying on the accurate timestamps provided by SoftNIC [23], DX [25] proposes a new congestion control scheme to reduce queueing delay in datacenters. TC is complementary to these approaches and can be used to evaluate their effectiveness.

7. Conclusion

Latency is a critical QoS factor for network applications. However, it is challenging to monitor latency in virtualized systems. This paper presents Time Capsule (TC), an in-band packet profiler that traces packet level latency across different boundaries in virtualized systems. TC incurs negligible overhead to network performance and requires no changes to applications. We demonstrated that fine-grained packet tracing and latency decomposition shed light on latency problems and helped us identify three bugs in Xen's credit scheduler. Guided by TC, we are able to trace and analyze issues that cause long latency in virtualized systems.

Acknowledgements

We are grateful to the anonymous reviewers and the paper shepherd, Dr. Rajesh Krishna Balan, for their constructive comments. This research was supported in part by the U.S. National Science Foundation under grant CNS-1320122 and by the Hong Kong Research Grants Council Collaborative Research Fund under grant C7036-15G.

References

[1] Amazon Elastic Compute Cloud (EC2). http://aws.amazon.com/ec2/.
[2] Code to test various clocks under Linux. https://github.com/btorpey/clocks.
[3] VCPU is mistakenly woken up in Xen's credit scheduler. http://lists.xenproject.org/archives/html/xen-devel/2015-10/msg02853.html.
[4] VCPU which is not put back to the active VCPU list in time will cause unpredictable long tail latency. http://lists.xenproject.org/archives/html/xen-devel/2016-05/msg01362.html.
[5] The BOOST priority might be changed to UNDER before the preemption happens. http://lists.xenproject.org/archives/html/xen-devel/2016-05/msg01362.html.
[6] Tracing with Xentrace and Xenalyze. https://blog.xenproject.org/2012/09/27/tracing-with-xentrace-and-xenalyze/.
[7] GNU gprof. https://sourceware.org/binutils/docs/gprof/.
[8] HPCbench. http://hpcbench.sourceforge.net/.
[9] Netperf. http://www.netperf.org/.
[10] Linux Perf. https://perf.wiki.kernel.org/.
[11] Sockperf. https://github.com/Mellanox/sockperf.
[12] SystemTap. https://sourceware.org/systemtap/.
[13] Xentrace. https://blog.xenproject.org/2012/09/27/tracing-with-xentrace-and-xenalyze/.
[14] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debugging for distributed systems of black boxes. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2003.
[15] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang. Towards highly reliable enterprise network services via inference of multi-level dependencies. In Proceedings of ACM SIGCOMM, 2007.
[16] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for request extraction and workload modelling. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004.
[17] B. Cantrill, M. W. Shapiro, A. H. Leventhal, et al. Dynamic instrumentation of production systems. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC), 2004.
[18] A. Chanda, A. L. Cox, and W. Zwaenepoel. Whodunit: Transactional profiling for multi-tier applications. In Proceedings of the European Conference on Computer Systems (EuroSys), 2007.
[19] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint: Problem determination in large, dynamic internet services. In Proceedings of Dependable Systems and Networks (DSN), 2002.
[20] M. Chow, D. Meisner, J. Flinn, D. Peek, and T. F. Wenisch. The Mystery Machine: End-to-end performance analysis of large-scale internet services. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[21] Ú. Erlingsson, M. Peinado, S. Peter, M. Budiu, and G. Mainar-Ruiz. Fay: Extensible distributed tracing from kernels to clusters. ACM Transactions on Computer Systems (TOCS), 2012.
[22] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica. X-Trace: A pervasive network tracing framework. In Proceedings of the 4th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2007.
[23] S. Han, K. Jang, A. Panda, S. Palkar, D. Han, and S. Ratnasamy. SoftNIC: A software NIC to augment hardware. Technical Report UCB/EECS-2015-155, 2015.
[24] S. P. Kavulya, S. Daniels, K. Joshi, M. Hiltunen, R. Gandhi, and P. Narasimhan. Draco: Statistical diagnosis of chronic problems in large distributed systems. In Proceedings of Dependable Systems and Networks (DSN), 2012.
[25] C. Lee, C. Park, K. Jang, S. Moon, and D. Han. Accurate latency-based congestion feedback for datacenters. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC), 2015.
[26] J. Li, N. K. Sharma, D. R. Ports, and S. D. Gribble. Tales of the tail: Hardware, OS, and application-level sources of tail latency. In Proceedings of the ACM Symposium on Cloud Computing (SoCC), 2014.
[27] J. Mace, R. Roelke, and R. Fonseca. Pivot Tracing: Dynamic causal monitoring for distributed systems. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP), 2015.
[28] K. Nagaraj, C. Killian, and J. Neville. Structured comparative analysis of systems logs to diagnose performance problems. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012.
[29] P. Reynolds, J. L. Wiener, J. C. Mogul, M. K. Aguilera, and A. Vahdat. WAP5: Black-box performance debugging for wide-area systems. In Proceedings of the 15th International Conference on World Wide Web (WWW), 2006.
[30] R. R. Sambasivan, A. X. Zheng, M. De Rosa, E. Krevat, S. Whitman, M. Stroucken, W. Wang, L. Xu, and G. R. Ganger. Diagnosing performance changes by comparing request flows. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2011.
[31] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure. Google research, 2010.
[32] J. Snee, L. Carata, O. R. Chick, R. Sohan, R. M. Faragher, A. Rice, and A. Hopper. Soroban: Attributing latency in virtualized environments. In Proceedings of the 7th USENIX Conference on Hot Topics in Cloud Computing (HotCloud), 2015.
[33] E. Thereska, B. Salmon, J. Strunk, M. Wachs, M. Abd-El-Malek, J. Lopez, and G. R. Ganger. Stardust: Tracking activity in a distributed storage system. In Proceedings of ACM SIGMETRICS, 2006.
[34] A. Traeger, I. Deras, and E. Zadok. DARC: Dynamic analysis of root causes of latency distributions. In Proceedings of ACM SIGMETRICS, 2008.
[35] W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan. Detecting large-scale system problems by mining console logs. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP), 2009.
[36] Y. Xu, M. Bailey, B. Noble, and F. Jahanian. Small is better: Avoiding latency traps in virtualized data centers. In Proceedings of the 4th Annual Symposium on Cloud Computing (SoCC), 2013.
[37] D. Yuan, J. Zheng, S. Park, Y. Zhou, and S. Savage. Improving software diagnosability via log enhancement. ACM Transactions on Computer Systems (TOCS), 2012.
[38] X. Zhao, Y. Zhang, D. Lion, M. F. Ullah, Y. Luo, D. Yuan, and M. Stumm. lprof: A non-intrusive request flow profiler for distributed systems. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
