10-Millisecond Computing

Gang Lu⋆, Jianfeng Zhan∗,‡, Tianshu Hao∗, and Lei Wang∗,‡
⋆Beijing Academy of Frontier Science and Technology, ∗Institute of Computing Technology, Chinese Academy of Sciences, ‡University of Chinese Academy of Sciences
[email protected], [email protected], [email protected], wanglei [email protected]
arXiv:1610.01267v3 [cs.PF], 8 Mar 2017

Abstract—Although computation on data of an unprecedented scale is becoming much more complex, we argue that computers and smart devices should, and will, consistently provide information and knowledge to human beings within a few tens of milliseconds. We coin a new term, 10-millisecond computing, to call attention to this class of workloads. 10-millisecond computing raises many challenges for both the software and hardware stacks. In this paper, using a typical workload—memcached on a 40-core server (a mainstream server in the near future)—we quantitatively measure 10-ms computing's challenges to conventional operating systems. For better communication, we propose a simple metric, the outlier proportion, to measure quality of service: for N completed requests or jobs, if M jobs' or requests' latencies exceed the outlier threshold t, the outlier proportion is M/N. For a 1K-scale system running Linux (version 2.6.32), LXC (version 0.7.5), or XEN (version 4.0.0), respectively, we surprisingly find that in order to reduce the service outlier proportion to 10% (10% of users will feel QoS degradation), the outlier proportion of a single server has to be reduced by 871X, 2372X, and 2372X, accordingly. We also discuss the possible design spaces of 10-ms computing systems from the perspectives of datacenter architectures, networking, OS and scheduling, and benchmarking.

I. INTRODUCTION

Although computation on data of an unprecedented scale is becoming much more complex, in this paper we argue that computers and smart devices should, and will, consistently provide information and knowledge to human beings within a few tens of milliseconds. We coin a new term, 10-millisecond (in short, 10-ms) computing, to call attention to this class of workloads.

First, determined by the nature of the human nervous and motor systems, the timescale of many human activities is on the order of a few hundred milliseconds [17], [10], [11]. For example, in a conversation, the gaps we leave in speech to tell the other person it is "your turn" are only a few hundred milliseconds long [17]; the response time of our visual system to a very brief pulse of light, and its duration, are also in this range. Second, one of the key results from early work on delays in command-line interfaces is that regularity is of vital importance [17], [10], [11]: if people can predict how long they are likely to wait, they are far happier [17], [10], [11]. Third, the experiments in [10] show that perceptual events occurring within a single cycle of this timescale are combined into a single percept if they are sufficiently similar, indicating that our perceptual system cannot resolve much finer intervals. That is to say, much lower latency (i.e., less than 10 milliseconds) means nothing to a human user. So perfect human-computer interaction is determined by human requirements, and should be independent of data scale, task complexity, and the underlying hardware and software systems.

The trend of 10-ms computing has been confirmed by the current internet services industry. Internet service providers will not lower their QoS expectations because of the complexity of the underlying infrastructure. In fact, keeping latency low is of vital importance for attracting and retaining users [15], [11], [27]. Google [34] and Amazon [27] found that moving from a 10-result page loading in 0.4 seconds to a 30-result page loading in 0.9 seconds caused a 20% drop in traffic and revenue; moreover, delaying the page in increments of 100 milliseconds would result in substantial and costly drops in revenue.

The trend of 10-ms computing is also witnessed by other ultra-low latency applications [3], for example, high-frequency trading and internet-of-things applications. These applications are characterized by a request-response loop involving machines instead of humans, and by operations involving multiple parallel requests/RPCs to thousands of servers [3]. Since a service request or a job completes only when all of its sub-requests or tasks are satisfied, the worst-case latency of the individual requests or tasks must be ultra-low to maintain service- or job-level quality of service. Someone may argue that those applications demand ever lower latency. However, as there are end-host stacks, NICs (network interface cards), and switches on the path of an end-to-end application at which a request or response currently experiences delay [3], we believe that in the next decade 10 ms is a reasonable latency goal for most end-to-end applications with ultra-low latency requirements.
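To make the fan-out effect concrete, here is a toy Monte Carlo sketch in Python (not taken from the paper; the per-RPC latency distribution, the 1000-way fan-out, and the trial count are assumed values chosen for illustration). It compares the outlier proportion of a single RPC against that of a job that must wait for the slowest of its parallel RPCs:

```python
# Toy illustration (assumed numbers, not the paper's data): a job issuing many
# parallel RPCs finishes only when the slowest RPC returns, so rare per-RPC
# hiccups become common at the job level.
import random

random.seed(0)
FANOUT = 1000          # parallel sub-requests per job (assumed)
THRESHOLD_MS = 10.0    # 10-ms outlier threshold
TRIALS = 2000

def rpc_latency_ms() -> float:
    # Mostly ~1 ms, with a 0.1% chance of a ~20 ms hiccup (assumed shape).
    return 20.0 if random.random() < 0.001 else random.uniform(0.5, 1.5)

single_outliers = sum(rpc_latency_ms() > THRESHOLD_MS for _ in range(TRIALS))
job_outliers = sum(
    max(rpc_latency_ms() for _ in range(FANOUT)) > THRESHOLD_MS
    for _ in range(TRIALS)
)
print(f"single-RPC outlier proportion: {single_outliers / TRIALS:.4f}")  # ~0.001
print(f"job-level outlier proportion:  {job_outliers / TRIALS:.4f}")     # ~0.63
```

Under these assumptions, a 0.1% chance of a 20 ms hiccup per RPC turns into roughly a 63% chance (1 − 0.999^1000) that a 1000-way job misses the 10-ms threshold, which is why the worst-case latency of individual requests dominates job-level QoS.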
Previous work [36] also demonstrates that it is advantageous to break data-parallel jobs into tiny tasks, each of which completes in hundreds of milliseconds. Ousterhout et al. [36] report a 5.2x improvement in response times due to the use of smaller tasks: tiny tasks alleviate the long wait times seen in today's clusters for interactive jobs—even large batch jobs can be split into small tasks that finish quickly.

However, 10-ms computing raises many challenges to both the software and hardware stacks. In this paper, we quantitatively measure the challenges raised for conventional operating systems. memcached [1] is a popular in-memory key-value store intended for speeding up dynamic web applications by alleviating database load. Its average latency is about tens or hundreds of microseconds. A real-world memcached-based application usually needs to invoke several get or put memcached operations, in addition to many other procedures, to serve a single request, so we choose it as a case study of 10-millisecond computing.
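As a rough illustration of this access pattern, the sketch below assumes a local memcached instance listening on port 11211 and the third-party pymemcache client; the key names, the number of operations per request, and the request count are hypothetical rather than the paper's workload:

```python
# A minimal sketch, assuming memcached on 127.0.0.1:11211 and pymemcache.
# One logical user request triggers several get/set round trips, and the
# request latency is roughly the sum of those operations.
import time
from pymemcache.client.base import Client

client = Client(("127.0.0.1", 11211))

def serve_request(user_id: str) -> float:
    """Serve one logical request with several memcached operations;
    return its latency in milliseconds."""
    start = time.perf_counter()
    client.get(f"profile:{user_id}")                      # read 1 (value unused here)
    session = client.get(f"session:{user_id}")            # read 2
    if session is None:
        client.set(f"session:{user_id}", b"new-session", expire=300)  # write
    client.get(f"recommendations:{user_id}")              # read 3
    return (time.perf_counter() - start) * 1000.0

latencies = sorted(serve_request(str(i)) for i in range(1000))
print("median (ms):", latencies[len(latencies) // 2])
print("p99 (ms):   ", latencies[int(len(latencies) * 0.99)])
```

Although each individual get or set usually completes in tens or hundreds of microseconds, a request that chains several of them plus other application logic can approach the 10-ms budget once scheduling and queueing outliers are taken into account, which motivates the measurements that follow.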
Running memcached on a 40-core Linux server, we found that when the outlier threshold decreases, the outlier proportion of a single server deteriorates significantly. The outlier proportion also deteriorates as the number of cores increases, and the outlier is further amplified by the system scale. For a 1K-scale system running Linux (version 2.6.32), LXC (version 0.7.5), or XEN (version 4.0.0)—a typical configuration in internet services—we surprisingly find that in order to reduce the service outlier proportion to 10% (with an outlier threshold of 100 µs), the outlier proportion of a single server needs to be reduced by 871X, 2372X, and 2372X, accordingly. We also conducted a series of experiments to show that current Linux systems still suffer from poor performance outliers. The operating systems we tested include Linux with different kernels: 1) 2.6.32, an old kernel released five years ago but still popularly used and in long-term maintenance; 2) 3.17.4, a recent kernel released on November 21, 2014; 3) 2.6.35M, a modified version of 2.6.35 integrated with the sloppy counters proposed by Boyd-Wickizer et al. to solve scalability problems and mitigate kernel contention [9], [2]; and 4) representative real-time schedulers, SCHED_FIFO (First In, First Out) and SCHED_RR (Round Robin). This observation indicates that the new challenges are significantly different from the traditional outlier and straggler issues widely investigated in MapReduce and other environments [32], [23], [28], [29], [37]. Furthermore, we discuss the possible design spaces and challenges from the perspectives of datacenter architectures, networking, OS and scheduling, and benchmarking.

Section II formally states the problem. Section III quantitatively measures the OS challenges in terms of reducing the outlier proportion. Section IV discusses the possible design space of 10-ms computing systems from the perspectives of datacenter architectures, networking, OS and scheduling, and benchmarking. Section V summarizes the related work. Section VI draws a conclusion.

... ignore the overhead of merging responses from different sub-requests. Meanwhile, for the case of breaking a large job into tiny tasks, we only consider the simplest scenario—one-round tasks whose results are merged directly—and exclude iterative computation scenarios.

The service- or job-level outlier proportion is defined as follows: for N completed requests or jobs, if M jobs' or requests' latencies exceed the outlier threshold t, e.g., 10 milliseconds, the outlier proportion op_sj(t) is M/N.

According to [15], the service- or job-level outlier proportion is extraordinarily magnified by the system scale SC. The outlier proportion of a single server is op(t) = Pr(T > t) = 1 − Pr(T ≤ t). Assuming the servers are independent of each other, the service- or job-level outlier proportion op_sj(t) is given by Equation 1:

    op_sj(t) = Pr(T_1 ≥ t or T_2 ≥ t, ..., or T_SC ≥ t)            (1)
             = 1 − Pr(T_1 ≤ t) Pr(T_2 ≤ t) ... Pr(T_SC ≤ t)        (2)
             = 1 − Pr(T ≤ t)^SC = 1 − (1 − Pr(T > t))^SC           (3)
             = 1 − (1 − op(t))^SC                                  (4)

When we deploy XEN or LXC/Docker solutions, the service- or job-level outlier proportion is further amplified by the number K of guest OSes or containers deployed on each server:

    op_sj(t) = Pr(T_1 ≥ t or T_2 ≥ t, ..., or T_{SC·K} ≥ t)        (5)
             = 1 − (1 − op(t))^{SC·K}                              (6)

Conversely, to reduce the service- or job-level outlier proportion to op_sj(t), the outlier proportion of a single server must be as low as given by Equation 7:

    op(t) = 1 − (1 − op_sj(t))^{1/SC}                              (7)

For example, a Bing search may access 10,000 index servers [31]. If we need to reduce the service- or job-level ...
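To make Equations 4, 6, and 7 concrete, here is a minimal Python sketch under the same independence assumption; the system scale SC = 1000, the container count K, and the 10% service-level target are illustrative values rather than the paper's measured configurations:

```python
# A minimal sketch of Equations 4, 6, and 7, assuming independent servers.
# SC, K, and the target service-level outlier proportion are illustrative.

def service_outlier(op_single: float, sc: int, k: int = 1) -> float:
    """Equation 4 (k = 1) and Equation 6: service/job-level outlier proportion."""
    return 1.0 - (1.0 - op_single) ** (sc * k)

def required_single_server_outlier(op_service: float, sc: int, k: int = 1) -> float:
    """Equation 7, generalized with K: single-server outlier proportion needed
    to keep the service/job-level outlier proportion at op_service."""
    return 1.0 - (1.0 - op_service) ** (1.0 / (sc * k))

if __name__ == "__main__":
    sc = 1000          # 1K-scale system, as in the paper
    target = 0.10      # 10% service-level outlier proportion
    print(required_single_server_outlier(target, sc))       # ~1.05e-4
    print(required_single_server_outlier(target, sc, k=4))  # ~2.6e-5 with a hypothetical K = 4
    # Sanity check: plugging the required value back in recovers ~10%.
    print(service_outlier(required_single_server_outlier(target, sc), sc))
```

For SC = 1000 and a 10% service-level target, the required single-server outlier proportion is about 1.05 × 10^-4, i.e., only about one request in ten thousand may exceed the threshold; deploying K > 1 guest OSes or containers per server tightens the requirement further.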