High Volume Computing: Identifying and Characterizing Throughput Oriented Workloads in Data Centers
Jianfeng Zhan, Lixin Zhang, Ninghui Sun, Lei Wang, Zhen Jia, and Chunjie Luo
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Email: {zhanjianfeng, zhanglixin, [email protected], {wl, jiazhen, [email protected]

Abstract—For the first time, this paper systematically identifies three categories of throughput-oriented workloads in data centers: services, data processing applications, and interactive real-time applications, whose targets are to increase the volume of throughput in terms of processed requests or data, or the supported maximum number of simultaneous subscribers, respectively, and we coin a new term, high volume computing (in short, HVC), to describe those workloads and the data center computer systems designed for them. We characterize and compare HVC with other computing paradigms, e.g., high throughput computing, warehouse-scale computing, and cloud computing, in terms of levels, workloads, metrics, coupling degree, data scales, and number of jobs or service instances. We also preliminarily report our ongoing work on the metrics and benchmarks for HVC systems, which is the foundation of designing innovative data center computer systems for HVC workloads.

Keywords—High volume computing; throughput-oriented workloads; data center computer systems; metrics; benchmarks

I. INTRODUCTION

In the past decade, there have been three trends in computing domains. First, more and more services, involving large amounts of data, are deployed in data centers to serve the masses, e.g., the Google search engine and Google Maps. Second, massive data are produced, stored, and analyzed in real time or offline. According to the annual survey of the global digital output by IDC, the total amount of global data surpassed 1.2 zettabytes in 2010. In this paper, we call applications that produce, store, and analyze massive data "data processing applications", which are also referred to as big data applications. Third, many users rely on streaming media or VoIP for entertainment or communication. Unlike an ordinary Web server, a VoIP application maintains a long user session (e.g., more than five minutes) while guaranteeing real-time quality of service; we call such an application an interactive real-time application.

The workloads mentioned above consist of a large number of loosely coupled jobs rather than a single big, tightly coupled job. The nature of this class of workloads is throughput-oriented, and the target of the data center computer systems designed for them is to increase the volume of throughput in terms of processed requests (for services), processed data (for data processing applications), or the maximum number of simultaneous subscribers (for interactive real-time applications), performed or supported in data centers. To draw attention to this class of workloads and the computer systems designed for them, in this paper we coin a new term, high volume computing (nature: throughput computing; target: high volume; in short, HVC), to describe this class of workloads and the data center computer systems designed for them.

In this paper, we identify three categories of workloads in HVC: services, data processing applications, and interactive real-time applications, all of which are throughput-oriented. A service is a group of applications that collaborate to receive user requests and return responses to end users. More and more emerging services are data-intensive, e.g., the Google search engine and Google Maps. Data processing applications produce, store, and analyze massive data; we only focus on loosely coupled data processing applications, excluding tightly coupled data-intensive computing, e.g., applications written in MPI. Typical examples are MapReduce- or Dryad-based computing. We also include data stream applications, which process continuous, unbounded streams of data in real time, in the second category of HVC workloads. Unlike an ordinary Web server, an interactive real-time application maintains a long user session while guaranteeing real-time quality of service. Typical interactive real-time applications include streaming media, desktop clouds [33], and Voice over IP (VoIP) applications. The details of the three categories of workloads can be found in Section II-A.

Although several computing paradigms are not formally or clearly defined, e.g., warehouse-scale computing (WSC) [4], data-intensive scalable computing (DISC) [28], and cloud computing, we compare HVC with several computing paradigms in terms of six dimensions: levels, workloads, metrics, coupling degree, data scales, and number of jobs or service instances, as shown in Table I. Our definition of HVC targets data center computer systems, while many task computing [14] and high throughput computing [26] are defined in terms of runtime systems. In terms of workloads and the respective metrics, both high throughput computing and high performance computing are about scientific computing centering around floating point operations, while most HVC applications have few floating point operations, as uncovered in our preliminary work [3]. Meanwhile, we also notice that many emerging workloads can be included in one or two categories of HVC workloads, e.g., WSC (in the first and second categories) and DISC (in the second category). As for (public) cloud computing, we believe it is basically a business model of renting computing or storage resources that heavily relies upon virtualization technologies, whereas HVC is defined in terms of workloads. We think many well-known workloads in the cloud [5] can be included in HVC, but HPC-in-the-cloud workloads [1] [38] are excluded, since they are tightly coupled.

After widely investigating previous benchmarks, we found that there is no systematic work on benchmarking data center computer systems in the context of our identified three categories of throughput-oriented workloads. We present our preliminary work on the metrics and benchmarks for HVC systems.

The remainder of the paper is organized as follows. Section II characterizes and compares different computing paradigms. Section III revisits previous benchmarks and reports our preliminary work on the HVC metrics and benchmarks. Section IV draws a conclusion.

II. CHARACTERIZING COMPUTING PARADIGMS

In this section, we define HVC and identify its key differences from other computing paradigms.

A. What is HVC?

HVC is a data-center-based computing paradigm focusing on throughput-oriented workloads. The target of a data center computer system designed for HVC workloads is to increase the volume of throughput in terms of requests, processed data, or the maximum number of simultaneous subscribers, which are performed or supported in a data center.

In Table I, we characterize HVC from six dimensions: levels, workloads, metrics, coupling degree, data scales, and number of jobs or service instances.

Table I: Characterizing different computing paradigms.

Computing paradigm | Level | Workloads | Metrics | Coupling degree | Data scale | # of jobs or service instances
High performance computing | Supercomputers | Scientific computing: heroic MPI applications | Floating point operations per second | Tight | n/a | Low
High performance throughput computing [27] | Processors | Traditional server workloads | Overall work performed over a fixed time period | Loose | n/a | Low
High throughput computing [26] | Distributed runtime systems | Scientific computing | Floating point operations per month | Loose | n/a | Medium
Many task computing [14] | Runtime systems | Scientific computing or data analysis: workflow jobs | Tasks per second | Tight or loose | n/a | Large
Data-intensive scalable computing [28] or data center computing [29] | Runtime systems | Data analysis: MapReduce-like jobs | n/a | Loose | Large | Large
Warehouse-scale computing [4] | Data centers for Internet services, belonging to a single organization | Very large Internet services | n/a | Loose | Large | Large
Cloud computing [34] | Hosted data centers [15] | SaaS + utility computing | n/a | Loose | n/a | Large
High volume computing (HVC) | Data centers | Services | Requests per minute and joule | Loose | Medium | Large
High volume computing (HVC) | Data centers | Data processing applications | Data processed per minute and joule | Loose | Large | Large
High volume computing (HVC) | Data centers | Interactive real-time applications | Maximum number of simultaneous subscribers and subscribers per watt | Loose | From medium to large | Large

The HVC system is defined on a data center level. We identify three categories of workloads in HVC: services, data processing applications, and interactive real-time applications.

Services belong to the first category of HVC workloads. A service is a group of applications that collaborate to receive user requests and return responses to end users. We call a group of applications that independently process requests a service instance. For a large Internet service, a large number of service instances are deployed, with request distribution handled by load balancers. Since each request is independent, a service in itself is loosely coupled. For an ordinary Apache Web server, the data scale is small, while for a search engine provided by Google, the data scale is large. More and more emerging services are data-intensive.

The second category of HVC workloads is data

data stream applications into this category of HVC workloads. For example, S4 is a platform that allows programmers to easily develop applications for processing continuous unbounded streams of data [37]. The third
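Table I names the HVC metrics (requests per minute and joule, data processed per minute and joule, subscribers per watt) but does not define them formally. A minimal sketch of one plausible reading, throughput rate divided by energy (or power), follows; all function names and the example numbers are illustrative assumptions, not from the paper:

```python
def requests_per_minute_per_joule(requests: int, seconds: float, joules: float) -> float:
    """Service metric: processed requests per minute, normalized by energy consumed."""
    return (requests / (seconds / 60.0)) / joules

def data_per_minute_per_joule(bytes_processed: int, seconds: float, joules: float) -> float:
    """Data processing metric: bytes processed per minute, normalized by energy consumed."""
    return (bytes_processed / (seconds / 60.0)) / joules

def subscribers_per_watt(max_subscribers: int, watts: float) -> float:
    """Interactive real-time metric: maximum simultaneous subscribers per watt of power."""
    return max_subscribers / watts

# Hypothetical example: a service handles 1,200,000 requests in 10 minutes
# while the data center consumes 3.0e6 joules.
print(requests_per_minute_per_joule(1_200_000, 600, 3.0e6))  # → 0.04
```

Under this reading, the three metrics share one shape (useful work per unit time per unit energy), differing only in what counts as work for each workload category.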