High Volume Computing: Identifying and Characterizing Throughput Oriented Workloads in Data Centers

Jianfeng Zhan, Lixin Zhang, Ninghui Sun, Lei Wang, Zhen Jia, and Chunjie Luo
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Email: {zhanjianfeng, zhanglixin, snh}@ict.ac.cn, {wl, jiazhen, luochunjie}@ncic.ac.cn

arXiv:1202.6134v2 [cs.DC] 14 Jan 2013

Abstract—For the first time, this paper systematically identifies three categories of throughput oriented workloads in data centers: services, data processing applications, and interactive real-time applications, whose targets are to increase the volume of throughput in terms of processed requests, processed data, or the maximum number of supported simultaneous subscribers, respectively. We coin a new term, high volume computing (in short, HVC), to describe these workloads and the data center computer systems designed for them. We characterize and compare HVC with other computing paradigms, e.g., high throughput computing, warehouse-scale computing, and cloud computing, in terms of levels, workloads, metrics, coupling degree, data scales, and number of jobs or service instances. We also report our preliminary ongoing work on the metrics and benchmarks for HVC systems, which is the foundation for designing innovative data center computer systems for HVC workloads.

Index Terms—High volume computing; throughput-oriented workloads; data center computer systems; metrics; benchmarks.

I. INTRODUCTION

In the past decade, there have been three trends in computing domains. First, more and more services, involving large amounts of data, are deployed in data centers to serve the masses, e.g., the Google search engine and Google Maps. Second, massive data are produced, stored, and analyzed in real time or offline. According to the annual survey of global digital output by IDC, the total amount of global data passed 1.2 zettabytes in 2010. In this paper, we call applications that produce, store, and analyze massive data data processing applications, which are also referred to as big data applications. Third, many users turn to streaming media or VoIP for entertainment or communication. Different from an ordinary Web server, a VoIP application maintains a user session over a long period (e.g., more than five minutes) while guaranteeing real-time quality of service; we call such an application an interactive real-time application.

The workloads mentioned above consist of a large number of loosely coupled jobs rather than one big, tightly coupled job. The nature of this class of workloads is throughput-oriented, and the target of data center computer systems designed for them is to increase the volume of throughput in terms of processed requests (for services), processed data (for data processing applications), or the maximum number of simultaneous subscribers (for interactive real-time applications), performed or supported in data centers. To draw attention to this class of workloads and the computer systems designed for them, we coin a new term, high volume computing (nature: throughput computing; target: high volume; in short, HVC), to describe this class of workloads and the data center computer systems designed for them.

In this paper, we identify three categories of workloads in HVC: services, data processing applications, and interactive real-time applications, all of which are throughput-oriented workloads. A service is a group of applications that collaborate to receive user requests and return responses to end users.
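A service's loose coupling, in which independent requests are spread across independently operating service instances, can be sketched in a few lines. This is a hedged illustration, not code from the paper: the class and function names are ours, and a real deployment would use networked instances behind a dedicated load balancer.

```python
# Minimal sketch of a loosely coupled service: independent service
# instances behind a round-robin dispatcher. All names are illustrative.
from itertools import cycle

class ServiceInstance:
    """A group of applications that independently processes requests."""
    def __init__(self, name):
        self.name = name
        self.processed = 0

    def handle(self, request):
        # Each request is independent, so no coordination with other
        # instances is needed -- the source of the loose coupling.
        self.processed += 1
        return f"{self.name}: response to {request}"

def dispatch(requests, instances):
    """Round-robin request distribution, standing in for a load balancer."""
    rr = cycle(instances)
    return [next(rr).handle(r) for r in requests]

instances = [ServiceInstance(f"instance-{i}") for i in range(3)]
responses = dispatch([f"req-{i}" for i in range(6)], instances)
# Six independent requests spread evenly over three instances.
print([inst.processed for inst in instances])
```

Because no handler depends on any other handler's state, throughput grows by simply deploying more service instances.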
More and more emerging services are data-intensive, e.g., the Google search engine or Google Maps. Data processing applications produce, store, and analyze massive data; we focus only on loosely coupled data processing applications, excluding tightly coupled data-intensive computing, e.g., applications written in MPI. Typical examples are MapReduce- or Dryad-based computing. We also include data stream applications, which process continuous, unbounded streams of data in real time, in the second category of HVC workloads. Different from an ordinary Web server, an interactive real-time application maintains a user session over a long period while guaranteeing real-time quality of service. Typical interactive real-time applications include streaming media, desktop clouds [33], and Voice over IP (VoIP) applications. The details of the three categories of workloads can be found in Section II-A.

Although several computing paradigms are not formally or clearly defined, e.g., warehouse-scale computing (WSC) [4], data-intensive scalable computing (DISC) [28], and cloud computing, we compare HVC with these paradigms along six dimensions: levels, workloads, metrics, coupling degree, data scales, and number of jobs or service instances, as shown in Table I. Our definition of HVC is oriented toward data center computer systems, while many-task computing [14] and high throughput computing [26] are defined in terms of runtime systems. In terms of workloads and their respective metrics, both high throughput computing and high performance computing are about scientific computing centered on floating point operations, while most HVC applications perform few floating point operations, as uncovered in our preliminary work [3]. Meanwhile, we also notice that many emerging workloads fall into one or two categories of HVC workloads, e.g., WSC (into the first and second categories) and DISC (into the second category).

As for (public) cloud computing, we believe it is basically a business model of renting computing or storage resources that relies heavily upon virtualization technologies, whereas HVC is defined in terms of workloads. We think many well-known workloads in the cloud [5] can be included in HVC, but HPC-in-the-cloud workloads [1] [38] are excluded, since they are tightly coupled.

After widely investigating previous benchmarks, we found that there is no systematic work on benchmarking data center computer systems in the context of our identified three categories of throughput-oriented workloads. We present our preliminary work on the metrics and benchmarks for HVC systems.

The remainder of the paper is organized as follows. Section II characterizes and compares different computing paradigms. Section III revisits previous benchmarks and reports our preliminary work on the HVC metrics and benchmarks. Section IV draws a conclusion.

II. CHARACTERIZING COMPUTING PARADIGMS

In this section, we give the definition of HVC and identify its distinguishing differences from other computing paradigms.

A. What is HVC?

HVC is a data center based computing paradigm focusing on throughput-oriented workloads. The target of a data center computer system designed for HVC workloads is to increase the volume of throughput in terms of processed requests, processed data, or the maximum number of simultaneous subscribers, performed or supported in a data center.

In Table I, we characterize HVC along six dimensions: levels, workloads, metrics, coupling degree, data scales, and number of jobs or service instances. An HVC system is defined at the data center level. We identify three categories of workloads in HVC: services, data processing applications, and interactive real-time applications.

Services are the first category of HVC workloads. A service is a group of applications that collaborate to receive user requests and return responses to end users. We call a group of applications that independently process requests a service instance. For a large Internet service, a large number of service instances are deployed, with request distribution enabled by load balancers. Since each request is independent, a service in itself is loosely coupled. For an ordinary Apache Web server the data scale is low, while for a search engine such as Google's the data scale is large. More and more emerging services are data-intensive.

The second category of HVC workloads is data processing applications. Please note that we only include loosely coupled data-intensive computing, e.g., MapReduce jobs,

Typical interactive real-time applications include

TABLE I: Characterizing different computing paradigms.

Computing paradigm | Level | Workloads | Metrics | Coupling degree | Data scale | # jobs or service instances
High performance computing | Super computers | Scientific computing: heroic MPI applications | Floating point operations per second | Tight | n/a | Low
High performance throughput computing [27] | Processors | Traditional server workloads | Overall work performed over a fixed time period | Loose | n/a | Low
High throughput computing [26] | Distributed runtime systems | Scientific computing | Floating point operations per month | Loose | n/a | Medium
Many-task computing [14] | Runtime systems | Scientific computing or data analysis: workflow jobs | Tasks per second | Tight or loose | n/a | Large
Data-intensive scalable computing [28] or data center computing [29] | Runtime systems | Data analysis: MapReduce-like jobs | n/a | Loose | Large | Large
Warehouse-scale computing [4] | Data centers for Internet services, belonging to a single organization | Very large Internet services | n/a | Loose | Large | Large
Cloud computing [34] | Hosted data centers [15] | SaaS + utility computing | n/a | Loose | n/a | Large
High volume computing (HVC) | Data centers | Services | Requests per minute and per joule | Loose | Medium | Large
High volume computing (HVC) | Data centers | Data processing applications | Data processed per minute and per joule | Loose | Large | Large
High volume computing (HVC) | Data centers | Interactive real-time applications | Maximum number of simultaneous subscribers, and subscribers per watt | Loose | Medium to large | Large
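The HVC metrics in Table I pair a throughput rate (requests or data processed per minute) with an energy-efficiency counterpart (per joule, or subscribers per watt). The following is a minimal sketch of how such metrics could be derived from raw measurements; the function names and the measurement numbers are hypothetical, not taken from the paper.

```python
# Sketch of the HVC metrics from Table I: each workload category pairs a
# throughput rate with an energy-efficiency counterpart. The measurement
# values used below are hypothetical, for illustration only.

def requests_per_minute(requests, seconds):
    """Service metric: processed requests per minute."""
    return requests * 60.0 / seconds

def requests_per_joule(requests, joules):
    """Service metric: processed requests per joule of energy consumed."""
    return requests / joules

def data_per_minute(bytes_processed, seconds):
    """Data processing metric: bytes processed per minute."""
    return bytes_processed * 60.0 / seconds

def data_per_joule(bytes_processed, joules):
    """Data processing metric: bytes processed per joule."""
    return bytes_processed / joules

def subscribers_per_watt(max_subscribers, watts):
    """Interactive real-time metric: simultaneous subscribers per watt."""
    return max_subscribers / watts

# Hypothetical one-minute measurement window:
print(requests_per_minute(120000, 60))    # 120000.0
print(requests_per_joule(120000, 30000))  # 4.0
print(subscribers_per_watt(50000, 2500))  # 20.0
```

Reporting throughput jointly with energy efficiency is what distinguishes these metrics from pure rate measures such as floating point operations per second in the other rows of Table I.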