Containerized SQL Query Evaluation in a Cloud
Dr. Weining Zhang and David Holland
Department of Computer Science
The University of Texas at San Antonio
{Weining.Zhang, david.holland}@utsa.edu

Abstract—Recent advances in cloud computing and light-weight software container technology open up opportunities to execute data intensive applications where the data is located. Notably, current database services offered on cloud platforms have not fully utilized container technologies. In this paper we present an architecture of a cloud-based, relational database as a service (DBaaS) that can package and deploy a query evaluation plan using light-weight container technology and the underlying cloud storage system. We then focus on an example of how a select-join-project-order query can be containerized and deployed in ZeroCloud. Our preliminary experimental results confirm that a containerized query can achieve a high degree of elasticity and scalability, but effective mechanisms are needed to deal with data skew.

Index Terms—Database, DBaaS, query evaluation, software container, cloud, OpenStack, ZeroVM

I. INTRODUCTION

Recent years have witnessed fast growth of cloud computing. An increasing number of cloud platforms, such as Amazon AWS EC2, Google Compute Engine, and Microsoft Azure¹, are now available to users. A cloud platform provides a combination of services, including infrastructure (IaaS), platform (PaaS), and software (SaaS). These services run on an infrastructure of large numbers of commodity computers connected by high speed networks, delivering economy of scale, elasticity, efficiency, availability, and reliability.

¹ Online at aws.amazon.com, appengine.google.com and azure.microsoft.com, respectively.

To support data processing for cloud users, it is important to provide scalable, reliable, highly available and highly efficient database services (DBaaS) in the cloud. Cloud users currently have three DBaaS options to choose from. (1) Run a traditional database management system (DBMS) inside a virtual machine (VM). The user must administer most system management activities, including software licensing, installation, configuration, management, backup, and recovery. This option is difficult to scale. (2) Use a NoSQL database, e.g. Apache HBase[1], Google BigQuery[2], and Cassandra[3]. These services are highly efficient and scalable for some types of large data, but the user is forced to deal with the lack of the structured data and strong data integrity that are present in relational models. (3) Use a managed SQL database service, e.g. Amazon RDS[4], Google Cloud SQL[5], and MS Azure SQL[6]. Users can rent conventional database servers running in the cloud. This is convenient because the DBaaS takes care of all system maintenance and storage; existing applications can run without modification. However, while the user may scale out by adding more servers, the execution of each query is still bound to a single server and lacks parallelism.

Another noticeable recent technology advance is the rapid adoption of light-weight container technologies, such as Docker[7] and ZeroVM[8]. By running applications in containers, which are easily deployed and quickly instantiated, compute can be moved directly to the data instead of the other way around. Containers offer smaller footprints, less start-up overhead, and secure isolation among tenants[9]. Containerized applications often have better performance because of parallelism. These features have prompted a strong push to integrate container technology into cloud platforms. However, the use of container technology in SQL DBaaS has not been reported in the literature.

In this paper, we present a study on using light-weight containers to evaluate relational query plans. We consider a DBaaS that provides the functions of a traditional SQL database but runs queries in containers inside a cloud storage system. Specifically, the DBaaS stores data using the cloud storage system, which automatically partitions, replicates, and distributes the data. When processing a query, the DBaaS first accepts an SQL query from a client and generates an optimized query evaluation plan in a traditional way. It then containerizes the query plan by identifying a network of compute nodes and assembling executable programs for the compute nodes to run inside containers. The containerized query plan is then deployed to the cloud storage system so that each container executes at the data storage node near its data when feasible. Intermediate results are pipelined into other containers. The execution of containers is load balanced and monitored by the cloud's job scheduler. The final result may be either returned directly to the client, or stored into the cloud storage system.

We focus on the containerization and deployment of a query evaluation plan in such a DBaaS. We present a layered system architecture and show by an example how a query can be containerized and deployed in ZeroCloud[10]. Our preliminary experiments indicate that a containerized query has the potential of achieving scalability for big data sets. However, effective mechanisms are needed to deal with data skew.

The rest of the paper is organized as follows. In Section II, we present the layered architecture of the DBaaS. In Section III, we present a containerizable algorithm example. In Section IV, we present a method to containerize a plan and deploy it into ZeroCloud. In Section V, we present experimental results obtained from running containerized join query plans on ZeroCloud. Finally, we briefly discuss related work in Section VI and conclude the paper in Section VII.

II. A LAYERED ARCHITECTURE OF A DBAAS

As shown in Figure 1, the DBaaS is a set of layers built on top of cloud services. The layers can be loosely coupled in the sense that upper layers use lower layers only through the provided service interface. Thus, different implementations of a service layer will not affect the function of an upper layer.

Figure 1: A Layered Architecture of a DBaaS in Cloud

The Cloud Layer manages the cloud hardware, including compute nodes, storage nodes, and high speed networks, as well as a suite of other cloud services such as compute scheduling, storage replication, security access, networking, messaging, and container hypervisor services. Without loss of generality, we assume that this layer will provide high availability (by data replication), multi-tenant isolation, load-balancing, some consistency guarantee, security, and container management (including creation, scheduling, monitoring, and disposal of containers). These services can be accessed through public cloud platform APIs.

The DBaaS Layer provides the functionality of relational database management, including SQL querying, optimization, transaction processing, and ACID consistency. We assume that the user data and system catalog are stored in the Cloud Layer using its storage service. The DBaaS Layer is itself divided into three sub-layers: a top sub-layer for higher-level data management functions and two lower sub-layers for containerized query evaluation. The two lower sub-layers are the Cloud-Independent QE (Query Evaluation) and the Cloud-Dependent QE.

The top sub-layer receives user queries and generates optimized query evaluation plans. It also manages transaction processing and guarantees ACID consistency. Optimized query plans, represented in a standard format, e.g. an XML representation such as DXL[11], are passed to the Cloud-Independent QE for execution.

The Cloud-Independent QE layer containerizes a given query plan by mapping plan operators into a network of cooperative compute nodes and assembling an executable program for each compute node using code from a library of relational operators. A repository of libraries is maintained by the DBaaS for different query operators.

The Cloud-Dependent QE translates and packages the containerized query plan into an executable archive specific to the Cloud Layer, consisting of programs, dependent libraries, and configuration metadata. The configuration metadata specifies deployment details such as the number of containers, the programs to be executed in containers, and the communication topology among containers. The Cloud-Dependent QE then deploys the package through the service API of the Cloud Layer's Container Hypervisor.

III.

nodes and assemble algorithms for these compute nodes.

There are several methods for the Plan-Assembler to generate a containerized query plan. For example, given a query plan, the Plan-Assembler may take each query operator of the plan as a compute node and assign a parallel algorithm to the compute node to explore intra-operator parallelism. It can then optimize the containerized plan by consolidating some adjacent compute nodes into a single compute node. Alternatively, the Plan-Assembler can assign compute nodes according to a set of translation rules that match a sub-plan with a specific pattern to a specific type of compute node. It can then assemble an algorithm based on the type of the compute node. In either case, the Plan-Assembler must preserve the query plan's overall inter-operator execution order, but parallelize within operators whenever feasible. To do that, the Plan-Assembler needs to specify intra-operator as well as inter-operator network data flows. Once the algorithms for the compute nodes are assembled, the program code for each compute node is composed, compiled, linked, and packaged. At run time, the program code for the compute nodes executes inside containers, and the data flow among compute nodes is realized by communications among containers.

A comprehensive treatment of methods to containerize an arbitrary query plan is beyond the scope of this paper. In the rest of this section, we present an example containerized query plan.

A. Network Topology of a SJPO Query

We consider the following simple Selection-Join-Projection-Ordering (SJPO) query:

$$\delta_{R.A_1,<}\Bigl(\pi_{R.A_1,\,S.B_1}\bigl((\sigma_{R.A_2 \le a}\,R) \bowtie_{R.A_3 = S.B_3} (\sigma_{S.B_2 = b}\,S)\bigr)\Bigr)$$

where $R(A_1, A_2, A_3)$ and $S(B_1, B_2, B_3)$ are two relations, $\sigma$, $\bowtie$, $\pi$, and $\delta$ are selection, join, projec-