Mtbase: Optimizing Cross-Tenant Database Queries

MTBase: Optimizing Cross-Tenant Database Queries Lucas Braun*, Renato Marroqu´ıny, Kai-En Tsayy, Donald Kossmanny] * Oracle Corporation, [email protected], work performed while at ETH ySystems Group, Department of Computer Science, ETH Zurich, fmarenato, tsayk, [email protected] ]Microsoft Corporation, [email protected] ABSTRACT In the last decade, many business applications have moved into the cloud. In particular, the “database-as-a-service” paradigm has be- come mainstream. While existing multi-tenant data management systems focus on single-tenant query processing, we believe that it is time to rethink how queries can be processed across multiple tenants in such a way that we do not only gain more valuable insights, but also at minimal cost. As we will argue in this paper, standard SQL semantics are insufficient to process cross-tenant queries in an unambiguous way, which is why existing systems use other, expen- sive means like ETL or data integration. We first propose MTSQL, a set of extensions to standard SQL, which fixes the ambiguity problem. Next, we present MTBase, a query processing middleware that efficiently processes MTSQL on top of SQL. As we will see, there is a canonical, provably correct, rewrite algorithm from MTSQL to SQL, which may however result in poor query execution performance, even on high-performance database products. We further show that with carefully-designed optimizations, execution times Figure 1: Cross-tenant query processing systems can be reduced in such ways that the difference to single-tenant queries becomes marginal. Finally, there are the two schemes where tenants not only share a 1. INTRODUCTION database, but also the table layout (schema). Either, as for example Indisputably, cloud computing is one of the fastest growing busi- in Apache Phoenix [8], tenants still have their private tables, but nesses related to the field of computer science. Cloud providers these tables share the same (logical) schema (SS), or the data of dif- promise good elasticity, high availability and a fair pay-as-you- ferent tenants is consolidated into shared tables (ST) which is hence go pricing model to their tenants. Moreover, corporations are no the layout with the highest degree of physical and logical sharing. longer required to rely on on-promise infrastructure which is typi- SS and ST layouts are not only used in DaaS, but also in Software-as- cally costly to acquire and maintain. While it is still an open re- a-Service (SaaS) platforms, as for example in Salesforce [38] and search question whether and how these good promises can be kept FlexScheme [10, 11]. The main reason why these systems prefer ST with regard to databases [17, 28], all the big players, like Google over SS is cost [10]. Moreover, if the number of tenants exceeds the [26], Amazon [7], Microsoft [29] and recently Oracle [33], have number of tables a database can hold ,which is typically a number launched their own Database-as-a-Service (DaaS) cloud products. in the range of ten thousands, SS becomes prohibitive. Conversely, arXiv:1703.04290v1 [cs.DB] 13 Mar 2017 All these products host massive amounts of clients and are therefore ST databases can easily accommodate hundred thousands to even multi-tenant. millions of tenants. As pointed out by Chong et al. [15], the term multi-tenant data- An important feature of multi-tenant databases, which we be- base is ambiguous and can refer to a variety of DaaS schemes with lieve did not yet get the attention it deserves, is cross-tenant query different degrees of logical data sharing between tenants. On the processing. One compelling use case is health care where many other hand, as argued by Aulbach et al. [10], multi-tenant databases providers and insurances use the same integrated SaaS application. not only differ in the way how they logically share information be- If the providers would agree to query their joint datasets of (prop- tween tenants, but also how information is physically separated. We erly anonymized) patient data with scientific institutions, this could conclude that the multi-tenancy spectrum consists of four different enable medical research to advance much faster because the data schemes: First, there are DaaS products that offer each tenant her can be queried as soon as it gets in. proper database while relying on physically-shared resources (SR), This paper looks into cross-tenant query processing within the like CPU, network and storage. Examples include SAP HANA [35], scope of SS and ST databases, thereby optimizing a very specific SqlVM [31], RelationalCloud [30] and Snowflake [16]. Next, there sub-problem of data integration (DI). DI, in a broad sense, is about are systems that share databases (SD), but each tenant gets her own finding schema and data mappings between the original schemas of set of tables within such a database, as for in example Azure SQL different data sources and a target schema specified by the client ap- DB [18]. plication [24, 21, 34]. As such, DI techniques are applicable to the 1 entire spectrum of multi-tenant databases because even if tenants fits the cloud’s pay-as-you-go cost model. Specifically, the paper use different schemas or databases, these techniques can identify makes the following contributions: correlations and hence extract useful information. Our work em- braces and builds on top of the latest DI work. More concretely, • It defines the syntax and semantics of MTSQL, a query lan- we optimize conversion functions similar to those used in DI by guage that extends SQL with additional semantics suitable thoroughly analyzing and exploiting their algebraic properties. In for cross-tenant query processing. addition, instead of translating data into a specific client format (and • It presents the design and implementation of MTBase, a da- update periodically), we convert it to any required client format ef- tabase middleware that executes MTSQL on top of any ST ficiently and just-in-time. database. There are several existing approaches to cross-tenant query processing which are summarized in Figure 1: The first approach is • It studies MTSQL-specific optimizations for query execution data warehousing [25] where data is extracted from several data in MTBase. sources (tenant databases/tables), transformed into one common format and finally loaded into a new database where it can be queried • It extends the well-known TPC-H benchmark to run and eval- by the client. This approach has high integration transparency in uate MTSQL workloads. the sense that once the data is loaded, it is in the right format as • It evaluates the performance and the implementation correct- required by the client and she can ask any query she wants. More- ness of MTBase with this benchmark. over, as all data is in a single place, queries can be optimized. On the down-side of this approach, as argued by [13, 32, 9], are costs The rest of this paper is organized as follows: Section 2 defines in terms of both, developing and maintaining such ETL pipelines MTSQL while Section 3 gives an overview on MTBase. Section 4 and maintaining a separate copy of the data. Another disadvantage discusses the MTSQL-specific optimizations, which are validated is data staleness in the presence of frequent updates. in Section 6 using the benchmark presented in Section 5. Section 7 Federated Databases [27, 23] reduce some of these costs by inte- shortly summarizes lines of related work while the paper is con- grating data on demand, i.e. there is no copying. However, mainte- cluded in Section 8. nance costs are still significant as for every new data source a new integrator/wrapper has to be developed. As data resides in different E ttid E emp id E name E role id E reg id E salary E age places (and different formats), queries can only be optimized to a 0 0 Patrick 1 3 50K 30 very small extent (if at all), which is why the degree of integration 0 1 John 0 3 70K 28 transparency is considered sub-optimal. Finally, systems like SAP 0 2 Alice 2 3 150K 46 HANA [35] and Salesforce [38], which are mainly tailored towards 1 0 Allan 1 2 80K 25 1 1 Nancy 2 4 200K 72 single-tenant queries, offer some degree of cross-tenant query pro- 1 2 Ed 0 4 1M 46 cessing, but only through their application logic, which means that the set of queries that can be asked is limited. Employees (tenant-specific), E salary of tenant 0 in USD, E salary of tenant 1 in EUR The reason why none of these approaches tries to use SQL for cross-tenant query processing is that it is ambiguous. Consider, for R ttid R role id R name Re reg id Re name instance the ST datbase in Figure 2, which we are going to use 0 0 phD stud. 0 AFRICA as a running example throught the paper: As soon as we want 0 1 postdoc 1 ASIA to query the joint dataset of tenants 0 and 1 and, for instance, 0 2 professor 2 AUSTRALIA join Employees with Roles, joining on role id alone is not 1 0 intern 3 EUROPE Patrick researcher 1 1 researcher 4 N-AMERICA enough as this would also join with and 1 2 executive 5 S-AMERICA Ed with professor, which is clear nonsense. The obvious so- lution is to add the tenant-ID ttid to the join predicate. On the Roles (tenant-specific) Regions (global) other hand, joining the Employees table with itself on E1.age = E2.age does not require ttid to be present in the join predi- Figure 2: MTSQL database in Basic Layout (ST), cate because it actually makes sense to include results like (Alice, ttids not visible to clients Ed) as they are indeed the same age.

Load more