
Holistic Data Access Optimization for Analytics Reports∗

ABSTRACT

Object-Relational Mappers (ORMs) enable single-language access to both the main-memory data and the database data of an application. Unfortunately, they also lead to performance inefficiencies, especially in analytics applications with information-rich reports involving nested and aggregated results over large data volumes.

Past database research suggests that a report over a database can be modeled by a single semi-structured query, i.e., a query involving nesting and heterogeneity. Thus, the report's data accesses can be holistically optimized through a single query. Collage resolves important practical challenges towards enabling the report-as-query approach and its holistic optimization. In particular, the Collage middleware system models a report page as a SQL++ query. SQL++ extends SQL in two ways: (a) It provides distributed access to an SQL database and the in-memory objects of the web application programming language. (b) The query output has the full generality needed by report templating engines built around HTML and JSON. Indeed, the SQL++ data model is a superset of JSON. The Collage implementation features its own report templating engine, but one can easily utilize Collage's query processor in other HTML- and JSON-based templating engines.

On the query optimization side, Collage expands into analytics queries. The following query optimization problems emerge and are solved. (1) Novel rewritings guarantee the compilation of any SQL++ query into an efficient set-at-a-time query plan, assuming the objects of the application programming language offer object ids that can be compared and efficiently grouped. This guarantee includes analytics queries, i.e., queries that involve aggregation and top-k processing, which were unaddressed by prior works on semi-structured queries. (2) Normalized and denormalized set-at-a-time query plans are formulated and evaluated. The evaluation validates the manyfold speedup over a mainstream ORM. It also shows that normalized-sets plans outperform denormalized-sets plans by manyfold for the case of analytic reports, which was not the case in semi-structured queries without aggregation.

The Collage query processor has been integrated and deployed as part of a complete web application development framework, including a templating engine. Using an online IDE, one can build a wide class of web applications with both the ease of single-language access and the performance of holistic optimization.

1. INTRODUCTION

It has been well known since the 1980s that the impedance mismatch between the SQL database and the application layer is a major waste of time in application development [8]. A developer has to use two languages (SQL and an application programming language, such as Java, Ruby or Python) for data access, and tediously convert between relational tables and objects.

The best practice among practitioners for mitigating the impedance mismatch is to use Object-Relational Mappers (ORMs). With ORMs, developers specify mappings from relational tables into classes of the application programming language. The ORM then automatically translates navigation across objects of these classes into SQL queries, often involving joins. With this object navigation paradigm, ORMs offer single-language access to all data, be they native objects of the application or tuples of the database. The pervasive adoption of ORMs¹, such as Hibernate, Active Record of Ruby-on-Rails (RoR) and CakePHP, is a testimony to the importance of the impedance mismatch problem, and to the desire for single-language access to database tuples and application objects.

While ORMs provide unified single-language access for simple data access patterns, this ease of use often comes at a performance cost for queries in very common reporting applications, especially analytics applications that display nested, aggregated data [18, 17]. As a running example, consider the web application in Figure 1, which has a dashboard page for visualizing and analyzing sales data of a TPC-H [22] database. A user uses the checkboxes to select nations to analyze, and the application stores the selections as in-memory objects within the HTTP session of the application server. The application then displays an HTML table of selected nations, where each row comprises the nation name and the top 3 years of sales revenue.

∗ Supported by NSF III-1018961 and NSF III-1219263, PI'd by Prof Papakonstantinou who is a shareholder of App2You Inc, which commercializes outcomes of this research.

¹ In this paper, code examples are presented in RoR, since RoR is a widely used web framework, and Active Record, its built-in ORM, supports many mainstream database engines. Nevertheless, the discussion applies to other popular ORMs.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

[Figure 1: Analytics application for TPC-H data]

[Figure 2: TPC-H schema: nations(nation_key, name, comment); customers(cust_key, nation_ref, address); orders(order_key, cust_ref, order_date, total_price)]

An ORM induces developers to decompose data access into multiple object navigations (translated by the ORM into multiple queries) that are interspersed among loops and conditionals of imperative languages. This decomposition reduces data access performance: even though each individual query may be optimally executed by the database, the set of queries collectively may not compute the report efficiently. Such inefficiencies can be very large in analytics applications producing reports that involve nesting, joins and aggregation. In particular, this paper focuses on the inefficiencies of tuple-at-a-time queries, where each outer context tuple causes the execution of an inner query that produces nested tuples.

    <!-- Template for checkboxes -->
    <table>
      <% Nation.all.each do |n| %>
        <% if session['selected_nations'][n.nation_key] == true then %>
          <tr>
            <td><%= n.name %></td>
            <td>
              <% aggregates = Order
                    .select('date_part(\'year\', order_date) as order_year, ' +
                            'sum(total_price) as sum_price')
                    .joins(:customer)
                    .where('nation_ref = ?', n.nation_key)
                    .group('date_part(\'year\', order_date)')
                    .order('sum_price DESC')
                    .limit(3) %>
              <!-- Template for bar chart Javascript component -->
            </td>
          </tr>
        <% end %>
      <% end %>
    </table>

Figure 3: RoR page with tuple-at-a-time queries

For example, consider the RoR fragment in Figure 3 that implements Figure 1, and operates over the TPC-H schema in Figure 2.

[...] almost always incurs a manyfold performance penalty due to extraneous sequential scans or repeated random accesses, as will be demonstrated in our experiments. Worse still, the performance penalty ratio increases with the amount of data accessed and visualized, since the number of queries issued depends on the number of nations.

A sophisticated developer who optimizes for performance will re-implement tuple-at-a-time queries as set-at-a-time queries. Two alternatives demonstrated in this paper are:

• Denormalized-sets queries: Queries that retrieve results as denormalized sets. For example, a single query that retrieves the join of nations and top sales.

• Normalized-sets queries: Queries that retrieve results as normalized sets. For example, a query that retrieves nations, and a separate query that retrieves top sales with corresponding nation ids.

    # Copy selected nation keys from memory to database temporary tables
    ActiveRecord::Base.connection.execute(
      'CREATE TEMPORARY TABLE selected_nations (nation_ref INTEGER)' )
    selected_nations.each do |n|
      ActiveRecord::Base.connection.execute(
        "INSERT INTO selected_nations(nation_ref) VALUES ('#{n}')" )
    end
    # Store common sub-expression for selected nations in temporary tables
    ActiveRecord::Base.connection.execute(<<SQL
      CREATE TEMPORARY TABLE temp_nations AS
      SELECT n.nation_key, n.name
      FROM   nations AS n
      JOIN   selected_nations AS s ON s.nation_ref = n.nation_key
    SQL
    )
    # Sub-query to retrieve distinct nation keys
    q1 = 'SELECT DISTINCT nation_key AS nation_key_2 FROM temp_nations'
    # Sub-query to retrieve aggregates en-masse
    q2 = Arel::Table.new(:customers)
      .project('t1.nation_key_2,
                date_part(\'year\', o.order_date) as order_year,
                SUM(o.total_price) AS sum_price')
      .from('customers as c')
      .joins('JOIN orders AS o ON c.cust_key = o.cust_ref ' +
             "JOIN (#{q1}) AS t1 ON c.nation_ref = t1.nation_key_2")
      .group('t1.nation_key_2,
              date_part(\'year\', o.order_date)')
    # Sub-query to partition aggregates by nation key
    q3 = <<SQL
      SELECT t2.*, row_number() OVER (
                     PARTITION BY t2.nation_key_2
                     ORDER BY t2.sum_price
                   ) AS row_id
      FROM (#{q2.to_sql()}) AS t2
    SQL
    # Execute query that retrieves top-3 aggregates in each partition
    top_sales = Order
      .select('t3.nation_key_2, t3.order_year, t3.sum_price')
      .from("(#{q3}) as t3")
      .where('row_id <= 3')
    # Nest aggregates in respective nation keys
    result = Hash.new { |hash, key| hash[key] = {
        :name       => nil,

[Listing from a subsequent figure; its caption and remaining lines are cut off here.]

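As a rough illustration of why the number of issued queries matters, the following sketch contrasts the tuple-at-a-time, normalized-sets and denormalized-sets access patterns on the running example. It is not Collage or Active Record code: `FakeDb`, its methods, and the sample `NATIONS` and `SALES` data are hypothetical stand-ins that hold rows in memory and merely count simulated query round trips.

```ruby
# Hypothetical in-memory stand-in for a database; it only counts round trips.
class FakeDb
  attr_reader :query_count

  def initialize(nations, sales)
    @nations = nations # rows: { key:, name: }
    @sales = sales     # rows: { nation_key:, year:, revenue: }
    @query_count = 0
  end

  # One simulated query: the selected nations.
  def nations_for(keys)
    @query_count += 1
    @nations.select { |n| keys.include?(n[:key]) }
  end

  # One simulated query: top-k yearly revenues for each given nation key.
  def top_sales_for(keys, k)
    @query_count += 1
    keys.flat_map { |key| top_k(key, k) }
  end

  # One simulated query: join of selected nations with their top-k sales.
  def nations_join_top_sales(keys, k)
    @query_count += 1
    @nations.select { |n| keys.include?(n[:key]) }.flat_map do |n|
      top_k(n[:key], k).map { |s| { name: n[:name], year: s[:year], revenue: s[:revenue] } }
    end
  end

  private

  def top_k(key, k)
    @sales.select { |s| s[:nation_key] == key }.max_by(k) { |s| s[:revenue] }
  end
end

# Tuple-at-a-time: one inner query per outer nation tuple.
def tuple_at_a_time(db, keys, k)
  db.nations_for(keys).map do |n|
    top = db.top_sales_for([n[:key]], k).map { |s| [s[:year], s[:revenue]] }
    { name: n[:name], top: top }
  end
end

# Normalized sets: one query per nesting level, nested in the application.
def normalized_sets(db, keys, k)
  nations = db.nations_for(keys)
  by_nation = db.top_sales_for(keys, k).group_by { |s| s[:nation_key] }
  nations.map do |n|
    { name: n[:name], top: (by_nation[n[:key]] || []).map { |s| [s[:year], s[:revenue]] } }
  end
end

# Denormalized sets: a single joined result, regrouped in the application.
def denormalized_sets(db, keys, k)
  db.nations_join_top_sales(keys, k)
    .group_by { |r| r[:name] }
    .map { |name, rows| { name: name, top: rows.map { |r| [r[:year], r[:revenue]] } } }
end

# Made-up sample data for illustration only.
NATIONS = [{ key: 1, name: 'FRANCE' }, { key: 2, name: 'JAPAN' }, { key: 3, name: 'PERU' }].freeze
SALES = [
  { nation_key: 1, year: 1992, revenue: 10 }, { nation_key: 1, year: 1993, revenue: 40 },
  { nation_key: 1, year: 1994, revenue: 30 }, { nation_key: 1, year: 1995, revenue: 20 },
  { nation_key: 2, year: 1992, revenue: 5 },  { nation_key: 2, year: 1993, revenue: 15 },
  { nation_key: 2, year: 1994, revenue: 25 }, { nation_key: 2, year: 1995, revenue: 35 },
  { nation_key: 3, year: 1993, revenue: 7 }
].freeze

%i[tuple_at_a_time normalized_sets denormalized_sets].each do |strategy|
  db = FakeDb.new(NATIONS, SALES)
  send(strategy, db, [1, 2], 3)
  puts "#{strategy}: #{db.query_count} simulated queries"
end
```

With two selected nations, the three strategies return the same nested report but issue three, two and one simulated queries respectively; only the tuple-at-a-time count grows with the number of selected nations. The sketch says nothing about per-query cost, which is what the paper's experiments measure.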