SQL Query Optimization for Highly Normalized Big Data
Nikolay I. GOLOV
Lecturer, Department of Business Analytics, School of Business Informatics, Faculty of Business and Management, National Research University Higher School of Economics
Address: 20, Myasnitskaya Street, Moscow, 101000, Russian Federation
E-mail: [email protected]

Lars RONNBACK
Lecturer, Department of Computer Science, Stockholm University
Address: SE-106 91 Stockholm, Sweden
E-mail: [email protected]

This paper describes an approach for fast ad-hoc analysis of Big Data inside a relational data model. The approach strives to achieve maximal utilization of highly normalized temporary tables through the merge join algorithm. It is designed for the Anchor modeling technique, which requires a very high level of table normalization. Anchor modeling is a novel data warehouse modeling technique, designed for classical databases and adapted by the authors of the article for a Big Data environment and a massively parallel processing (MPP) database. Anchor modeling provides flexibility and high speed of data loading, where the presented approach adds support for fast ad-hoc analysis of Big Data sets (tens of terabytes). Different approaches to query plan optimization are described and estimated, for row-based and column-based databases. Theoretical estimations and results of real data experiments carried out in a column-based MPP environment (HP Vertica) are presented and compared. The results show that the approach is particularly favorable when the available RAM resources are scarce, so that a switch is made from pure in-memory processing to spilling over to hard disk while executing ad-hoc queries. Scaling is also investigated by running the same analysis on different numbers of nodes in the MPP cluster. Configurations of five, ten and twelve nodes were tested, using click stream data of Avito, the biggest classifieds site in Russia.

Key words: Big Data, massively parallel processing (MPP), database, normalization, analytics, ad-hoc, querying, modeling, performance.

Citation: Golov N.I., Ronnback L. (2015) SQL query optimization for highly normalized Big Data. Business Informatics, no. 3 (33), pp. 7–14.

Introduction

Big Data analysis is one of the most popular IT tasks today. Banks, telecommunication companies, and big web companies, such as Google, Facebook, and Twitter, produce tremendous amounts of data. Moreover, nowadays business users know how to monetize such data [2]. Various artificial intelligence marketing techniques can transform big customer behavior data into millions and billions of dollars. However, implementations and platforms fast enough to execute various analytical queries over all available data remain the main issue. Until now, Hadoop has been considered the universal solution, but Hadoop has its drawbacks, especially in speed and in its ability to process difficult queries, such as analyzing and combining heterogeneous data [6].

This paper introduces a new data processing approach, which can be implemented inside a relational DBMS. The approach significantly increases the volume of data that can be analyzed within a given time frame. It has been implemented for fast ad-hoc query processing inside the column-oriented DBMS Vertica [7]. With Vertica, this approach allows data scientists to perform fast ad-hoc queries, processing terabytes of raw data in minutes, dozens of times faster than this database can normally operate. Theoretically, it can also increase the performance of ad-hoc queries inside other types of DBMS (experiments are planned within the framework of further research).

The approach is based on the new database modeling technique called Anchor modeling. Anchor modeling was first implemented to support a very high speed of loading new data into a data warehouse and to support fast changes in the logical model of the domain area, such as the addition of new entities, new relationships between them, and new attributes of the entities. Later, it turned out to be extremely convenient for fast ad-hoc queries processing high volumes of data, from hundreds of gigabytes up to tens of terabytes.

The paper is structured as follows. Since not everyone may be familiar with Anchor modeling, Section 1 explains its main principles, Section 2 discusses the aspects of data distribution in a massively parallel processing (MPP) environment, Section 3 defines the scope of analytical queries, and Section 4 introduces the main principles of the query optimization approach for analytical queries. Section 5 discusses the business impact this approach has had on Avito, and the paper is concluded in the final section.

1. Anchor Modeling

Anchor modeling is a database modeling technique based on the usage of the sixth normal form (6NF) [9]. Modeling in 6NF yields the maximal level of table decomposition, so that a table in 6NF has no non-trivial join dependencies. That is why tables in 6NF usually contain as few columns as possible. The following constructs are used in Anchor modeling:

1. Anchor, a table of keys. Each logical entity of the domain area must have a corresponding anchor table. This table contains unique and immutable identifiers of objects of a given entity (surrogate keys). Customer and Order are two example anchors. An anchor table may also contain technical columns, such as metadata.

2. Attribute, a table of attribute values. This table stores the values of a logical entity's attributes that cannot be described as entities of their own. The Name of a Customer and the Sum of an Order are two example attributes. An attribute table must contain at least two columns, one for the entity key and one for the attribute value. If an entity has three attributes, three separate attribute tables have to be created.

3. Tie, a table of entity relationships. For example, which Customer placed a certain Order is typical for a tie. Tie tables must contain at least two columns, one for the first entity key (that of the Customer) and one for the second (that of the related Order).

[Fig. 1. Anchor modeling sample data model. The figure shows the Order and Customer anchors (surrogate keys such as Ord ID 12, 26, 35 and Cust ID 4, 8, 98), the Sum and Name attribute tables (e.g. order 12 with sum 150, customer 4 named Dell), and the "placed by" tie pairing Ord ID with Cust ID.]

Ties and attributes can store historical values using the principles of slowly changing dimensions type 2 (SCD 2) [4]. A distinctive characteristic of Anchor modeling is that historicity is provided by a single date-time field (the so-called FROM DATE), not two fields (FROM DATE – TO DATE), as in the Data Vault modeling methodology. If historicity is provided by a single field (FROM DATE), then the second border of the interval (TO DATE) can be estimated as the FROM DATE of the next sequential value, or NULL if the next value is absent.
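To illustrate these constructs, the model in Fig. 1 can be sketched as a handful of narrow tables, and the implicit TO DATE of a historized attribute can be recovered with a window function. The following sketch is ours rather than the paper's; table and column names (customer_anchor, customer_name, order_sum and so on) are hypothetical, and the SQL is generic enough to run on Vertica as well as most other DBMSs.

    -- Anchors: one table of surrogate keys per logical entity.
    CREATE TABLE customer_anchor (cust_id INTEGER NOT NULL PRIMARY KEY);
    CREATE TABLE order_anchor    (ord_id  INTEGER NOT NULL PRIMARY KEY);

    -- Attributes: one table per attribute (6NF), holding the entity key,
    -- the value, and a single FROM DATE; no TO DATE column is stored.
    CREATE TABLE customer_name (
        cust_id   INTEGER      NOT NULL,
        name      VARCHAR(100) NOT NULL,
        from_date TIMESTAMP    NOT NULL
    );
    CREATE TABLE order_sum (
        ord_id    INTEGER        NOT NULL,
        sum_value NUMERIC(18, 2) NOT NULL,
        from_date TIMESTAMP      NOT NULL
    );

    -- Tie: the "placed by" relationship between Order and Customer.
    CREATE TABLE order_placed_by_customer (
        ord_id  INTEGER NOT NULL,
        cust_id INTEGER NOT NULL
    );

    -- Estimating TO DATE as the FROM DATE of the next sequential value
    -- of the same attribute, or NULL if the next value is absent.
    SELECT cust_id,
           name,
           from_date,
           LEAD(from_date) OVER (PARTITION BY cust_id
                                 ORDER BY from_date) AS to_date
    FROM customer_name;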
2. Data distribution

Massively parallel processing databases generally have shared-nothing scale-out architectures, so that each node holds some subset of the database and enjoys a high degree of autonomy with respect to executing parallelized parts of queries. For each workload posed to a cluster, the most efficient utilization occurs when the workload can be split equally among the nodes and each node can perform as much of its assigned work as possible without the involvement of other nodes. In short, data transfer between nodes should be kept to a minimum. Given arbitrarily modeled data, achieving maximum efficiency for every query is unfeasible due to the amount of duplication that would be needed on the nodes, while also defeating the purpose of a cluster. However, with Anchor modeling, a balance is found where most data is distributed and the right amount of data is duplicated in order to achieve high performance for most queries. The following three distribution schemes are commonly present in MPP databases, and they suitably coincide with Anchor modeling constructs.

Broadcast the data to all nodes. Broadcasted data is available locally, duplicated, on every node in the cluster. Knots (small lookup tables of shared, immutable values in Anchor modeling) are broadcasted, since they normally take up little space and may be joined from both attributes and ties. In order to assemble an instance with knotted attributes or relationships represented through knotted ties without involving other nodes, the knot values must be available locally. Thanks to the uniqueness constraint on knot values, query conditions involving them can be resolved very quickly and locally, to reduce the row count of intermediate result sets in the execution of the query.

Segment the data according to some operation that splits it across the nodes. Segmenting attribute and tie tables on a hash of the entity key keeps all rows belonging to one instance on the same node, so that all joins an instance takes part in can be resolved without the involvement of other nodes, and joining a tie will not require any sort operations. Therefore, a tie will require no more than one additional projection.

The usage of a single date-time field for historicity (SCD 2) gives a significant benefit: it does not require updates. New values can be added exclusively by inserts. This feature is very important for column-based databases, which are designed for fast select and insert operations but are extremely slow for delete and update operations. Single date-time field historicity also requires efficient analytical queries to estimate the closing date for each value. The distribution strategy described above guarantees that all values for a single anchor are stored on a single node and properly sorted, so closing date estimation can be extremely efficient.
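In Vertica, the paper's experimental platform, these two schemes map onto unsegmented and segmented projections. The sketch below is our illustration, not code from the paper; the table and projection names are hypothetical, and it assumes Vertica's CREATE PROJECTION syntax.

    -- Broadcast: a small knot table is duplicated on every node.
    CREATE PROJECTION status_knot_p AS
    SELECT status_id, status_value
    FROM status_knot
    ORDER BY status_id
    UNSEGMENTED ALL NODES;

    -- Segmentation: an attribute table is split across nodes by a hash
    -- of the entity key and kept sorted on (key, FROM DATE), so merge
    -- joins and closing date estimation need no extra sort or network
    -- transfer.
    CREATE PROJECTION customer_name_p AS
    SELECT cust_id, name, from_date
    FROM customer_name
    ORDER BY cust_id, from_date
    SEGMENTED BY HASH(cust_id) ALL NODES;

With such projections, every row of customer_name for a given cust_id resides on one node in from_date order, so the LEAD-based closing date estimation shown in Section 1 can run locally on each node.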
3. Analytical queries

Databases can be utilized to perform two types of tasks: OLTP tasks or OLAP tasks. OLTP means online transaction processing: a huge number of small insert, update, delete and select operations, where each operation processes a small chunk of data. OLTP tasks are typical for operational databases. OLAP means online analytical processing: a relatively small number of complex analytical queries, each of which processes a large share of the available data.
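As a hedged illustration of the contrast, reusing the hypothetical tables introduced above (not an example from the paper):

    -- OLTP style: a point operation touching a single row.
    INSERT INTO order_sum (ord_id, sum_value, from_date)
    VALUES (12, 150, CURRENT_TIMESTAMP);

    -- OLAP style: an ad-hoc analytical query scanning and joining whole
    -- tables, e.g. the total order sum per customer name (attribute
    -- history is ignored here for brevity).
    SELECT cn.name,
           SUM(os.sum_value) AS total_sum
    FROM order_placed_by_customer t
    JOIN order_sum     os ON os.ord_id  = t.ord_id
    JOIN customer_name cn ON cn.cust_id = t.cust_id
    GROUP BY cn.name;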