
SQL QUERY OPTIMIZATION FOR HIGHLY NORMALIZED BIG DATA

Nikolay I. GOLOV
Lecturer, Department of Business Analytics, School of Business Informatics, Faculty of Business and Management, National Research University Higher School of Economics
Address: 20, Myasnitskaya Street, Moscow, 101000, Russian Federation
E-mail: [email protected]

Lars RONNBACK
Lecturer, Department of Computer Science, Stockholm University
Address: SE-106 91 Stockholm, Sweden
E-mail: [email protected]

This paper describes an approach for fast ad-hoc analysis of Big Data inside a relational data model. The approach strives to achieve maximal utilization of highly normalized temporary tables through the merge join algorithm. It is designed for the Anchor modeling technique, which requires a very high level of normalization. Anchor modeling is a novel modeling technique, designed for classical databases and adapted by the authors of the article for the Big Data environment and massively parallel processing (MPP) databases. Anchor modeling provides flexibility and high speed of data loading, where the presented approach adds support for fast ad-hoc analysis of Big Data sets (tens of terabytes). Different approaches to query plan optimization are described and estimated, for column-based and row-based databases. Theoretical estimations and results of real data experiments carried out in a column-based MPP environment (HP Vertica) are presented and compared. The results show that the approach is particularly favorable when the available RAM resources are scarce, so that a switch is made from pure in-memory processing to spilling over to hard disk while executing ad-hoc queries. Scaling is also investigated by running the same analysis on different numbers of nodes in the MPP cluster. Configurations of five, ten and twelve nodes were tested, using click stream data of Avito, the biggest classified ads site in Russia.

Key words: Big Data, massively parallel processing (MPP), database, normalization, analytics, ad-hoc, querying, modeling, performance.

Citation: Golov N.I., Ronnback L. (2015) SQL query optimization for highly normalized Big Data. Business Informatics, no. 3 (33), pp. 7–14.

Introduction

Big Data analysis is one of the most popular IT tasks today. Banks, telecommunication companies, and big web companies, such as Google, Facebook, and Twitter, produce tremendous amounts of data. Moreover, nowadays business users know how to monetize such data [2]. Various artificial intelligence marketing techniques can transform big customer behavior data into millions and billions of dollars. However, implementations and platforms fast enough to execute various analytical queries over all available data remain the main issue. Until now, Hadoop has been considered the universal solution. But Hadoop has its drawbacks, especially in speed and in its ability to process difficult queries, such as analyzing and combining heterogeneous data [6].


This paper introduces a new data processing approach, which can be implemented inside a relational DBMS. The approach significantly increases the volume of data that can be analyzed within a given time frame. It has been implemented for fast ad-hoc query processing inside the column-oriented DBMS Vertica [7]. With Vertica, this approach allows data scientists to perform fast ad-hoc queries, processing terabytes of raw data in minutes, dozens of times faster than this database can normally operate. Theoretically, it can increase the performance of ad-hoc queries inside other types of DBMS, too (experiments are planned within the framework of further research).

The approach is based on the new database modeling technique called Anchor modeling. Anchor modeling was first implemented to support a very high speed of loading new data into a data warehouse and to support fast changes in the logical model of the domain area, such as the addition of new entities, new relationships between them, and new attributes of the entities. Later, it turned out to be extremely convenient for fast ad-hoc queries processing high volumes of data, from hundreds of gigabytes up to tens of terabytes.

The paper is structured as follows. Since not everyone may be familiar with Anchor modeling, Section 1 explains its main principles, Section 2 discusses the aspects of data distribution in a massively parallel processing (MPP) environment, Section 3 defines the scope of analytical queries, and Section 4 introduces the main principles of the query optimization approach for analytical queries. Section 5 discusses the business impact this approach has at Avito, and the paper is concluded in the final section.

1. Anchor Modeling

Anchor modeling is a database modeling technique based on the usage of the sixth normal form (6NF) [9]. Modeling in 6NF yields the maximal level of table decomposition, so that a table in 6NF has no non-trivial join dependencies. That is why tables in 6NF usually contain as few columns as possible. The following constructs are used in Anchor modeling:

1. Anchor, a table of keys. Each logical entity of the domain area must have a corresponding Anchor table. This table contains the unique and immutable identifiers of the objects of a given entity (surrogate keys). Customer and Order are two example anchors. An anchor table may also contain technical columns.

2. Attribute, a table of attribute values. This table stores the values of a logical entity's attributes that cannot be described as entities of their own. The Name of a Customer and the Sum of an Order are two example attributes. An attribute table must contain at least two columns, one for the entity key and one for the attribute value. If an entity has three attributes, three separate attribute tables have to be created.

3. Tie, a table of entity relationships. For example, which Customer placed a certain Order is typical for a tie. Tie tables must contain at least two columns, one for the first entity key (that of the Customer) and one for the second (that of the related Order).

Fig. 1. Anchor modeling sample data model (the anchors Order and Customer, the attributes Sum and Name, and the tie «placed», shown with sample rows)
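As an illustration of these constructs, the sample model of Fig. 1 could be declared roughly as follows. This is a minimal sketch: the table and column names are chosen here for readability and are not prescribed by Anchor modeling, and technical (metadata) columns and historization are omitted.

-- Anchors: one narrow table of surrogate keys per entity
CREATE TABLE anchor_order    (ord_id  INTEGER NOT NULL PRIMARY KEY);
CREATE TABLE anchor_customer (cust_id INTEGER NOT NULL PRIMARY KEY);

-- Attributes: one table per attribute, key plus value (6NF decomposition)
CREATE TABLE attr_order_sum     (ord_id  INTEGER NOT NULL, order_sum     NUMERIC(12,2));
CREATE TABLE attr_customer_name (cust_id INTEGER NOT NULL, customer_name VARCHAR(100));

-- Tie: the «placed» relationship between Customer and Order
CREATE TABLE tie_order_placed_customer (ord_id INTEGER NOT NULL, cust_id INTEGER NOT NULL);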
Ties and attributes can store historical values using the principles of slowly changing dimensions type 2 (SCD 2) [4]. A distinctive characteristic of Anchor modeling is that historicity is provided by a single date-time field (the so-called FROM DATE), not by two fields (FROM DATE – TO DATE) as in other methodologies. If historicity is provided by a single field (FROM DATE), then the second border of the interval (TO DATE) can be estimated as the FROM DATE of the next sequential value, or NULL if the next value is absent.
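With a single FROM DATE field, the closing border can be derived at query time with a standard window function. A minimal sketch, assuming a historized variant of the Sum attribute named attr_order_sum_hist with a from_date column (both names are illustrative):

-- TO_DATE is derived as the FROM_DATE of the next version of the same
-- attribute instance, or NULL when no newer version exists.
SELECT ord_id,
       order_sum,
       from_date,
       LEAD(from_date) OVER (PARTITION BY ord_id ORDER BY from_date) AS to_date
FROM   attr_order_sum_hist;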

2. Data distribution

Massively parallel processing (MPP) databases generally have shared-nothing scale-out architectures, so that each node holds some subset of the database and enjoys a high degree of autonomy with respect to executing parallelized parts of queries. For each workload posed to a cluster, the most efficient utilization occurs when the workload can be split equally among the nodes and each node can perform as much of its assigned work as possible without the involvement of other nodes. In short, data transfer between nodes should be kept to a minimum. Given arbitrarily modeled data, achieving maximum efficiency for every query is unfeasible due to the amount of duplication that would be needed on the nodes, while also defeating the purpose of a cluster. However, with Anchor modeling, a balance is found where most data is distributed and the right amount of data is duplicated in order to achieve high performance for most queries. The following three distribution schemes are commonly present in MPP databases, and they suitably coincide with Anchor modeling constructs.

Broadcast the data to all nodes. Broadcasted data is available locally, duplicated, on every node in the cluster. Knots (the small tables of shared, immutable values, such as statuses or categories, referenced by knotted attributes and ties) are broadcasted, since they normally take up little space and may be joined from both attributes and ties. In order to assemble an instance with knotted attributes or relationships represented through knotted ties without involving other nodes, the knot values must be available locally. Thanks to the uniqueness constraint on knot values, query conditions involving them can be resolved very quickly and locally, to reduce the row count of intermediate result sets in the execution of the query.

Segment the data according to some operation that splits it across the nodes. For example, a modulo operation on a hash of the surrogate identifier column in the anchor and the attribute tables could be used to determine on which node data should end up. Using the same operation on anchors and attributes keeps an instance of an ensemble and its history of changes together on the same node. Assembling an instance in part or in its entirety can therefore be done without the involvement of other nodes. Since anchors and attributes retain the same sort order for the surrogate identifier, no sort operations are needed when they are joined (an illustrative declaration is sketched at the end of this section).

Project the data with a specified sort order and specified segmentation. If a table has multiple columns, is segmented by one column, and is joined by another column, then this join operation will require redistribution of data across all nodes. MPP databases support such joins with the ability to create a projection: a full copy of a table, segmented and sorted by another column. There can be multiple projections of a single table, according to business needs. Ties have one projection for each role in the tie, where each segmentation is done over, and ordered by, the surrogate identifier representing the corresponding role. This ensures that all relationships that an instance takes part in can be resolved without the involvement of other nodes, and that joining a tie will not require any sort operations. Therefore, a tie will require no more than one additional projection.

Usage of a single date-time field for historicity (SCD 2) gives a significant benefit: it does not require updates. New values can be added exclusively by inserts. This feature is very important for column-based databases, which are designed for fast select and insert operations but are extremely slow for delete and update operations. Single date-time field historicity also requires efficient analytical queries to estimate the closing date for each value. The distribution strategy described above guarantees that all values for a single anchor are stored on a single node and properly sorted, so closing date estimation can be extremely efficient.
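In Vertica-style projection DDL, identical segmentation and sort order of an anchor and one of its attributes might be declared roughly as follows; this is a sketch for the sample model of Fig. 1, and the exact projection options depend on the DBMS and the cluster, so the clauses should be read as assumptions:

-- Both projections are ordered and segmented by the same surrogate key, so the
-- anchor and the attribute rows of one Order instance land on the same node
-- and can be merge-joined locally, without sorting or resegmentation.
CREATE PROJECTION anchor_order_p AS
SELECT ord_id
FROM   anchor_order
ORDER BY ord_id
SEGMENTED BY HASH(ord_id) ALL NODES;

CREATE PROJECTION attr_order_sum_p AS
SELECT ord_id, order_sum
FROM   attr_order_sum
ORDER BY ord_id
SEGMENTED BY HASH(ord_id) ALL NODES;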
3. Analytical queries

Databases can be utilized to perform two types of tasks: OLTP tasks or OLAP tasks. OLTP means online transaction processing: a huge number of small insert, update, delete and select operations, where each operation processes a small chunk of data. OLTP tasks are typical for operational databases. OLAP means online analytical processing: a relatively small number of complex analytical select queries, where each query processes a significant part of the data (or the whole data set). This article is focused on OLAP queries over Big Data.

Analytical queries may include three types of subtasks:

1. Combine the dataset according to the query. Big Data is stored as tables, which have to be combined (joined) according to given conditions.

2. Filter the dataset according to the query. A query can contain conditions on the data: single-column conditions, column comparisons and sampling conditions (for example, «top 10 percent of user accounts according to the total number of payments of each user account»).

3. Calculate aggregates over filtered datasets.

4. Efficient plans for analytical queries

SQL can describe almost all types of analytical queries. This is one of the main advantages of SQL and relational databases: if an analytical query can be formulated using SQL, then the relational database can process it and return the correct result. The main problem lies in the optimization of the query execution time.
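For concreteness, an ad-hoc query of this kind over the sample model of Fig. 1, combining, filtering and aggregating in a single SQL statement, might look as follows (a sketch only; the names and the threshold follow the earlier illustrative DDL, not the Avito schema):

-- 1) combine:   join orders to customers through the tie;
-- 2) filter:    keep only orders whose Sum exceeds a threshold;
-- 3) aggregate: count and total of the qualifying orders per customer name.
SELECT n.customer_name,
       COUNT(*)         AS orders_cnt,
       SUM(s.order_sum) AS total_sum
FROM   tie_order_placed_customer t
JOIN   attr_order_sum     s ON s.ord_id  = t.ord_id
JOIN   attr_customer_name n ON n.cust_id = t.cust_id
WHERE  s.order_sum > 100
GROUP  BY n.customer_name;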


In this paper, queries are analyzed from a Big Data analyst's point of view. Analysts want to get their answers as fast as possible. A query with a suboptimal execution plan can be processed tens of times slower than an optimal one. If a query processes Big Data, an inefficient plan can run for hours or days, while an efficient one can return the result in minutes.

This section contains the main principles of optimization of analytical queries over Big Data in an MPP environment, obtained by Avito. The growth of the Avito data warehouse enabled its analysts to study various approaches to query optimization and choose the most efficient ones, able to speed up queries over Big Data in an MPP environment from hours to minutes. This research was launched at the initial phase of DWH development, because there were concerns about the speed of OLAP queries against an Anchor Model. In further phases, those results were applied to the analysis of both normalized data (Anchor modeling) and denormalized data (data marts) with equal effectiveness.

Here is a list of the main risks of a query execution plan in an MPP environment:

Join spill. According to Section 3, analytical tasks can require joins. Joins can be performed via Hash Join, Nested Loop Join and Sort-Merge Join algorithms. Nested loop join is inefficient for OLAP queries (it is best for OLTP) [3]. The sort phase of the sort-merge join algorithm can be inefficient in an MPP environment, so if the source data is not sorted, the query optimizer prefers the hash join algorithm. Hash join is based on storing the left part of the join in RAM. If there is not enough RAM, the query will face a join spill. A join spill (the term can differ between DBMSs) means that some part of the query execution plan requires too much RAM, so the source data have to be separated into N chunks (small enough to fit into the available RAM), processed consecutively, and combined together at the end. A join spill reduces maximum RAM utilization, but increases disk I/O, a slower resource. The figure below illustrates why a condition reducing a table on one side of the join may not reduce the number of disk operations for the table on the other side of the join. Therefore, disk I/O operations can be estimated according to optimistic (concentrated keys) or pessimistic (spread out keys) scenarios. In the optimistic one, the number of operations is the same as in a RAM join, whereas in the pessimistic one the whole table may need to be scanned for each chunk, so disk I/O can increase N times (a rough estimate of this penalty is sketched after this list of risks). On modern servers (>100 Gb of RAM per node), this risk is actual when the left part of the join contains over one billion rows. Hash join for tables of millions of rows is safe.

Fig. 2. Three cases of inner join (Case 1, Case 2, Case 3)

The figure above demonstrates three cases of inner join. First case: the smaller table contains two disk pages of keys, and the corresponding keys in the bigger table occupy three disk pages. Second case: the smaller table is reduced twice, to one disk page, and the corresponding keys in the bigger table are concentrated, so to read them one needs to scan only two disk pages instead of three. Third case: the smaller table is reduced twice, but the corresponding keys are not concentrated but spread out (see the gaps), so one needs to read the same number of disk pages as before (three).

Group by spill. The group by phase of the query can also be executed by a Hash Group algorithm, and it can also be spilled to disk, like a join phase. This risk is actual if there are more than hundreds of millions of groups in a group by phase.

Resegmentation/broadcast. If two tables are joined together, and they are segmented over the nodes of the MPP cluster using different keys, then the join may be performed either by resegmentation of one table across all nodes, or by broadcasting all data of one table across all nodes (see Section 2). This risk is actual if the tables in question contain billions of rows.

Wrong materialization. The materialization strategy is extremely important for column-based databases [8]. The materialization strategy defines the process of combining column data together (in a column-based database the columns are stored separately), and it is extremely important for join queries. There can be early materialization (read the columns, combine the columns, then apply the filter) or late materialization (read the filtered column, apply the filter, read the other columns according to the filtered rows, combine). According to [1], it is more efficient to select a late materialization strategy. But if a query is complex, the query optimizer can make a mistake. If a query requires a join of tables A, B and C, and has filtering conditions on tables A and B, late materialization may lead to loading the filtered values from A and B in parallel, all values from C, and joining them together. This strategy is efficient if the conditions on tables A and B reduce the data a lot, while the joins with tables B and C do not reduce the data. Otherwise, if each condition keeps only a fraction of A and B, and the conditions are uncorrelated, then joining the filtered tables A and B will result in a small part of the data, so early materialization can significantly reduce the disk I/O for tables B and C.
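The join spill penalty described above can be summarized by a rough estimate (a formalization of the two scenarios, with notation introduced here rather than taken from the paper): if the table on the probe side of the join occupies P disk pages and the spilled side has to be split into N chunks, then

\[ \mathrm{IO}_{\text{optimistic}} \approx P, \qquad \mathrm{IO}_{\text{pessimistic}} \approx N \cdot P, \]

i.e. with concentrated keys every page is read once, as in a pure in-memory join, while with spread-out keys the whole table may be rescanned for each of the N chunks.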

The risks listed above can be avoided by using modified query execution plans, based on the following principles:

Maximum merge join utilization. The Sort-Merge Join algorithm is the most effective one for non-selective analytical queries among such algorithms as Hash Join, Nested Loop Join and Sort-Merge Join, especially if the analyzed data are already sorted on the join key, so that only the merge join operation is required, without sort operations [3].

Single-time disk page access. A query plan with a join spill has awful characteristics in terms of the pessimistic number of disk page access operations, because many disk pages are accessed multiple times during plan execution. To create a plan with optimal characteristics, each required disk page must be accessed no more than once.

Massive temporary table utilization. The two principles mentioned above cannot be guaranteed for all ad-hoc queries to real-world Big Data databases created according to 1st/2nd/3rd normal form. A table of a dozen columns cannot be sorted on each column (without duplicating this table a dozen times). Also, it cannot be guaranteed that all keys in joined tables are concentrated, to equalize the pessimistic and optimistic disk page operation counts (see Fig. 2). But these principles can be guaranteed in an Anchor Modeling data model, if each data subset retrieved after each step of filtering is stored as a temporary table. Temporary tables can be sorted on a specific column, to optimize a further merge join. Also, those temporary tables contain concentrated keys: only the keys required for the given ad-hoc query. This feature helps to avoid double disk page access. Concerning the speed of creating a temporary table for a large data set: modern HDDs provide equal speed for sequential read and sequential write. Importantly, the given algorithm does not require the creation of indexes of any kind, only the storing of sorted rows as a temporary table, so this operation can be as fast as saving a data file on an HDD. Sorting is enough for an efficient Merge Join.

Efficient plans for analytical queries, utilizing the principles listed above, can be created using the following algorithm:

0. Assume that the query is based on joining tables T1, ..., Tn, filtered according to complex conditions. Let Pij denote a condition on table j that can be estimated only when i tables are already loaded; so P0j is a single-table filtering condition, and Cj is the combination of all conditions on table j.

1. Estimate the data reduction rate for each table Tj as the share of its rows that satisfy its single-table condition, i.e. R(P0j(Tj)) / R(Tj), where R(Tj) means the number of rows (an estimation query of this kind is sketched after the algorithm). Choose the table with the best reduction rate to be the first one (assume that it is table T1).

2. Take the k1 tables identically segmented with table T1 (assume that those tables are T2, ..., Tk1). Join all k1 tables, using all applicable filtering conditions, using the merge join algorithm, consecutively, starting from table T1. So table T1 has to be scanned and filtered first, then only the corresponding rows from T2 have to be scanned and filtered according to conditions P02 and P12, then the same for tables T3, ..., Tk1.

3. Estimate the data reduction rate for the remaining tables Tk1+1, ..., Tn, using the tables joined and filtered at step 2. Choose the table with the best reduction rate to be the next one (assume that it is table Tk1+1).

4. Save the results of joining and filtering the tables from step 2 inside a temporary table TMP1, identically segmented with table Tk1+1.

5. Replace the tables T1, ..., Tk1 of the initial list with the table TMP1.

6. Return to step 2 with the reduced list of tables.

7. If the list of tables contains a single table, perform the final aggregations and save the results.
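As an illustration of step 1, the reduction rate of a candidate table can be estimated with a trivial query of the following shape (the table, column and threshold are taken from the illustrative model above, not from the original experiments; in practice such estimates can be reused from previously executed queries, as noted below):

-- Share of rows of the Sum attribute that survive its single-table condition
SELECT CAST(SUM(CASE WHEN order_sum > 100 THEN 1 ELSE 0 END) AS FLOAT)
       / COUNT(*) AS reduction_rate
FROM   attr_order_sum;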

The given algorithm has important drawbacks:

1. It requires a lot of statistics about the source tables and about the data reduction rate of the conditions for step 1. In some cases the first estimation of the statistics can require more effort than the analytical query itself.

2. Statistics estimation, as well as the creation of temporary tables with dedicated segmentation, has to be implemented manually, because current versions of query optimizers and query processing engines cannot do it. Avito used a Python implementation.

Drawback 1 can be avoided by continuously reusing estimates from previously calculated queries. Only the first processing of a brand new query can be time-consuming.

Drawback 2 can be avoided for identically segmented tables, because the algorithm then ends at step 2. That is why an Anchor Modeling data model for an MPP environment is especially favorable for the approach. Step 2 does not require the implementation of additional logic; it can be performed using SQL, the database query execution engine and some hints, roughly as sketched below.


5. Business applications

The approach described in Section 4 has been implemented for ad-hoc querying in a Big Data (data warehouse) solution at Avito. Based in Moscow, Avito is Russia's fastest growing e-commerce site and portal, «Russia's Craigslist». The Avito data warehouse is implemented according to the Anchor modeling methodology. It contains 200 Anchors, 600 Attributes, and 300 Ties. Using this model it loads, maintains, and presents many types of data. The greatest volume can be found in the part loaded from the click stream data sources. Click stream data are loaded every 15 minutes, with each 15-minute batch containing 5 mln. rows at the beginning of 2014 and 15 mln. at the beginning of 2015. Each loaded row contained approximately 30 columns at the beginning of 2014 and approximately 70 columns at the beginning of 2015. Each column is a source for a separate Anchor, Attribute or Tie table. The current size of the Avito data warehouse has been limited to 51 Tb for licensing reasons. It contains years of consistent historical data from various sources (back office, Google DFP/AdSense, MDM, accounting software, various aggregates), and a rolling half year of detailed data from click stream.

The presented approach for ad-hoc queries was successfully implemented by the Avito BI team in less than a year. It provides analysts at Avito with a tool for fast (5-10-20-60 minutes) analysis of near-real time Big Data for various types of tasks, such as direct marketing, pricing, CTR prediction, illicit content detection, and customer segmentation.

Conclusions

A high degree of normalization has the advantage of flexibility. The resulting data models can be extended easily and in a lossless manner. Anchor modeling also enables parallel loading of all entities, their attributes, and links, with or without historization of each link and attribute. The supposed drawback of a highly normalized data model is slow ad-hoc reporting, which is why Inmon [4] recommends combining normalized models for centralized repositories with denormalized data marts. This paper, however, demonstrates that while Anchor modeling may require some sort of denormalization for single-entity analysis, it can yield very high performance for queries that involve multiple linked entities, and particularly so when there is a risk of join spill (lack of RAM).

The experiments carried out showed the benefits of the given approach for the simplified case of two linked entities. Real-world business cases sometimes require three, four or even a dozen linked entities in the same ad-hoc query. Such cases multiply the risk of a join spill occurring at some step and amplify its negative effect. As the number of joins, tables and filtering conditions increases, the accuracy of query RAM requirement estimation decreases. The query optimizer can significantly overestimate RAM requirements and cause an unnecessary join spill with all its associated drawbacks. The approach from Section 4 has been in use at Avito for over a year, for ad-hoc and regular reporting, and even for some near-real time KPIs. It has demonstrated stability in terms of execution time and resource consumption, while ordinary queries often degraded after months of usage because of data growth.

One can wonder whether the approach from Section 4 can be applied to implementations based on techniques other than Anchor modeling. The answer is yes, it can be applied, but with additional work. The strictly determined structure of Anchor modeling enables Avito to have a fully automated and metadata-driven implementation of the given approach. To conclude, the approach, while not being uniquely tied to Anchor modeling, suits such implementations very well and compensates for the commonly believed drawback of normalization in the perspective of ad-hoc querying.

References

1. Abadi D.J., Myers D.S., DeWitt D.J., Madden S.R. (2007) Materialization strategies in a column-oriented DBMS. Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE 2007), April 15-20, 2007, Istanbul, Turkey, pp. 466–475.
2. Chen M., Mao S., Liu Y. (2014) Big Data: A survey. Mobile Networks and Applications, no. 19, pp. 171–209.
3. Graefe G. (1999) The value of merge-join and hash-join in SQL Server. Proceedings of the 25th International Conference on Very Large Data Bases (VLDB 1999), September 7-10, 1999, Edinburgh, Scotland, pp. 250–253.
4. Inmon W.H. (1992) Building the data warehouse. Wellesley, MA: QED Information Sciences.
5. Kalavri V., Vlassov V. (2013) MapReduce: Limitations, optimizations and open issues. Proceedings of the 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom 2013), July 16-18, 2013, Melbourne, Australia, pp. 1031–1038.


6. Knuth D. (1998) The art of computer programming. Volume 3: Sorting and searching. 2nd edition. Boston, MA: Addison-Wesley.
7. Lamb A., Fuller M., Varadarajan R., Tran N., Vandiver B., Doshi L., Bear C. (2012) The Vertica analytic database: C-store 7 years later. Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 1790–1801.
8. Shrinivas L., Bodagala S., Varadarajan R., Cary A., Bharathan V., Bear C. (2013) Materialization strategies in the Vertica analytic database: Lessons learned. Proceedings of the 29th IEEE International Conference on Data Engineering (ICDE 2013), April 8-11, 2013, Brisbane, Australia, pp. 1196–1207.
9. Ronnback L., Regardt O., Bergholtz M., Johannesson P., Wohed P. (2010) Editorial: Anchor modeling – agile information modeling in evolving data environments. Data and Knowledge Engineering, vol. 69, no. 12, pp. 1229–1253.
