MonetDB: the Challenges of a Scientific Database

Total pages: 16

File type: PDF, size: 1020 KB

MonetDB: the Challenges of a Scientific Database
Milena Ivanova, Niels Nes, Romulo Goncalves, Martin Kersten
CWI, Amsterdam

SkyServer schema:
• 446 columns, >585 million rows
• 6 columns, >20 billion rows
M. Ivanova et al., CWI

Outline
• MonetDB/SQL
• SkyServer porting lessons
• Query log lessons
• Recycling
• Evaluation
• Outlook

MonetDB Background
PhotoObjAll is stored column-wise: each column becomes a BAT (Binary Association Table) of (head oid, tail value) pairs:

    PhotoObjAll(Ra, Dec, U, ...)       Ra BAT          Dec BAT         U BAT
    0.0645  1.2079  14.70872  ...      0@0  0.0645     0@0  1.2079     0@0  14.70872
    0.1433  1.0662  11.71277  ...      1@0  0.1433     1@0  1.0662     1@0  11.71277
    0.2811  1.2495  12.02889  ...      2@0  0.2811     2@0  1.2495     2@0  12.02889

MonetDB Architecture
A SQL (or XQuery) query is compiled into MAL, passed through the tactical optimizer, and executed by the MonetDB kernel. For example, select count(*) from photoobjall; becomes:

    function user.s3_1():void;
        X1:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",0);
        X6:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",1);
        X9:bat[:oid,:lng] := sql.bind("sys","photoobjall","objid",2);
        X13:bat[:oid,:oid] := sql.bind_dbat("sys","photoobjall",1);
        X8 := algebra.kunion(X1,X6);
        X11 := algebra.kdifference(X8,X9);
        X12 := algebra.kunion(X11,X9);
        X14 := bat.reverse(X13);
        X15 := algebra.kdifference(X12,X14);
        X16 := calc.oid(0@0);
        X18 := algebra.markT(X15,X16);
        X19 := bat.reverse(X18);
        X20 := aggr.count(X19);
        sql.exportValue(1,"sys.","count_","int",32,0,6,X20,"");
    end s3_1;

SkyServer with MonetDB
Goal: provide a SkyServer mirror with similar functionality using MonetDB.
Three phases: 1%, 10%, and the entire SDSS data set.
Can we
• do better in terms of performance and functionality?
• improve query processing by novel parallelism and query cracking techniques?
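The BAT decomposition shown in the background slide can be illustrated with a small sketch (plain Python, not MonetDB code; the table and column names are taken from the example above):

```python
# Sketch: vertical fragmentation of a table into BATs.
# A BAT is a binary table of (head oid, tail value) pairs.

rows = [
    (0.0645, 1.2079, 14.70872),
    (0.1433, 1.0662, 11.71277),
    (0.2811, 1.2495, 12.02889),
]

# Decompose PhotoObjAll(ra, dec, u) into one BAT per column;
# the oid (row position) is the glue between the fragments.
bats = {
    name: {oid: row[i] for oid, row in enumerate(rows)}
    for i, name in enumerate(("ra", "dec", "u"))
}

# Reconstructing tuple 1 is a positional lookup across the BATs.
tuple1 = tuple(bats[c][1] for c in ("ra", "dec", "u"))
print(tuple1)  # (0.1433, 1.0662, 11.71277)
```

A column-wise scan, e.g. for count(*), only ever touches one BAT, which is why the MAL plan above binds single columns rather than whole tuples.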
Portability Lessons
• Need for a rich SQL environment (PSM)
• Cast to the SQL:2003 standard
  – replacement of data types and operations
  – vendor-specific extensions ignored or replaced
• Avoid data redundancy
  – auxiliary tables replaced by views: 10% size reduction

Spatial Search Lesson
• HTM (Hierarchical Triangular Mesh)
  – implemented in C++ and C#
  – good for point-near-point and point-in-region queries
• Zones
  – implemented in SQL
  – good for point-near-point (3x faster)
  – efficient for batch-oriented spatial joins (32x faster)
  – enables use of the SQL optimizer

Query Log Lessons
• Query logs are important for both the application and science
• Analysed 1.2M queries from August 2006
• Spatial access prevails (83%)
• A small core of photo and spectro tables is accessed
  – 64% photo, 44% spectro, 27% both

Common Patterns
• Limited number of query patterns, correlated to the web site interface
• Most popular query (25%):

    SELECT TOP 10 p.objID, p.run, p.rerun, p.camcol, p.field, p.obj,
           p.type, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z,
           p.Err_u, p.Err_g, p.Err_r, p.Err_i, p.Err_z
    FROM fGetNearbyObjEq(195,2.5,3) n, PhotoPrimary p
    WHERE n.objID = p.objID;

Spatial Overlap
• 24% of queries overlap
• Mean sequence length of 9.4, maximum of 6200
• Overlap and equality patterns for script-based interaction
• Zoom-in/zoom-out patterns for manual interaction

Evaluation on 100GB
• 'Color-cut' for low-z quasars:

    SELECT g, run, rerun, camcol, field, objID
    FROM Galaxy
    WHERE (g <= 22)
      and (u - g >= -0.27) and (u - g < 0.71)
      and (g - r >= -0.24) and (g - r < 0.35)
      and (r - i >= -0.27) and (r - i < 0.57)
      and (i - z >= -0.35) and (i - z < 0.7);

• Moving asteroids:

    SELECT objID,
           sqrt(power(rowv,2) + power(colv,2)) as velocity
    FROM PhotoObj
    WHERE power(rowv,2) + power(colv,2) > 50
      and rowv >= 0 and colv >= 0;
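The zone idea from the spatial-search lesson can be sketched in a few lines: partition the sky into horizontal declination stripes, index objects by zone, and answer point-near-point queries by scanning only the zones the search circle can touch. This is an illustrative Python sketch with a flat-sky distance approximation; the zone height and all names are ours, not SkyServer's:

```python
import math
from collections import defaultdict

ZONE_HEIGHT = 0.5  # zone height in degrees; illustrative choice

def build_zone_index(objects):
    """Bucket (id, ra, dec) objects by zone number = floor(dec / height)."""
    index = defaultdict(list)
    for obj_id, ra, dec in objects:
        index[int(math.floor(dec / ZONE_HEIGHT))].append((obj_id, ra, dec))
    return index

def near(index, ra, dec, radius):
    """Point-near-point: scan only zones the search circle can overlap,
    then filter candidates with a (flat-sky) distance test."""
    z_lo = int(math.floor((dec - radius) / ZONE_HEIGHT))
    z_hi = int(math.floor((dec + radius) / ZONE_HEIGHT))
    hits = []
    for z in range(z_lo, z_hi + 1):
        for obj_id, r, d in index.get(z, ()):
            if (r - ra) ** 2 + (d - dec) ** 2 <= radius ** 2:
                hits.append(obj_id)
    return hits

objs = [(1, 195.0, 2.5), (2, 195.2, 2.6), (3, 10.0, -40.0)]
idx = build_zone_index(objs)
print(sorted(near(idx, 195.0, 2.5, 0.3)))  # [1, 2]
```

Because both the index and the search are expressible as plain predicates and joins, the same logic can live entirely in SQL, which is what lets the relational optimizer take over, unlike the external HTM library.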
Staircase to the Sky
• Status (November 2008)
  – 1GB: done
  – 100GB: done
  – 2.7TB DR6
• Platform: dual quad-core 2.4GHz, 64GB RAM, 6TB RAID-5
• Web site

Moving Ahead
• Progress in 2009
  – download and installation of DR7
  – development of export/attach functionality
  – development of the partial result recycler

MonetDB Background
• Tuple-at-a-time pipelined execution
  – materialized views and caches
  – semi-automatic
• Operator-at-a-time execution
  – materialized intermediates
  – automatic management and low cost
• Idea: a self-organizing cache of intermediates to speed up query streams

MonetDB Architecture with Recycling
The Recycler is added as an extra optimizer stage between the SQL/XQuery compilers and the MonetDB kernel, with run-time support for admission and eviction. A MAL plan such as

    function user.s1_2(A0:date, ...):void;
        X5 := sql.bind("sys","lineitem",...);
        X10 := algebra.select(X5,A0);
        X12 := sql.bindIdx("sys","lineitem",...);
        X15 := algebra.join(X10,X12);
        X25 := mtime.addmonths(A1,A2);
        ...

passes through the Recycler optimizer, and its instructions become candidates for caching and reuse at run time.
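The contrast between the two execution models above is what makes recycling cheap in MonetDB: operator-at-a-time execution already materializes every intermediate, so caching one costs little extra. A toy illustration (plain Python, not MonetDB internals):

```python
# Illustrative contrast: tuple-at-a-time pipelining versus
# MonetDB-style operator-at-a-time with materialized intermediates.

data = [3, 8, 1, 6, 9, 2]

# Tuple-at-a-time: each tuple flows through the whole pipeline;
# the intermediate result never exists as a whole object.
pipelined = [x * 2 for x in data if x > 2]   # select + project, fused

# Operator-at-a-time: each operator consumes and produces complete columns.
selected = [x for x in data if x > 2]        # materialized intermediate
projected = [x * 2 for x in selected]        # next operator, full column in

assert pipelined == projected                # same answer either way
print(selected)  # [3, 8, 6, 9]  <- this whole column is available to recycle
```

In the pipelined model an intermediate like `selected` would have to be created specially (a materialized view); in the operator-at-a-time model it is a by-product that the recycler can simply keep.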
Recycle Pool
The recycle pool is maintained inside the MonetDB server; reuse is driven by run-time instruction matching.

Instruction Matching
Run-time comparison of
• instruction types
• argument values

Exact matching example:

    X1 := sql.bind("sys","orders","o_orderdate",0);
    Y3 := sql.bind("sys","orders","o_orderdate",0);   -- exact match: reuse X1

    Name  Value     Data type         Size
    X1    10        :bat[:oid,:date]
    T1    "sys"     :str
    T2    "orders"  :str

Instruction Subsumption
A cached selection over a wider range subsumes a narrower one:

    X3 := algebra.select(X1,10,80);
    Y3 := algebra.select(X1,20,45);   -- subsumed by X3
    X5 := algebra.select(X1,20,60);   -- subsumed by X3

    Name  Value  Data type        Size
    X1    10     :bat[:oid,:int]  2000
    X3    130    :bat[:oid,:int]   700
    X5    150    :bat[:oid,:int]   350

A Cache with Lineage
The pool stores each intermediate together with its lineage, i.e. the chain of instructions that produced it (e.g. X3 := algebra.select(X1) with X1 := sql.bind("C1")). A later query (Q2) is matched against this lineage graph: instructions whose entire lineage matches are answered from the pool, while instructions with mismatching lineage (e.g. a select over a different base column, Y2 != X2, Y3 != X3) must be recomputed.

Admission Policies
Decide about storing the results:
• KEEPALL – all instructions advised by the optimizer
• CREDIT – instructions are supplied with credits
  – storage is 'paid' with 1 credit
  – reuse returns credits
  – lack of reuse limits admission and resource claims

Cache Policies
• Decide about eviction of intermediates
• Filter 'top' instructions without dependents
• Pick the instructions with the smallest utility
  – LRU: time of computation or last reuse
  – BENEFIT: estimated contribution to performance (CPU and I/O costs, recycling)
• Triggered by resource limitations (memory or number of entries)

SkyServer Evaluation
• 100GB subset of DR4
• 100-query batch from the January 2008 log
• 1.5GB of intermediates, 99% reuse
• Join intermediates are the major contributor to the savings

Status (August 2009)
• DR7 fully loaded
  – loading and integrity checking
  – queries ran and traces collected

    Query   Time            Rows
    01      1m43.142s             5
    02      0m10.836s           310
    03      9m55.870s       7805794
    04      3m46.905s       2088794
    05      3m54.591s        264954
    06      7m11.867s        584884
    07      0m0.501s           1148
    08      1m43.935s         58599
    09      26m23.968s        33086
    11      13m25.067s           11
    12      0m0.914s             44
    13      0m0.893s              4
    14      5m18.018s           207

Query Traces
An excerpt from a collected trace (elapsed microseconds, MAL instruction, intermediate sizes in brackets):

    [ 21472701 usec @0 _55<tmp_37313>[357175411] := algebra.uselect(_53=<tmp_37322>[585634220],1); ]
    [       49 usec @0 _53 := nil:BAT; ]
    [ 38150716 usec @0 _56<tmp_37322>[357175411] := algebra.semijoin(_45=<tmp_37316>[585634220],_55=<tmp_37313>[357175411]); ]
    [       38 usec @0 _45 := nil:BAT; ]
    [   284808 usec @0 _55 := nil:BAT; ]
    [ 31309800 usec @0 _57<tmp_37313>[180377011] := algebra.uselect(_56=<tmp_37322>[357175411],6:sht); ]
    [  2059575 usec @0 _56 := nil:BAT; ]
    [ 32340510 usec @0 _58<tmp_37322>[43684077] := algebra.semijoin(_36=<tmp_37330>[132480668],_57=<tmp_37313>[180377011]); ]
    [    91705 usec @0 _36 := nil:BAT; ]
    [   229621 usec @0 _59<tmp_37264>[585634220] := batcalc.flt(_26=<tmp_37332>[585634220]); ]
    [ 25123250 usec @0 _60<tmp_37316>[409973371] := algebra.thetauselect(_59=<tmp_37264>[585634220],A1=22.2999992,">"); ]
    [  1208014 usec @0 _59 := nil:BAT; ]
    [ 30692604 usec @0 _61<tmp_37264>[121305794] := algebra.semijoin(_60=<tmp_37316>[409973371],_57=<tmp_37313>[180377011]); ]
    [   295106 usec @0 _60 := nil:BAT; ]
    [  4440129 usec @0 _57 := nil:BAT; ]
    [ 19408689 usec @0 _62<tmp_37313>[132511910] := algebra.kunion(_58=<tmp_37322>[43684077],_61=<tmp_37264>[121305794]); ]
    [   105538 usec @0 _58 := nil:BAT; ]
    [   320537 usec @0 _61 := nil:BAT; ]
    [       49 usec @0 _64<tmp_37264>[132511910] := algebra.markT(_62=<tmp_37313>[132511910],0@0); ]
    [        6 usec @0 _62 := nil:BAT; ]
    [        5 usec @0 _65<tmpr_37264>[132511910] := bat.reverse(_64=<tmp_37264>[132511910]); ]
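The exact-matching and subsumption logic behind instruction matching can be sketched compactly. This is an illustrative Python model, not the MonetDB implementation; the cache key layout and the KEEPALL-style admission are our simplifications:

```python
# Minimal recycler sketch: cache range-select results and reuse them
# by exact instruction match or by subsumption from a wider range.

cache = {}  # (op, column_id, lo, hi) -> materialized result

def select(column, col_id, lo, hi):
    key = ("select", col_id, lo, hi)
    if key in cache:                      # exact instruction match
        return cache[key]
    # Subsumption: a cached select over a wider [clo, chi] covers this
    # one, so we can re-select from the (smaller) cached intermediate.
    for (op, cid, clo, chi), res in cache.items():
        if op == "select" and cid == col_id and clo <= lo and hi <= chi:
            out = [v for v in res if lo <= v <= hi]
            break
    else:
        out = [v for v in column if lo <= v <= hi]  # compute from base
    cache[key] = out                      # admission (KEEPALL-style)
    return out

col = [10, 25, 42, 55, 70, 90]
wide = select(col, "c1", 20, 80)          # computed from the base column
narrow = select(col, "c1", 20, 60)        # answered by subsumption
print(wide, narrow)                       # [25, 42, 55, 70] [25, 42, 55]
```

A real recycler additionally tracks lineage (so two selects over different base columns never match), credits for admission, and utility for eviction, as described in the policy slides above.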
Trace profile (total usec, calls, operator):

    2264769425   32  algebra.join
    1535721272  285  algebra.leftjoin
     636128509   76  algebra.semijoin
     306371507   59  algebra.uselect
     112269101   10  batcalc.*
     106087174   36  batcalc.-
     101124840   28  algebra.thetauselect
      54344621    2  user.getnearbyobjectsmode
      19421076  391  algebra.kunion
       2577559   25  batcalc.flt
       1242179    8  batcalc.int
       1105366    9  batcalc.+
        657450   30  bat.append
        402815  570  sql.bind

Summary
• Database architecture augmented with the recycling of intermediates
• A self-organizing technique
• An extension to MonetDB that transforms materialization overhead into a benefit

Future Work
• Refining cache policies
• Opportunities from query class recognition
• Automatic switch to suitable policies
• Automatic database replication
• Distributed processing (Octopus)

Recycling is green: M. Ivanova, M. L. Kersten, N. Nes, R. Goncalves, "An Architecture for Recycling Intermediates", SIGMOD'09, Providence, RI, 30/06/2009.

Inspirations
• Self-organization vs. hard-coded zoning
  – adaptive segmentation (ICDE'08)
  – adaptive replication (EDBT'08)
• Results caching and reuse
• Workload-driven optimization
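Per-operator totals like the trace profile above can be derived mechanically from raw query traces. A sketch, assuming a simplified trace format modeled on the excerpts shown (the parsing regex and sample lines are ours):

```python
import re
from collections import defaultdict

# Sum elapsed microseconds and call counts per MAL operator from
# trace lines of the form "[ <usec> usec @0 _N := module.op(...); ]".
trace = [
    "[ 21472701 usec @0 _55 := algebra.uselect(_53,1); ]",
    "[ 38150716 usec @0 _56 := algebra.semijoin(_45,_55); ]",
    "[ 31309800 usec @0 _57 := algebra.uselect(_56,6); ]",
]

totals = defaultdict(lambda: [0, 0])   # operator -> [usec, calls]
for line in trace:
    m = re.match(r"\[ (\d+) usec @\d+ \w+ := ([\w.]+)\(", line)
    if m:
        usec, op = int(m.group(1)), m.group(2)
        totals[op][0] += usec
        totals[op][1] += 1

# Print the profile, most expensive operator first.
for op, (usec, calls) in sorted(totals.items(), key=lambda t: -t[1][0]):
    print(f"{usec:>12} {calls:>3} {op}")
```

Such a profile is exactly what makes join intermediates stand out as the main recycling opportunity: a few dozen calls dominate the total elapsed time.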