Dynamic Resource Management in Vectorwise on Hadoop

Total Page:16

File Type:pdf, Size:1020Kb

Dynamic Resource Management in Vectorwise on Hadoop Vrije Universiteit, Amsterdam Faculty of Sciences, Computer Science Department Cristian Mihai Bârcă, student no. 2514228 Dynamic Resource Management in Vectorwise on Hadoop Master Thesis in Parallel and Distributed Computer Systems Supervisor / First reader: Prof. Dr. Peter Boncz, Vrije Universiteit, Centrum Wiskunde & Informatica Second reader: Prof. Dr. Henri Bal, Vrije Universiteit Amsterdam, August 2014 Contents 1 Introduction 1 1.1 Background and Motivation..............................1 1.2 Related Work......................................2 1.2.1 Hadoop-enabled MPP Databases.......................4 1.2.2 Resource Management in MPP Databases..................7 1.2.3 Resource Management in Hadoop Map-Reduce...............8 1.2.4 YARN Resource Manager (Hadoop NextGen)................9 1.3 Vectorwise........................................ 11 1.3.1 Ingres-Vectorwise Architecture........................ 12 1.3.2 Distributed Vectorwise Architecture..................... 13 1.3.3 Hadoop-enabled Architecture......................... 16 1.4 Research Questions................................... 19 1.5 Goals.......................................... 20 1.6 Basic ideas....................................... 20 2 Approach 21 2.1 HDFS Local Reads: DataTransferProtocol vs Direct I/O.............. 22 2.2 Towards Data Locality................................. 24 2.3 Dynamic Resource Management........................... 26 2.4 YARN Integration................................... 28 2.5 Experimentation Platform............................... 31 3 Instrumenting HDFS Block Replication 33 3.1 Preliminaries...................................... 33 3.2 Custom HDFS Block Placement Implementation.................. 35 3.2.1 Adding (New) Metadata to Database Locations............... 36 3.2.2 Extending the Default Policy......................... 36 4 Dynamic Resource Management 42 4.1 Worker-set Selection.................................. 42 4.2 Responsibility Assignment............................... 47 4.3 Dynamic Resource Scheduling............................. 53 5 YARN integration 67 5.1 Overview: model resource requests as separate YARN applications........ 68 5.2 DbAgent–Vectorwise communication protocol: update worker-set state, increase/decrease resources................ 72 5.3 Workload monitor & resource allocation....................... 73 6 Evaluation & Results 80 6.1 Evaluation Plan..................................... 80 6.2 Results Discussion................................... 83 i List of Figures 1.1 Hadoop transition from v1 to v2........................... 10 1.2 The Ingres-Vectorwise (single-node) High-level Architecture............ 12 1.3 The Distributed Vectorwise High-level Architecture................. 14 1.4 The Vectorwise on Hadoop High-level Architecture (the session-master, or Vec- torwise Master, is involved in query execution).................... 17 2.1 The Vectorwise on Hadoop dynamic resource management approach in 7 steps.. 22 2.2 Local reads through DataNode (DataTransferProtocol)............... 23 2.3 Direct local reads (Direct I/O - open(), read() operations)............. 23 2.4 A flow network example for the min-cost problem, edges are annotated with (cost, flow / capacity).................................. 27 2.5 The flow network (bipartite graph) model used to determine the responsibility assignment........................................ 28 2.6 Out-of-band approach for Vectorwise-YARN integration............... 30 3.1 An example of two tables (R and S) loaded on 3 workers / 4 nodes, using 3-way partitioning and replication degree of 2: after the initial-load (upper image) vs. during a system failover (lower image)......................... 34 3.2 Instrumenting the HDFS block replication, in three scenarios, to maintain block (co-)locality to worker nodes.............................. 35 3.3 Baseline (default policy) vs. collocation (custom policy) – HDFS I/O throughput measured in 3 situations (1) healty state, (2) failure state (2 nodes down), (3) recovered state. Queries 1, 4, 6, and 12 on partitioned tables, 32-partitioning, 8 workers / 10 Hadoop "Rocks" nodes, 32 mpl..................... 40 4.1 Starting the Vectorwise on Hadoop worker-set through VectorwiseLauncher and DbAgent. All related concepts to this section are emphasized with a bold black color............................................ 43 4.2 The flow network (bipartite graph) model used to match partitions to worker nodes........................................... 46 4.3 Assigning responsibilities to worker nodes. All related concepts to this section are emphasized with bold black and red/blue/green colors (for single (partition) responsibilities)..................................... 48 4.4 The flow network (bipartite graph) model used to assign responsibilities to work- ers nodes......................................... 49 4.5 The flow network (bipartite graph) model used to calculate a part of the resource footprint (what nodes and table partitions we use for a query, given the worker’s available resource and data-locality). Note: fx is the unit with which we increase the flow along the path at step x of the algorithm.................. 57 ii LIST OF FIGURES iii 4.6 Selecting the (partition responsibilities and worker nodes involved in query ex- ecution. Only the first two workers from left are chosen (bold-italic black) for table scans; first one reads from two partitions (red and green), whereas the second reads just from one (blue). The third worker is considered overloaded and so, it is excluded from the I/O. Red/blue/green colors are used to express multiple (partition) responsibilities........................... 58 5.1 Timeline of different query workloads......................... 69 5.2 Out-of-band approach for Vectorwise-YARN integration. An example using 2 x Hadoop nodes that equally share 4 x YARN containers............... 70 5.3 Timeline of different query workloads (one-time release)............... 71 5.4 Database session example with one active query: the communication protocol to increase/decrease resources............................... 72 5.5 The workload monitor’s and resource allocation’s workflow for: a) successful (resource) request and b) failed (resource) request (VW – Vectorwise, MR – Map-Reduce)...................................... 74 5.6 YARN’s allocation overhead: TPC-H scale factor 100, 32 partitions, 8 "Stones" nodes, 128 mpl...................................... 78 6.1 Running with (w/) and without (w/o) short-circuit when block colloc. is enabled. TPC-H benchmark: 16-partitions for Order and Lineitem, 8 "Rocks" workers, 32 mpl.......................................... 83 6.2 Overall runtime performance: baseline vs. collocation versions, based on Table 6.3 84 6.3 Average throughput per query: baseline vs. collocation versions, related to Ta- ble 6.3........................................... 85 6.4 Overall runtime performance: baseline vs. collocation versions, based on Table 6.4 86 6.5 The worker nodes and (partition) responsibilities involved in query execution during (a) and (b). We use red color to emphasize the latter, plus the amount of threads per worker to satisfy the 32 mpl. Besides, we also show the state before the failure. The local responsibilities are represented with black color and the partial ones, from the two new nodes, with grey color.............. 87 6.6 Overall runtime performance: collocation vs. drm versions, based on Table 6.5.. 88 6.7 Overall runtime performance: Q1, Q4, Q6, Q9, Q12, Q13, Q19, and Q21, based on Table 6.5....................................... 89 6.8 Average throughput per query: collocation vs. drm versions, related to Table 6.5. 90 6.9 The worker nodes, partition responsibilities and the number of threads per node (out of max available) involved in query execution, for Table 6.6. We use red color to highlight the latter information. The metadata for partition locations (see Section 3.2) is shown on the left and the responsibility assignment on the right............................................ 92 6.10 The worker nodes, partition responsibilities and the number of threads per node (out of max available) involved in query execution, for Table 6.7’s half-set busy- ness runs. We use red color to highlight the latter information........... 93 6.11 The worker nodes, partition responsibilities and the number of threads per node (out of max available) involved in query execution, for Table 6.8’s half-set busy- ness runs. We use red color to highlight the latter information........... 94 6.12 The worker nodes, partition responsibilities and the number of threads per node (out of max available) involved in query execution, for Table 6.9’s half-set busy- ness runs. right. We use red color to highlight the latter information. With blue color we mark the remote responsibilities being assigned at runtime due to overloaded workers.................................... 96 6.13 Average CPU load (not. from [min to Max]) for drm, while running Q1 and Q13 side by side with 16, 32 and 64 Mappers and Reducers, on a full-set in left and half-set in right..................................... 96 LIST OF FIGURES iv 6.14 Average CPU load (not. from [min to Max]) for drmy, while running Q1 and Q13 side by side with 16, 32 and 64 Mappers and Reducers, on a full-set in left and half-set in right.................................. 97 List of Tables 1.1 A short summary of the related research in Hadoop-enabled MPP databases (including Vectorwise), comprising their architectural advantages and disadvan- tages.
Recommended publications
  • Voodoo - a Vector Algebra for Portable Database Performance on Modern Hardware
    Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware Holger Pirk Oscar Moll Matei Zaharia Sam Madden MIT CSAIL MIT CSAIL MIT CSAIL MIT CSAIL [email protected] [email protected] [email protected] [email protected] ABSTRACT Single Thread Branch Multithread Branch GPU Branch In-memory databases require careful tuning and many engi- Single Thread No Branch Multithread No Branch GPU No Branch neering tricks to achieve good performance. Such database performance engineering is hard: a plethora of data and 10 hardware-dependent optimization techniques form a design space that is difficult to navigate for a skilled engineer { even more so for a query compiler. To facilitate performance- 1 oriented design exploration and query plan compilation, we present Voodoo, a declarative intermediate algebra that ab- stracts the detailed architectural properties of the hard- ware, such as multi- or many-core architectures, caches and Absolute time in s 0.1 SIMD registers, without losing the ability to generate highly tuned code. Because it consists of a collection of declarative, vector-oriented operations, Voodoo is easier to reason about 1 5 10 50 100 and tune than low-level C and related hardware-focused ex- Selectivity tensions (Intrinsics, OpenCL, CUDA, etc.). This enables Figure 1: Performance of branch-free selections based on our Voodoo compiler to produce (OpenCL) code that rivals cursor arithmetics [28] (a.k.a. predication) over a branching and even outperforms the fastest state-of-the-art in memory implementation (using if statements) databases for both GPUs and CPUs. In addition, Voodoo makes it possible to express techniques as diverse as cache- conscious processing, predication and vectorization (again PeR [18], Legobase [14] and TupleWare [9], have arisen.
    [Show full text]
  • A Survey on Parallel Database Systems from a Storage Perspective: Rows Versus Columns
    A Survey on Parallel Database Systems from a Storage Perspective: Rows versus Columns Carlos Ordonez1 ? and Ladjel Bellatreche2 1 University of Houston, USA 2 LIAS/ISAE-ENSMA, France Abstract. Big data requirements have revolutionized database technol- ogy, bringing many innovative and revamped DBMSs to process transac- tional (OLTP) or demanding query workloads (cubes, exploration, pre- processing). Parallel and main memory processing have become impor- tant features to exploit new hardware and cope with data volume. With such landscape in mind, we present a survey comparing modern row and columnar DBMSs, contrasting their ability to write data (storage mecha- nisms, transaction processing, batch loading, enforcing ACID) and their ability to read data (query processing, physical operators, sequential vs parallel). We provide a unifying view of alternative storage mecha- nisms, database algorithms and query optimizations used across diverse DBMSs. We contrast the architecture and processing of a parallel DBMS with an HPC system. We cover the full spectrum of subsystems going from storage to query processing. We consider parallel processing and the impact of much larger RAM, which brings back main-memory databases. We then discuss important parallel aspects including speedup, sequential bottlenecks, data redistribution, high speed networks, main memory pro- cessing with larger RAM and fault-tolerance at query processing time. We outline an agenda for future research. 1 Introduction Parallel processing is central in big data due to large data volume and the need to process data faster. Parallel DBMSs [15, 13] and the Hadoop eco-system [30] are currently two competing technologies to analyze big data, both based on au- tomatic data-based parallelism on a shared-nothing architecture.
    [Show full text]
  • VECTORWISE Simply FAST
    VECTORWISE Simply FAST A technical whitepaper TABLE OF CONTENTS: Introduction .................................................................................................................... 1 Uniquely fast – Exploiting the CPU ................................................................................ 2 Exploiting Single Instruction, Multiple Data (SIMD) .................................................. 2 Utilizing CPU cache as execution memory ............................................................... 3 Other CPU performance features ............................................................................. 3 Leveraging industry best practices ................................................................................ 4 Optimizing large data scans ...................................................................................... 4 Column-based storage .............................................................................................. 4 VectorWise’s hybrid column store ........................................................................ 5 Positional Delta Trees (PDTs) .............................................................................. 5 Data compression ..................................................................................................... 6 VectorWise’s innovative use of data compression ............................................... 7 Storage indexes ........................................................................................................ 7 Parallel
    [Show full text]
  • LDBC Introduction and Status Update
    8th Technical User Community meeting - LDBC Peter Boncz [email protected] ldbcouncil.org LDBC Organization (non-profit) “sponsors” + non-profit members (FORTH, STI2) & personal members + Task Forces, volunteers developing benchmarks + TUC: Technical User Community (8 workshops, ~40 graph and RDF user case studies, 18 vendor presentations) What does a benchmark consist of? • Four main elements: – data & schema: defines the structure of the data – workloads: defines the set of operations to perform – performance metrics: used to measure (quantitatively) the performance of the systems – execution & auditing rules: defined to assure that the results from different executions of the benchmark are valid and comparable • Software as Open Source (GitHub) – data generator, query drivers, validation tools, ... Audience • For developers facing graph processing tasks – recognizable scenario to compare merits of different products and technologies • For vendors of graph database technology – checklist of features and performance characteristics • For researchers, both industrial and academic – challenges in multiple choke-point areas such as graph query optimization and (distributed) graph analysis SPB scope • The scenario involves a media/ publisher organization that maintains semantic metadata about its Journalistic assets (articles, photos, videos, papers, books, etc), also called Creative Works • The Semantic Publishing Benchmark simulates: – Consumption of RDF metadata (Creative Works) – Updates of RDF metadata, related to Annotations • Aims to be
    [Show full text]
  • Micro Adaptivity in Vectorwise
    Micro Adaptivity in Vectorwise Bogdan Raducanuˇ Peter Boncz Actian CWI [email protected] [email protected] ∗ Marcin Zukowski˙ Snowflake Computing marcin.zukowski@snowflakecomputing.com ABSTRACT re-arrange the shape of a query plan in reaction to its execu- Performance of query processing functions in a DBMS can tion so far, thereby enabling the system to correct mistakes be affected by many factors, including the hardware plat- in the cost model, or to e.g. adapt to changing selectivities, form, data distributions, predicate parameters, compilation join hit ratios or tuple arrival rates. Micro Adaptivity, intro- method, algorithmic variations and the interactions between duced here, is an orthogonal approach, taking adaptivity to these. Given that there are often different function imple- the micro level by making the low-level functions of a query mentations possible, there is a latent performance diversity executor more performance-robust. which represents both a threat to performance robustness Micro Adaptivity was conceived inside the Vectorwise sys- if ignored (as is usual now) and an opportunity to increase tem, a high performance analytical RDBMS developed by the performance if one would be able to use the best per- Actian Corp. [18]. The raw execution power of Vectorwise forming implementation in each situation. Micro Adaptivity, stems from its vectorized query executor, in which each op- proposed here, is a framework that keeps many alternative erator performs actions on vectors-at-a-time, rather than a function implementations (“flavors”) in a system. It uses a single tuple-at-a-time; where each vector is an array of (e.g.
    [Show full text]
  • Dbmoto® for Vectorwise™
    ® ™ DBMoto for Vectorwise Real-time Data Replication and CDC for Analytic Database Systems DBMoto® for Vectorwise™ DBMoto® is the preferred solution for heterogeneous Data Replication, Change Data Capture Powerful, Flexible, Configurable and Data Transformation requirements in an enterprise environment. DBMoto supports Change Data Capture! Actian’s™ Vectorwise™ Analytic Database System through built-in integration capabilities that DBMoto® does not require any programming provide seamless data loading and data mirroring to Vectorwise targets. Whether you need to on the source or target database platforms deploy a new Vectorwise system, or upgrade from an incumbent system, DBMoto can help in order to deploy or run its powerful data make data migration an easy and understandable task. integration features. DBMoto is the solution of choice for fast, trouble-free, easy-to-maintain data integration DBMoto provides all functionality in easy projects. If you depend on data from multiple databases, you need a data integration solution GUI and wizard-based screens: no stored that supports major relational database systems, and that works out-of-the-box. Using an procedures to develop; and no proprietary ecient visual interface, intuitive wizards and easy-to-follow guides, DBMoto helps IT sta syntax to learn. implement the toughest replication requirements quickly and easily. DBMoto is mature and approved by enterprises ranging from midsized to Fortune 1000 businesses worldwide. Databases Supported as Sources: Achieving dependable data delivery in enterprise environments requires expertise in database • IBM DB2 for i, AS400 • MySQL servers. HiT Software has been delivering relational data access products since 1994, and • IBM DB2 for z/OS (OS/390) • Sybase ASE DBMoto incorporates proven integration technology to ensure high performance yet minimally • DB2 LUW • SQL Anywhere intrusive data synchronization, using Change Data Capture for maximum eciency.
    [Show full text]
  • Monetdb: Open-Source Columnar Database Technology Beyond Textbooks
    MonetDB: Open-source Columnar Database Technology Beyond Textbooks http://www.monetdb.org/ Stefan Manegold [email protected] http://homepages.cwi.nl/~manegold/ Why? Motivation (early 1990s) ● Relational DBMSs dominate since the late 1970's / early 1980's ● IBM DB2, MS SQL Server, Oracle, Ingres, ... ● Transactional workloads (OLTP, row-wise access) ● I/O based processing ● But: ● Workloads change (early 1990s) ● Hardware changes (late 1990s) ● Data “explodes” (early 2000s) Why? Workload changes: Transactions (OLTP) vs ... Why? Workload changes: ... vs OLAP, BI, Data Mining, ... Why? Databases hit The Memory Wall . Detailed and exhaustive analysis for different workloads using 4 RDBMSs by Ailamaki, DeWitt, Hill, Wood in VLDB 1999: “DBMSs On A Modern Processor: Where Does Time Go?” . CPU is 60%-90% idle, waiting for memory: . L1 data stalls . L1 instruction stalls . L2 data stalls . TLB stalls . Branch mispredictions . Resource stalls Why? Hardware Changes: The Memory Wall Trip to memory = 1000s of instructions! Why? Hardware Changes: Memory Hierarchies +Transition Lookaside Buffer (TLB) Cache for VM address translation only 64 entries! How? Solution “We can't solve problems by using the same kind of thinking we used when we created them.” What? MonetDB ● Database kernel developed at CWI since 1993 ● Research prototype turned into open-source product ● Pioneering columnar database architecture ● Complete Relational/SQL & XML/XQuery DBMS ● Focusing on in-memory processing ● Data is kept persistent on disk and can exceed memory limits ● Aiming at OLAP, BI, data mining & scientific workloads (“read-dominated”) ● Supporting ACID transactions (WAL, optimistic CC) ● Platform for database architecture research ● Used in academia (research & teaching) & commercial environments ● Back-end for various DB research projects: Multi-Media DB & IR (“Tijah”), XML/XQuery (“Pathfinder”), Data Mining (“Proximity”), Digital Forensics (“XIRAF”), GIS (“OSM”), ..
    [Show full text]
  • (Spring 2019) :: Query Execution & Processing
    ADVANCED DATABASE SYSTEMS Query Execution & Processing @Andy_Pavlo // 15-721 // Spring 2019 Lecture #15 CMU 15-721 (Spring 2019) 2 ARCHITECTURE OVERVIEW Networking Layer SQL Query SQL Parser Planner Binder Rewriter Optimizer / Cost Models Compiler Scheduling / Placement Concurrency Control We Are Here Execution Engine Operator Execution Indexes Storage Models Storage Manager Logging / Checkpoints CMU 15-721 (Spring 2019) 3 OPERATOR EXECUTION Query Plan Processing Application Logic Execution (UDFs) Parallel Join Algorithms Vectorized Operators Query Compilation CMU 15-721 (Spring 2019) 4 QUERY EXECUTION A query plan is comprised of operators. An operator instance is an invocation of an operator on some segment of data. A task is the execution of a sequence of one or more operator instances. CMU 15-721 (Spring 2019) 5 EXECUTION OPTIMIZATION We are now going to start discussing ways to improve the DBMS's query execution performance for data sets that fit entirely in memory. There are other bottlenecks to target when we remove the disk. CMU 15-721 (Spring 2019) 6 OPTIMIZATION GOALS Approach #1: Reduce Instruction Count → Use fewer instructions to do the same amount of work. Approach #2: Reduce Cycles per Instruction → Execute more CPU instructions in fewer cycles. → This means reducing cache misses and stalls due to memory load/stores. Approach #3: Parallelize Execution → Use multiple threads to compute each query in parallel. CMU 15-721 (Spring 2019) 7 MonetDB/X100 Analysis Processing Models Parallel Execution CMU 15-721 (Spring 2019) 8 MONETDB/X100 Low-level analysis of execution bottlenecks for in- memory DBMSs on OLAP workloads. → Show how DBMS are designed incorrectly for modern CPU architectures.
    [Show full text]
  • Vectorwise: Beyond Column Stores
    Vectorwise: Beyond Column Stores Marcin Zukowski, Actian, Amsterdam, The Netherlands Peter Boncz, CWI, Amsterdam, The Netherlands Abstract This paper tells the story of Vectorwise, a high-performance analytical database system, from multiple perspectives: its history from academic project to commercial product, the evolution of its technical architecture, customer reactions to the product and its future research and development roadmap. One take-away from this story is that the novelty in Vectorwise is much more than just column-storage: it boasts many query processing innovations in its vectorized execution model, and an adaptive mixed row/column data storage model with indexing support tailored to analytical workloads. Another one is that there is a long road from research prototype to commercial product, though database research continues to achieve a strong innovative influence on product development. 1 Introduction The history of Vectorwise goes back to 2003 when a group of researchers from CWI in Amsterdam, known for the MonetDB project [5], invented a new query processing model. This vectorized query processing approach became the foundation of the X100 project [6]. In the following years, the project served as a platform for further improvements in query processing [23, 26] and storage [24, 25]. Initial results of the project showed impressive performance improvements both in decision support workloads [6] as well as in other application areas like information retrieval [7]. Since the commercial potential of the X100 technology was apparent, CWI spun-out this project and founded Vectorwise BV as a company in 2008. Vectorwise BV decided to combine the X100 processing and storage components with the mature higher-layer database components and APIs of the Ingres DBMS; a product of Actian Corp.
    [Show full text]
  • Exploring Query Execution Strategies for JIT, Vectorization and SIMD
    Exploring Query Execution Strategies for JIT, Vectorization and SIMD Tim Gubner Peter Boncz CWI CWI [email protected] [email protected] ABSTRACT at-a-time execution of analytical queries is typically more This paper partially explores the design space for efficient than 90% of CPU work [4], hence block-at-a-time often de- query processors on future hardware that is rich in SIMD livers an order of magnitude better performance. capabilities. It departs from two well-known approaches: An alternative way to avoid interpretation overhead is (1) interpreted block-at-a-time execution (a.k.a. \vector- Just-In-Time (JIT) compilation of database query pipelines ization") and (2) \data-centric" JIT compilation, as in the into low-level instructions (e.g. LLVM intermediate code). HyPer system. We argue that in between these two de- This approach was proposed and demonstrated in HyPer [9], sign points in terms of granularity of execution and unit of and has also been adopted in Spark SQL for data science compilation, there is a whole design space to be explored, workloads. The data-centric compilation that HyPer pro- in particular when considering exploiting SIMD. We focus poses, however, leads to tight tuple-at-a-time loops, which on TPC-H Q1, providing implementation alternatives (“fla- means that so far these architectures do not exploit SIMD, vors") and benchmarking these on various architectures. In since for that typically multiple tuples need to be executed doing so, we explain in detail considerations regarding oper- on at the same time. JIT compilation is not only beneficial ating on SQL data in compact types, and the system features for analytical queries but also transactional queries, and thus that could help using as compact data as possible.
    [Show full text]
  • Stisummit-Boncz-Final Fixed.Pptx
    DB vs RDF: structure vs correlation +comments on data integration & SW Peter Boncz Senior Research Scientist @ CWI Lecturer @ Vrije Universiteit Amsterdam Architect & Co-founder MonetDB Architect & Co-founder VectorWise STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation 1 Correlations Real data is highly correlated } (gender, location) ó firstname } (profession,age) ó income But… database systems assume attribute independence } wrong assumption leads to suboptimal query plan STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation 2 Example: co-authorship query DBLP “find authors who published in VLDB, SIGMOD and Bioinformatics” SELECT author FROM publications p1, p2, p3 WHERE p1.venue = VLDB and p2.venue = SIGMOD and p3.venue = Bioinformatics and p1.author = =p2.author p2.author and and p1.author = =p3.author p3.author and and p2.author = =p3.author p3.author STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation 3 Correlations: RDBMS vs RDF Schematic structure in RDF data hidden in correlations, which are everywhere SPARQL leads to plans with } many self-joins } whose hit-ratio is correlated (with eg selections) Relational query optimizers do not have a clue è all self-joins look the same to it è random join order, bad query plans L STI Summit, July 6, 2011 Riga. DB vs RDF: structure vs correlation 4 RDF Engines to the Next Level Challenge: solving the correlation problem. Ideas? } Interleave optimization and execution } Run-time sampling to detect true selectivities } Run-time query (re-)planning } Tackling when there are long join chains } Creating partial path indexes } Graph “cracking”: index build as side-effect of query } .
    [Show full text]
  • Vectorwise Developer Guide Technical White Paper
    Vectorwise Developer Guide Technical white paper 1 Table of Contents INTRODUCTION 2 SYSTEM SIZING 3 Determine the core count ................................................. 3 How much memory do I need? ......................................... 5 Sizing disk storage ............................................................ 7 DATABASE CONFIGURATION 11 Access to data ................................................................ 11 Storage allocation ........................................................... 13 SCHEMA DESIGN 14 Indexing .......................................................................... 14 Constraints ...................................................................... 15 DATA LOADING 16 Initial data load ................................................................ 16 Incremental data load ..................................................... 19 HIGH AVAILABILITY 23 Hardware protection ....................................................... 23 Backup and restore ......................................................... 23 Standby configuration ..................................................... 26 MONITORING VECTORWISE 27 Actian Monitor ................................................................. 27 vwinfo utility .................................................................... 27 1 INTRODUCTION Welcome to the Vectorwise 2.5 Developer Guide. Vectorwise is a relational database that transparently utilizes performance features in CPUs to perform extremely fast reporting and data analysis to
    [Show full text]