Dynamic Resource Management in Vectorwise on Hadoop
Vrije Universiteit, Amsterdam
Faculty of Sciences, Computer Science Department

Cristian Mihai Bârcă, student no. 2514228

Dynamic Resource Management in Vectorwise on Hadoop

Master Thesis in Parallel and Distributed Computer Systems

Supervisor / First reader: Prof. Dr. Peter Boncz, Vrije Universiteit, Centrum Wiskunde & Informatica
Second reader: Prof. Dr. Henri Bal, Vrije Universiteit

Amsterdam, August 2014

Contents

1 Introduction .......... 1
  1.1 Background and Motivation .......... 1
  1.2 Related Work .......... 2
    1.2.1 Hadoop-enabled MPP Databases .......... 4
    1.2.2 Resource Management in MPP Databases .......... 7
    1.2.3 Resource Management in Hadoop Map-Reduce .......... 8
    1.2.4 YARN Resource Manager (Hadoop NextGen) .......... 9
  1.3 Vectorwise .......... 11
    1.3.1 Ingres-Vectorwise Architecture .......... 12
    1.3.2 Distributed Vectorwise Architecture .......... 13
    1.3.3 Hadoop-enabled Architecture .......... 16
  1.4 Research Questions .......... 19
  1.5 Goals .......... 20
  1.6 Basic Ideas .......... 20

2 Approach .......... 21
  2.1 HDFS Local Reads: DataTransferProtocol vs Direct I/O .......... 22
  2.2 Towards Data Locality .......... 24
  2.3 Dynamic Resource Management .......... 26
  2.4 YARN Integration .......... 28
  2.5 Experimentation Platform .......... 31

3 Instrumenting HDFS Block Replication .......... 33
  3.1 Preliminaries .......... 33
  3.2 Custom HDFS Block Placement Implementation .......... 35
    3.2.1 Adding (New) Metadata to Database Locations .......... 36
    3.2.2 Extending the Default Policy .......... 36

4 Dynamic Resource Management .......... 42
  4.1 Worker-set Selection .......... 42
  4.2 Responsibility Assignment .......... 47
  4.3 Dynamic Resource Scheduling .......... 53

5 YARN Integration .......... 67
  5.1 Overview: model resource requests as separate YARN applications .......... 68
  5.2 DbAgent–Vectorwise communication protocol: update worker-set state, increase/decrease resources .......... 72
  5.3 Workload monitor & resource allocation .......... 73

6 Evaluation & Results .......... 80
  6.1 Evaluation Plan .......... 80
  6.2 Results Discussion .......... 83

List of Figures

1.1 Hadoop transition from v1 to v2 .......... 10
1.2 The Ingres-Vectorwise (single-node) High-level Architecture .......... 12
1.3 The Distributed Vectorwise High-level Architecture .......... 14
1.4 The Vectorwise on Hadoop High-level Architecture (the session-master, or Vectorwise Master, is involved in query execution) .......... 17
2.1 The Vectorwise on Hadoop dynamic resource management approach in 7 steps .......... 22
2.2 Local reads through DataNode (DataTransferProtocol) .......... 23
2.3 Direct local reads (Direct I/O – open(), read() operations) .......... 23
2.4 A flow network example for the min-cost problem; edges are annotated with (cost, flow / capacity) .......... 27
2.5 The flow network (bipartite graph) model used to determine the responsibility assignment .......... 28
2.6 Out-of-band approach for Vectorwise-YARN integration .......... 30
3.1 An example of two tables (R and S) loaded on 3 workers / 4 nodes, using 3-way partitioning and a replication degree of 2: after the initial load (upper image) vs. during a system failover (lower image) .......... 34
3.2 Instrumenting the HDFS block replication, in three scenarios, to maintain block (co-)locality on worker nodes .......... 35
3.3 Baseline (default policy) vs. collocation (custom policy) – HDFS I/O throughput measured in 3 situations: (1) healthy state, (2) failure state (2 nodes down), (3) recovered state. Queries 1, 4, 6, and 12 on partitioned tables, 32-partitioning, 8 workers / 10 Hadoop "Rocks" nodes, 32 mpl .......... 40
4.1 Starting the Vectorwise on Hadoop worker-set through the VectorwiseLauncher and DbAgent. All concepts related to this section are emphasized in bold black .......... 43
4.2 The flow network (bipartite graph) model used to match partitions to worker nodes .......... 46
4.3 Assigning responsibilities to worker nodes. All concepts related to this section are emphasized in bold black and red/blue/green colors (for single (partition) responsibilities) .......... 48
4.4 The flow network (bipartite graph) model used to assign responsibilities to worker nodes .......... 49
4.5 The flow network (bipartite graph) model used to calculate a part of the resource footprint (which nodes and table partitions we use for a query, given the workers' available resources and data-locality). Note: fx is the unit by which we increase the flow along the path at step x of the algorithm .......... 57
4.6 Selecting the (partition) responsibilities and worker nodes involved in query execution. Only the first two workers from the left are chosen (bold-italic black) for table scans; the first one reads from two partitions (red and green), whereas the second reads from just one (blue). The third worker is considered overloaded and is therefore excluded from the I/O. Red/blue/green colors are used to express multiple (partition) responsibilities .......... 58
5.1 Timeline of different query workloads .......... 69
5.2 Out-of-band approach for Vectorwise-YARN integration. An example using 2 Hadoop nodes that equally share 4 YARN containers .......... 70
5.3 Timeline of different query workloads (one-time release) .......... 71
5.4 Database session example with one active query: the communication protocol to increase/decrease resources .......... 72
5.5 The workload monitor's and resource allocation's workflow for: a) a successful (resource) request and b) a failed (resource) request (VW – Vectorwise, MR – Map-Reduce) .......... 74
5.6 YARN's allocation overhead: TPC-H scale factor 100, 32 partitions, 8 "Stones" nodes, 128 mpl .......... 78
6.1 Running with (w/) and without (w/o) short-circuit when block collocation is enabled. TPC-H benchmark: 16 partitions for Orders and Lineitem, 8 "Rocks" workers, 32 mpl .......... 83
6.2 Overall runtime performance: baseline vs. collocation versions, based on Table 6.3 .......... 84
6.3 Average throughput per query: baseline vs. collocation versions, related to Table 6.3 .......... 85
6.4 Overall runtime performance: baseline vs. collocation versions, based on Table 6.4 .......... 86
6.5 The worker nodes and (partition) responsibilities involved in query execution during (a) and (b). We use red color to emphasize the latter, plus the number of threads per worker needed to satisfy the 32 mpl. We also show the state before the failure: the local responsibilities are represented in black and the partial ones, from the two new nodes, in grey .......... 87
6.6 Overall runtime performance: collocation vs. drm versions, based on Table 6.5 .......... 88
6.7 Overall runtime performance: Q1, Q4, Q6, Q9, Q12, Q13, Q19, and Q21, based on Table 6.5 .......... 89
6.8 Average throughput per query: collocation vs. drm versions, related to Table 6.5 .......... 90
6.9 The worker nodes, partition responsibilities, and the number of threads per node (out of the maximum available) involved in query execution, for Table 6.6. We use red color to highlight the latter information. The metadata for partition locations (see Section 3.2) is shown on the left and the responsibility assignment on the right .......... 92
6.10 The worker nodes, partition responsibilities, and the number of threads per node (out of the maximum available) involved in query execution, for Table 6.7's half-set busyness runs. We use red color to highlight the latter information .......... 93
6.11 The worker nodes, partition responsibilities, and the number of threads per node (out of the maximum available) involved in query execution, for Table 6.8's half-set busyness runs. We use red color to highlight the latter information .......... 94
6.12 The worker nodes, partition responsibilities, and the number of threads per node (out of the maximum available) involved in query execution, for Table 6.9's half-set busyness runs. We use red color to highlight the latter information. With blue color we mark the remote responsibilities assigned at runtime due to overloaded workers .......... 96
6.13 Average CPU load (notation: [min to Max]) for drm, while running Q1 and Q13 side by side with 16, 32 and 64 Mappers and Reducers, on a full-set (left) and a half-set (right) .......... 96
6.14 Average CPU load (notation: [min to Max]) for drmy, while running Q1 and Q13 side by side with 16, 32 and 64 Mappers and Reducers, on a full-set (left) and a half-set (right) .......... 97

List of Tables

1.1 A short summary of the related research in Hadoop-enabled MPP databases (including Vectorwise), comprising their architectural advantages and disadvantages.