Greenplum Data Warehouse Technical
Data In. Decisions Out.
Bart Sjerps
Advisory Technology Consultant, Oracle SME - EMEA
[email protected]
+31-6-27058830
5/20/2011
Blog: http://bartsjerps.wordpress.com

“If I’d asked my customers what they wanted they’d have said a faster horse.” Henry Ford
• I’m pretty sure that if Ford had asked his customers what they wanted, they’d have said something like faster horses, and the reason is fairly simple: they couldn’t imagine anything else.
• In fact, they didn’t want faster horses; they wanted a faster personal transportation method. It’s as simple as that. For Henry Ford, achieving this goal was absolutely impossible with a horse, so he came up with the idea of building a car that everybody could afford. Nobody knew they needed a car before they saw the Model T (and knew they could afford it).

T-Mobile: Rob Strickland, CIO

Database sizes over the years:
– 1996: 11 TB, in a Teradata Database
– 1999: 130 TB, in a Teradata Database
– May 2008: 2 PB, Yahoo

It’s Time for a Change

Yesterday’s Data Warehouse and Analytic Infrastructure    The Greenplum Future
Proprietary                                               Commodity
Expensive                                                 Cost-Effective
Centralized, Monolithic                                   Distributed
Process-Heavy                                             Self-Service
Batch                                                     Real-Time
Summarized                                                Deep
Slow                                                      Agile

Greenplum: True market disruption
20 Terabytes      70 Terabytes      100 Terabytes
20 kW, 8 Racks    20 kW, 6 Racks    12 kW, 2 Racks
$20M              $7M               $1.8M

Market Momentum
• 170+ global enterprise customers
• 100%+ year-over-year growth in 2009
• Acquired by EMC in July 2010
• Growing more quickly than Netezza and Teradata
• +$250 million saved by customers choosing GP over Teradata
• +5 billion shares analyzed daily by financial markets using GP
• +20 trillion rows being mined for business value
• +1 billion consumers receiving more secure and personalized services from GP customers

Industry Recognition: 2009 Gartner Magic Quadrant
Gartner:
• Strengths
  – Scale
  – Mixed workloads
  – Cloud ready
  – Self service
  – Low cost
• Concerns
  – Company size (fixed by EMC)
  – R&D budget (fixed by EMC)
* 2007 was our first year on the MQ
Source: Gartner (January 2010)

Customers by Industry
• Financial Services
• Telco
• Media & Internet
• Retail
• Gov’t & Health/Ins.

Greenplum Database
Data in. Decisions out.
Fastest Data Loading, Advanced Analytics

Data in: Scatter/Gather Streaming™ for the world’s fastest data loading
• Eliminate data load bottlenecks
• Clean and integrate new data
• Several loading options ranging from bulk load updates to micro-batching for near real-time processing

In-Database Analytics: Optimized for fast query execution and linear scalability
• Move processing closer to data
• Shared nothing MPP scale-out architecture
• Computing is automatically optimized and distributed across resources
• Provides the best concurrent multi-workload performance

Decisions out: Unified data access for greater insight and value from data
• Enable parallel analysis across the enterprise
• Open platform with broad language support
• Certified enterprise connectivity and integration with most BI, ETL and management products

Greenplum Database
Architecture Overview

Data Computing Division Product Portfolio
• Greenplum Database: World’s most scalable MPP analytic database platform
• Greenplum Community Edition: Free entry level database
• Greenplum Data Computing Appliance: Industry’s most powerful purpose-built database system
• Greenplum Chorus: Enterprise Data Cloud platform; virtualized, self-service analytic infrastructure

Deployment models
• Greenplum Community Edition
  – Free, downloadable
  – Limited to 2 segment servers
  – All software is enabled
• Greenplum Software Only
  – I.e. run on vSphere / Vblock
  – Or on standard (Intel) servers
• Greenplum DCD Appliance
  – Pre-configured, tested, supported, plug & play
  – Huge bandwidth
• DCD Appliance hybrid DAS / SAN

Architecture of Greenplum DCA
• Flexible framework for processing large datasets: process large datasets with both SQL and MapReduce, with support for UDFs in R, Java, C, Python, Perl
• Client access: ODBC, JDBC, OLEDB etc.; BI/ETL tools
• Master servers optimize queries for the most efficient query execution
• Interconnect for continuous pipelining of data processing
• Segment servers process queries close to the data, in parallel
• MPP Scatter/Gather streaming for fast loading of data

Architecture
• Based on PostgreSQL (open source) database
  – 15+ years of development
  – Feature-rich, mission critical-ready
• Greenplum adds features on top of PostgreSQL
  – Very low development cost (compared to traditional RDBMS vendors)
• Linear scale-out
• Parallel loading
• Not depending on classic (OLTP) RDBMS tricks
  – Special indexes, materialized views, …

Greenplum Database: Technical Stack
• CLIENT ACCESS & TOOLS
  – Client access: ODBC, JDBC, OLEDB, etc.
  – 3rd party tools: BI Tools, ETL Tools, Data Mining, etc.
  – Admin tools: GP Performance Monitor, pgAdmin3 for GPDB
• PRODUCT FEATURES
  – Loading & ext. access: Petabyte-Scale Loading, Trickle Micro-Batching, Anywhere Data Access
  – Storage & data access: Hybrid Storage & Execution (Row- & Column-Oriented), In-Database Compression, Multi-Level Partitioning, Indexes (Btree, Bitmap, etc.)
  – Language support: Comprehensive SQL, Native MapReduce, SQL 2003 OLAP Extensions, Programmable Analytics
• GPDB ADAPTIVE SERVICES
  – Multi-Level Fault Tolerance, Online System Expansion, Workload Management
• CORE MPP ARCHITECTURE
  – Shared-Nothing MPP, Parallel Dataflow Engine, Parallel Query Optimizer, gNet™ Software Interconnect, Polymorphic Data Storage™, MPP Scatter/Gather Streaming™

What is MPP & Shared Nothing?
MPP = Massively Parallel Processing
• Two or more servers (each with its own CPU/RAM/disk) working on the same task
• Multiple units of parallelism working together
• Parallel database operations
• Parallel CPU processing
• Segments = Greenplum units of parallelism (one Postgres database each)

‘Shared Nothing’ Architecture
• Each segment is a separate Postgres database
• Segments only operate on their portion of the data
• Segments are self-sufficient
• Dedicated CPU processes
• Dedicated storage that is only accessible by the segment

Shared-Nothing Architecture: Massively Parallel Processing (MPP)
• Most scalable database architecture
  – Optimized for BI and analytics
• Provides automatic parallelization
  – No need for manual partitioning or tuning
  – Just load and query like any database
• Tables are distributed across segments
  – Each has a subset of the rows
• Extremely scalable and I/O optimized
  – All nodes can scan and process in parallel
  – No I/O contention between segments
• Linear scalability by adding nodes
  – Each adds storage, query performance and loading performance
[Diagram labels: Interconnect, Loading]

Greenplum Database Master Node
• Stores no user data
• Manages global system catalog
• Provides single view of multiple, independent Postgres databases
• Performs user authentication, query parsing/optimizing, error messaging, and returns result sets to the client
• Most importantly: creates MPP-optimized query plan for broadcast to GP cluster

Anatomy of a Segment Node
Four Postgres Databases Running Within One Segment Host Server
• Each of the four segment databases runs open source Postgres
• Operating system: Red Hat / SuSE / CentOS Linux or Solaris

Host   Primary segments   Mirror segments   Storage
A      A1 A2 A3 A4        A4 A1 A2 A3       6 SAS/SATA drives
B      B1 B2 B3 B4        B4 B1 B2 B3       6 SAS/SATA drives
C      C1 C2 C3 C4        C4 C1 C2 C3       6 SAS/SATA drives
D      D1 D2 D3 D4        D4 D1 D2 D3       6 SAS/SATA drives

[Diagram labels: Core, Intel G6, Gig/E (1A, 2A, 1B, 2B), RAM 48GB, RAID 5 sets]

Greenplum Database
How a distributed database works

Data Distribution: The Key to Parallelism
Strategy: spread data evenly across as many nodes (and disks) as possible

Order #   Order Date    Customer ID
43        Oct 20 2005   12
64        Oct 20 2005   111
45        Oct 20 2005   42
46        Oct 20 2005   64
77        Oct 20 2005   32
48        Oct 20 2005   12
50        Oct 20 2005   34
56        Oct 20 2005   213
63        Oct 20 2005   15
44        Oct 20 2005   102
53        Oct 20 2005   42
55        Oct 20 2005   55

Distribution Policies
• Hash Distribution
  – CREATE TABLE … DISTRIBUTED BY (column [,…])
  – Rows with the same key value are always sent to the same segment
• Round-Robin Distribution
  – CREATE TABLE … DISTRIBUTED RANDOMLY
  – Rows with the same column values are not necessarily on the same segment

Planning & Dispatching a Query
• Master = Query Dispatch (QD)
• Segment = Query Execution (QE)

Further Improve Scan Times
SELECT COUNT(*)
FROM orders
WHERE order_date >= 'Oct 20 2005'
  AND order_date < 'Oct 27 2005'
[Diagram: segments 1A–1D, 2A–2D and 3A–3D shown twice, comparing Hash Partition vs. Multi-Level Partition]

Greenplum Database
Key Features and Differentiators

Greenplum Database: Core Architecture & Dynamic Services
• GPDB DYNAMIC SERVICES: Self-Healing Fault Tolerance, Online System Expansion, Dynamic Workload Management
• CORE MPP: Parallel Dataflow Engine, Shared-Nothing MPP, gNet™ Software, Parallel Query Optimizer
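
The hash and round-robin policies from the Distribution Policies slide can be illustrated with a small simulation. This is only a sketch under stated assumptions: the segment count of 4, the CRC32 hash and all function names are hypothetical stand-ins and do not reflect Greenplum's actual hash function or internals. The rows are the twelve orders from the Data Distribution slide.

```python
import zlib
from itertools import cycle

NUM_SEGMENTS = 4  # hypothetical segment count, chosen only for illustration

# (order_number, order_date, customer_id) rows from the Data Distribution slide
ORDERS = [
    (43, "Oct 20 2005", 12), (64, "Oct 20 2005", 111), (45, "Oct 20 2005", 42),
    (46, "Oct 20 2005", 64), (77, "Oct 20 2005", 32), (48, "Oct 20 2005", 12),
    (50, "Oct 20 2005", 34), (56, "Oct 20 2005", 213), (63, "Oct 20 2005", 15),
    (44, "Oct 20 2005", 102), (53, "Oct 20 2005", 42), (55, "Oct 20 2005", 55),
]


def hash_segment(key, num_segments=NUM_SEGMENTS):
    """Map a distribution key to a segment id; equal keys give equal ids."""
    # CRC32 is a stand-in; Greenplum's real hash function is not modeled here.
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute_by_hash(rows, key_index, num_segments=NUM_SEGMENTS):
    """Sketch of DISTRIBUTED BY (column): place each row by the hash of its key."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[hash_segment(row[key_index], num_segments)].append(row)
    return segments


def distribute_round_robin(rows, num_segments=NUM_SEGMENTS):
    """Sketch of the slide's round-robin policy: deal rows out in turn."""
    segments = [[] for _ in range(num_segments)]
    for seg_id, row in zip(cycle(range(num_segments)), rows):
        segments[seg_id].append(row)
    return segments


def segments_holding(segments, key_index, value):
    """Return the set of segment ids that hold at least one row with this value."""
    return {i for i, seg in enumerate(segments)
            for row in seg if row[key_index] == value}


by_hash = distribute_by_hash(ORDERS, key_index=2)  # distribute on customer_id
by_rr = distribute_round_robin(ORDERS)

for name, layout in (("hash", by_hash), ("round-robin", by_rr)):
    print(name, "rows per segment:", [len(seg) for seg in layout])
    print(name, "segments holding customer 12:", segments_holding(layout, 2, 12))
```

With hash distribution both orders for customer 12 land on one segment, so work keyed on the customer can stay local to that segment; with round-robin the row counts are even but the same customer's rows can end up on different segments.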