Data In. Decisions Out.

Bart Sjerps, Advisory Technology Consultant, Oracle SME - EMEA
[email protected] | +31-6-27058830 | Blog: http://bartsjerps.wordpress.com
5/20/2011

"If I'd asked my customers what they wanted, they'd have said a faster horse." - Henry Ford

• I'm pretty sure that if Ford had asked his customers what they wanted, they'd have said something like faster horses, and the reason is fairly simple: they couldn't imagine anything else.
• In fact, they didn't want faster horses; they wanted a faster method of personal transportation. It's as simple as that. For Henry Ford, achieving this goal was impossible with a horse, so he came up with the idea of building a car that everybody could afford. Nobody knew they needed a car before they saw the Model T (and knew they could afford it).

Database sizes over the years (T-Mobile, Rob Strickland, CIO):

- 1996: 11 TB, in a database
- 1999: 130 TB, in a Teradata database
- May 2008: 2 PB, Yahoo

It's Time for a Change . . .

Yesterday's Data Warehouse and Analytic Infrastructure -> The Greenplum Future

- Proprietary -> Commodity
- Expensive -> Cost-Effective
- Centralized, Monolithic -> Distributed
- Process-Heavy -> Self-Service
- Batch -> Real-Time
- Summarized -> Deep
- Slow -> Agile

Greenplum – True market disruption

- 20 TB: 20 kW, 8 racks, $20M
- 70 TB: 20 kW, 6 racks, $7M
- 100 TB: 12 kW, 2 racks, $1.8M

Market Momentum

• 170+ global enterprise customers
• 100%+ year-over-year growth in 2009
• Acquired by EMC in July 2010
• Growing more quickly than Netezza and Teradata
• $250+ million saved by customers choosing GP over Teradata
• 5+ billion shares analyzed daily by financial markets using GP
• 20+ trillion rows being mined for business value
• 1+ billion consumers receiving more secure and personalized services from GP customers

Industry Recognition: 2009 Gartner Magic Quadrant

Gartner:
• Strengths
  – Scale
  – Mixed workloads
  – Cloud ready
  – Self service
  – Low cost
• Concerns
  – Company size (fixed by EMC)
  – R&D budget (fixed by EMC)

(2007 was our first year on the MQ)

Source: Gartner (January 2010)

Customers by Industry

Financial Services

Telco

Media & Internet

Retail

Gov’t & Health/Ins.

Greenplum Database: Data in. Decisions out.

Data In: Scatter/Gather Streaming™ for the world's fastest data loading
• Eliminate data load bottlenecks
• Clean and integrate new data
• Several loading options, ranging from bulk load updates to micro-batching for near-real-time processing

In-Database Analytics: optimized for fast query execution and linear scalability
• Move processing closer to the data
• Shared-nothing MPP scale-out architecture
• Computing is automatically optimized and distributed across resources
• Provides the best concurrent multi-workload performance

Decisions Out: unified data access for greater insight and value from data
• Enable parallel analysis across the enterprise
• Open platform with broad language support
• Certified enterprise connectivity and integration with most BI, ETL and management products

Greenplum Database Architecture Overview

Data Computing Division Product Portfolio

• Greenplum Database: the industry's most scalable MPP analytic database
• Greenplum Data Computing Appliance: the world's most powerful purpose-built database system
• Greenplum Chorus: Enterprise Data Cloud platform; virtualized, self-service analytic infrastructure
• Greenplum Community Edition: free entry-level database

Deployment models

• Greenplum Community Edition
  – Free download
  – Limited to 2 segment servers
  – All software is enabled
• Greenplum software only
  – E.g. run on vSphere / Vblock
  – Or on standard (Intel) servers
• Greenplum DCA appliance
  – Pre-configured, tested, supported, plug & play
  – Huge bandwidth
• DCA appliance with hybrid DAS / SAN

Architecture of Greenplum DCA

Flexible framework for processing large datasets

Process large datasets with support for both SQL and MapReduce
• UDFs: R, Java, C, Python, Perl, etc.
• Client access: ODBC, JDBC, OLE DB; BI/ETL tools

Master servers optimize queries for the most efficient query execution

Interconnect for continuous pipelining of data processing

Segment servers process queries close to the data, in parallel

MPP Scatter/Gather Streaming™ for fast loading of data

Architecture

• Based on the PostgreSQL (open source) database
  – 15+ years of development
  – Feature-rich and ready for mission-critical use
• Greenplum adds features on top of PostgreSQL
  – Very low development cost (compared to traditional RDBMS vendors)
• Linear scale-out
• Parallel loading
• Does not depend on classic (OLTP) RDBMS tricks
  – Special indexes, materialized views, …

Greenplum Database: Technical Stack

CLIENT ACCESS & TOOLS
• Client access: ODBC, JDBC, OLE DB, etc.
• 3rd-party tools: BI tools, ETL tools, data mining, etc.
• Admin tools: GP Performance Monitor, pgAdmin3 for GPDB

PRODUCT FEATURES
• Loading & external access: petabyte-scale loading, trickle micro-batching, anywhere data access
• Storage & data access: hybrid storage & execution (row- and column-oriented), in-database compression, multi-level partitioning, indexes (B-tree, bitmap, etc.)
• Language support: comprehensive SQL, native MapReduce, SQL 2003 OLAP extensions, programmable analytics

GPDB ADAPTIVE SERVICES
• Multi-level fault tolerance, online system expansion, workload management

CORE MPP ARCHITECTURE
• Shared-nothing MPP, parallel dataflow engine, parallel query optimizer, gNet™ software interconnect, Polymorphic Data Storage™, MPP Scatter/Gather Streaming™

What is MPP & Shared Nothing?

MPP = Massively Parallel Processing
• Two or more servers (each with its own CPU/RAM/disk) working on the same task
• Multiple units of parallelism working together
• Parallel database operations
• Parallel CPU processing
• Segments = Greenplum's units of parallelism (each one a Postgres database)

'Shared Nothing' architecture
• Each segment is a separate Postgres database
• Segments operate only on their own portion of the data
• Segments are self-sufficient
• Dedicated CPU processes
• Dedicated storage that is accessible only by the segment

Shared-Nothing Architecture: Massively Parallel Processing (MPP)

• Most scalable database architecture – Optimized for BI and analytics

• Provides automatic parallelization
  – No need for manual partitioning or tuning
  – Just load and query like any database
• Tables are distributed across segments
  – Each has a subset of the rows
• Extremely scalable and I/O optimized
  – All nodes can scan and process in parallel
  – No I/O contention between segments
• Linear scalability by adding nodes
  – Each node adds storage, query performance and loading performance

Greenplum Database Master Node

• Stores no user data

• Manages global system catalog

• Provides a single view over multiple, independent Postgres databases

• Performs user authentication, query parsing/optimizing, error messaging, returns result sets to the Client

• Most importantly : Creates MPP-optimized query plan for broadcast to GP cluster

Anatomy of a Segment Node
Four Postgres databases running within one segment host server

Diagram: one segment host runs four Postgres segment databases (open-source Postgres on Red Hat / SuSE / CentOS or Solaris). Each primary segment (A1–A4, B1–B4, C1–C4, D1–D4) is paired with a mirror segment hosted on another server. Host hardware: two Intel quad-core (G6) CPUs, 48 GB RAM, Gig/E interconnects, and 4 × 6 SAS/SATA drives configured as RAID 5 sets.

Greenplum Database: How a Distributed Database Works

Data Distribution: The Key to Parallelism

Strategy: spread data evenly across as many nodes (and disks) as possible

Order # | Order Date   | Order Customer
43      | Oct 20 2005  | 12
64      | Oct 20 2005  | 111
45      | Oct 20 2005  | 42
46      | Oct 20 2005  | 64
77      | Oct 20 2005  | 32
48      | Oct 20 2005  | 12
50      | Oct 20 2005  | 34
56      | Oct 20 2005  | 213
63      | Oct 20 2005  | 15
44      | Oct 20 2005  | 102
53      | Oct 20 2005  | 42
55      | Oct 20 2005  | 55

Distribution Policies

• Hash distribution
  – CREATE TABLE … DISTRIBUTED BY (column [, …])
  – Keys of the same value are always sent to the same segment
• Round-robin distribution
  – CREATE TABLE … DISTRIBUTED RANDOMLY
  – Rows with columns of the same value are not necessarily on the same segment
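The two policies can be sketched as DDL; this is an illustrative example (table and column names are not from the deck), not a definitive schema:

```sql
-- Hash distribution: rows with equal customer_id always land on the same segment
CREATE TABLE orders (
    order_id    int,
    order_date  date,
    customer_id int
) DISTRIBUTED BY (customer_id);

-- Round-robin distribution: rows are spread evenly, with no value-to-segment affinity
CREATE TABLE staging_events (
    event_time timestamp,
    payload    text
) DISTRIBUTED RANDOMLY;
```

A good hash key is a high-cardinality column used in joins; a skewed key concentrates data (and work) on a few segments.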

Planning & Dispatching a Query
Master = Query Dispatch (QD); Segment = Query Execution (QE)
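The QD/QE split is visible in a query plan: the master produces a parallel plan whose Motion nodes mark where tuples move between segments and master. A sketch (table name reused from the data-distribution slides; exact plan text varies by version and cluster size):

```sql
EXPLAIN SELECT customer_id, count(*)
FROM orders
GROUP BY customer_id;
-- Typical plans end in a Gather Motion node: each segment aggregates
-- its local rows (QE), and the master collects the results (QD).
```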

Further Improve Scan Times

SELECT COUNT(*) FROM orders WHERE order_date >= 'Oct 20 2005' AND order_date < 'Oct 27 2005'

Diagram: hash distribution alone vs multi-level partitioning. With hash distribution only, every segment (1A–3D) must scan all of its data for this date-bounded query; with multi-level partitioning, each segment scans only the partitions covering the requested date range.
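A hedged sketch of the idea: distribute by hash for parallelism, then range-partition by date so the query above touches only one week of data per segment (names, dates and intervals are illustrative):

```sql
CREATE TABLE orders (
    order_id    int,
    order_date  date,
    customer_id int
)
DISTRIBUTED BY (order_id)
PARTITION BY RANGE (order_date)
( START (date '2005-10-01') INCLUSIVE
  END   (date '2006-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 week') );
```

With this layout the planner eliminates partitions outside Oct 20–27 before any segment reads a block.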

Greenplum Database: Key Features and Differentiators

Greenplum Database: Core Architecture & Dynamic Services

GPDB DYNAMIC SERVICES
• Self-healing fault tolerance, online system expansion, dynamic workload management

CORE MPP ARCHITECTURE
• Shared-nothing MPP, parallel dataflow engine, parallel query optimizer, gNet™ software interconnect, Polymorphic Data Storage™ (row/column/compressed), MPP Scatter/Gather Streaming™

Key Technical Innovations

Scatter/Gather Data Streaming • Industry-leading data loading capabilities

Online Expansion • Dynamically provision new servers with no downtime

Map-Reduce Support • Parallel programming on data for advanced analytics

Polymorphic Storage • Support for both row- and column-oriented storage

Greenplum Polymorphic Storage™: Flexible Row- or Column-Oriented Processing


• Rather than take a side, we give customers the flexibility of both
  – Results consistent with industry/academic findings on row vs column benefits
• Row orientation is typically better for a general-purpose DW
  – Avoids the reassembly overhead that dominates typical workloads
• Column orientation is typically better for an important set of use cases
  – Accessing a small number of columns from a wide table, e.g. certain data mining use cases
• Table orientation
  – Just specify 'orientation=row' or 'orientation=column' when creating a table
  – Gzip and LZ compression algorithms are available with either orientation
• Gives customers the choice of processing model for any table
• Efficient pre-projection and parallel execution in either case
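The orientation choice is per-table DDL. A sketch (names are illustrative; note that in Greenplum 4.x column orientation requires an append-only table, hence the appendonly=true option):

```sql
-- Row-oriented heap table: the default, good for general-purpose DW work
CREATE TABLE measurements_row (
    sensor_id int,
    ts        timestamp,
    value     numeric
) DISTRIBUTED BY (sensor_id);

-- Column-oriented, compressed append-only table: good when queries
-- touch a few columns of a wide table
CREATE TABLE measurements_col (
    sensor_id int,
    ts        timestamp,
    value     numeric
)
WITH (appendonly=true, orientation=column,
      compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (sensor_id);
```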

High Availability: Self-Healing and Rapid Recovery

Master server data protection • RAID protection for drive failures • Replicated transaction logs for server failure

On master server failure
• Standby server is activated
• Administrator is alerted

Segment server data protection
• RAID protection for drive failures
• Mirrored segments for server failures
On segment server failure
• Mirrored segments take over with no loss of service
• Fast online differential recovery

Scatter/Gather Streaming™ for the world's fastest data loading speeds

• Parallel-everywhere approach to data loading
• Data is scattered from all source systems to all database nodes
  – Across hundreds or thousands of simultaneous parallel streams
  – Data can be transformed and processed on the fly (ELT or ETLT)
• Gathering and storage of data takes place on all nodes simultaneously
  – Data is automatically partitioned across nodes and optionally compressed
• Supports both large batch and continuous near-real-time loading patterns
  – With negligible impact on concurrent database operations
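In practice this loading path is typically exposed through external tables served by the gpfdist file server: every segment pulls its share of the files in parallel, so load speed scales with the cluster. A sketch (host, port, paths and table names are illustrative):

```sql
CREATE EXTERNAL TABLE ext_orders (
    order_id    int,
    order_date  date,
    customer_id int
)
LOCATION ('gpfdist://etl-host:8081/orders/*.csv')
FORMAT 'CSV' (HEADER);

-- The INSERT ... SELECT runs on all segments at once;
-- transformations can be applied in the SELECT (ELT style)
INSERT INTO orders
SELECT * FROM ext_orders;
```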

Industry's Fastest Data Loading Rate

Scatter/Gather Streaming™ for the world’s fastest data loading

• Eliminate data load bottlenecks
• Manage lightning-fast data flow
• Parallel everywhere

Chart: load rate in TB per hour. The EMC Greenplum DCA leads Netezza TwinFin by roughly 5× and Oracle Exadata by roughly 2×.

Greenplum 4.0: The World's Most Powerful Analytical Database

Extreme Scale on Commodity Hardware
• From 100s of GBs to the largest multi-petabyte data warehouses: scale is no longer a barrier
• Software-only approach is uniquely appliance-ready and OEM-friendly

Elastic Expansion and Continuous Uptime
• Add servers while online for more storage capacity and performance
• Reliability and availability features accommodate all levels of server, network and storage failures

Massively Parallel Analytic Processing
• Unified parallel engine supports SQL, R, and MapReduce processing across 100s or 1000s of CPU cores
• Comprehensive SQL support (SQL-92, SQL-99, SQL-2003 OLAP)


Greenplum 4.0: The World's Most Powerful Analytical Database (continued)

Petabyte-Scale
• No fear of data growth or of starting small
• Linear and cost-effective scaling on commodity hardware

High-Performance
• Get answers faster than ever before
• Ensures consistent performance as your data grows

Unified Analytics
• Single platform for warehousing, marts, ELT, text mining, statistical analysis
• Enable parallel analysis on any data, at all levels, with SQL, MapReduce, R, etc.

Greenplum vs OLTP DB Architecture

Greenplum
• Shared nothing: master aggregates data
• Scale-out limited only by master servers
• Optimized for analytical processing
• As few indexes as possible (you don't know what you need anyway)
• Divide & conquer
• Limited locking mechanisms -> no OLTP

OLTP
• Shared everything: Cache Fusion
• Scale-out limited to ~4 nodes (if lucky)
• Optimized for transaction processing (OLTP)
• Highly optimized indexes, extensively tuned for known queries
• Distribute workload (if possible)
• Full locking -> OLTP optimized

Greenplum DCA: Instant Price/Performance Leadership

EMC Greenplum DCA (1 rack) | Oracle Exadata X2-8 (full rack) | Netezza TwinFin 12 (full rack) | Teradata 2580 (full rack):
- Architecture: MPP shared-nothing | MPP shared-disk | MPP shared-nothing | MPP shared-nothing
- Servers: 16 | 2 DB + 14 storage | 12 | 4
- DB cores: 192 | 128 DB + 112 storage | 96 (+96 FPGA) | 32
- CPU: Intel E5670 (2.93 GHz) | Intel X7560 (2.26 GHz) | Intel Nehalem (2.66 GHz) | Intel E5460 (3.16 GHz)
- Scan rate (w/o compression): 24 GB/s | 25 GB/s | 10 GB/s | 10 GB/s
- Load rate: >10 TB/hr | 5 TB/hr | 2 TB/hr | TBD
- Usable capacity (uncompressed): 36 TB (600 GB drives) | 28 TB (600 GB) | 32 TB | 15 TB (450 GB)
- Usable capacity (compressed): 144 TB | 112 TB | 128 TB | 20 TB
- Largest multi-rack configuration: 24 racks | 8 racks | 10 racks | 10 racks
- Max DB cores (multi-rack): 4608 | 1024 | 960 | 352

Greenplum 4.0: High Efficiency through Smart Software

• Greenplum's MPP database has extreme scalability on VCE infrastructure
  – Optimized for BI and analytics
  – Fault-tolerant reliability and optimized performance using commodity CPUs, disks and networking
• Provides automatic parallelization
  – No need for manual partitioning or tuning
  – Just load and query like any database
  – Tables are automatically distributed across nodes
• Extremely scalable and I/O optimized
  – All nodes can scan and process in parallel
  – No I/O contention between segments
• Linear scalability by adding nodes
  – Each node adds storage, query performance and loading performance

Greenplum 4.0: Critical Mass Innovation – Advanced Workload Management

Connection Management
• Control over how many users can be connected
• Provides pooling (to allow large numbers) and caps (to restrict numbers if desired)

User-Based Resource Queues
• Each user is assigned to a resource queue that performs 'admission control' of queries into the database
• Allows DBAs to control the total number or total cost of queries allowed in at any point in time

Dynamic Query Prioritization [NEW]
• Patent-pending technique of dynamically balancing resources across running queries
• Allows DBAs to control query priorities in real time, or determine default priorities by resource queue
• Intelligently frees and reacquires temporarily idle session resources
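Resource queues are ordinary SQL objects. A sketch (queue and role names are illustrative, not from the deck):

```sql
-- Admission control: at most 10 concurrent statements from this queue,
-- with a low default priority relative to other queues
CREATE RESOURCE QUEUE reporting_queue
    WITH (ACTIVE_STATEMENTS=10, PRIORITY=LOW);

-- Every query from report_user is now subject to the queue's limits
ALTER ROLE report_user RESOURCE QUEUE reporting_queue;
```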

Greenplum 4.0: Self-Healing Fault Tolerance

1. Segment server fails
2. Mirror segments take over, with no loss of service
3. Segment server is restored or replaced
4. Mirror segments restore the primary via differential recovery (while online)

• Greenplum Database 4.0 enhances fault tolerance using a self-healing physical block replication architecture (segment states: In Sync, Change Tracking, Resync'ing)
• Key benefits of this architecture:
  – Fast differential recovery and catch-up (always read-write)
  – Improved write performance and reduced network load

Greenplum 4.0: SAN-Aware Fault Tolerance

1. Segment server fails
2. Mirror segments take over, with no loss of service
3. Segment server is restored or replaced

• Greenplum Database 4.0 enhances fault tolerance for SAN-attached storage, focused first on EMC SAN
• Key benefits of this architecture:
  – Ability to scale compute servers separately from storage servers
  – No need for segment database mirroring
  – Enhanced archiving and disaster recovery support
  – Effective virtualization with server motion and load balancing

Key Tools Partners


Greenplum Data Computing Appliance

Performance, Scalability, Reliability and Reduced TCO for DW/BI Environments

• Extreme Performance: optimized for fast query execution and unmatched data loading
• Reduced TCO: consolidate data marts for lower costs
• Rapidly Deployable: purpose-built data warehousing appliance
• Private Cloud Ready: data and computing are automatically optimized and distributed
• Highly Available: self-healing and fully redundant
• Advanced Backup and DR: leverage industry-leading Data Domain backup and recovery
• Elastic Scalability: expand capacity and performance online

Greenplum DCA Available Configurations
Half rack (GP100) and Full rack (GP1000)

Diagram: the GP100 (half rack) holds 2 master servers and 8 segment servers; the GP1000 (full rack) adds a second group of 8 segment servers (16 total) on the interconnect bus, plus an expansion bus.

Greenplum DCA Specifications
Half rack (GP100) and Full rack (GP1000)

GP100 (Half Rack) | GP1000 (Full Rack):
- Master servers: 2 | 2
- Segment servers: 8 | 16
- Memory per server: 48 GB | 48 GB
- Total memory: 384 GB | 768 GB
- Segment HDDs (SAS): 96 | 192
- Usable capacity (uncompressed): 18 TB | 36 TB
- Usable capacity (compressed): 72 TB | 144 TB
- Scan rate: 12 GB/s | 24 GB/s
- Data load rate: 5 TB/hour | 10 TB/hour

Greenplum Data Computing Appliance Backup

• Backup and recovery
  – With EMC Data Domain / Greenplum native utilities
• Reduces backup storage requirements
  – Reduces data volume by up to 30×
• Fast, reliable data recovery
  – Reduced recovery time
• Flexible and efficient
  – Designate backup intervals

Latest DCA Announcements (April 5, 2011)

EMC Greenplum DCA is SAN Ready

Diagram: a Greenplum DCA attached to an enterprise SAN at each of two sites, with SAN replication (synchronous or asynchronous) between them over WAN or SAN.

Overall DCD Partner Network … and growing

Hardware Vendors

BI / ETL Tools

Solutions & OEM

Consultants And Resellers

Greenplum myths & facts

• "Migrating to GP is hard"
  – It's as hard (or worse) to migrate older Oracle versions to 11g
  – Similar for other DWH vendors (Teradata, Netezza, …)
  – Why not build a new DWH for flexible queries and leave the existing DWH in place?

• "Limited database functions (open source PostgreSQL)"
  – Maybe, but who needs them for analytics?
  – PostgreSQL gains more and more enterprise features

• "No locking"
  – That's why it runs and loads fast…
  – But you cannot run OLTP/ERP on it (even if you wanted to)

• "You need to buy the GP appliance"
  – Not so: you can build it yourself, or use Vblock or standard hardware ☺
  – Or start with the free downloadable Single Node edition (test / dev / functional POC)
  – More alternatives from EMC to come (with and/or Symm backend)

Greenplum Chorus: Customer Example, Telecom

GP Database + EDC Chorus: from a 100 TB EDW to a 1-petabyte EDC

Customer challenge:
• 100 TB Teradata EDW focused on operational reporting and financial consolidation
• EDW is the single source of truth, under heavy governance and control
• Unable to support all of the critical data initiatives surrounding the business
• Customer loyalty and churn the #1 business initiative, from the CEO on down

Greenplum Database + Chorus (1-petabyte EDC):
• Extracted data from the EDW and other source systems to quickly assemble a new analytic mart
• Generated a social graph from call detail records and subscriber data
• Within 2 weeks uncovered behavior where "connected" subscribers were 7× more likely to churn than the average user
• Now deploying a 1 PB production EDC with GP to power their analytic initiatives

Customer Example: Zions Bancorporation – Teradata Bake-Off

• Business problem
  – DW and data mart consolidation across regional bank operations
  – Improved query performance for both operational and ad-hoc reporting
  – In-database analytics to support advanced data mining initiatives
• Existing solution: Oracle
• Benefits over Teradata
  – Open systems, commodity HW
  – Significantly better TCO
  – Incremental scalability
  – Better price-performance

Chart: response time (min) fell from roughly 80 minutes on the previous DB to a small fraction of that on Greenplum.

“We turned to Greenplum because its massively parallel data warehousing approach is the only one robust and cost effective to grow with us over time.” - Walter Young, SVP Corporate Finance, Zions Bancorporation

Customer Example: Franklin Templeton – Netezza vs GP

• Business problem
  – Exorbitant maintenance and support costs for the enterprise data warehouse
  – Poor data load and ad-hoc query performance on the existing Oracle system
  – Needed a scalable platform capable of consolidating multiple decision-support DBMSs
• Existing solution: Oracle
• Benefits over Netezza
  – Open systems, commodity HW
  – Support model that fit their existing data center operations
  – Incremental scalability
  – Better price-performance

Chart: response time (min) fell from roughly 500 minutes on the previous DB to minutes on Greenplum.

“Queries that timed-out after 8 hours now run in less than 10 minutes.” - Baljinder Singh, Sr. Director Data Warehousing

Customer Example: NASDAQ – Enterprise Standard

• Business problem
  – Analytic database platform standard across global exchange operations
• Key criteria
  – Mission-critical reliability
  – High concurrency, mixed workloads
  – Incremental scalability
• Data size
  – 10 TB to multi-hundred-TB systems
  – Loading 1 TB/day to 2 TB/day
• Result
  – 6 production systems deployed globally

Chart: daily load volume grew from about 1 TB/day in Jan '08 to about 2 TB/day in Jan '09.

"Greenplum offers strong scalability advantages due to its highly parallel model that enables us to simply add more servers as data volumes expand." - Anna Ewing, CIO, NASDAQ OMX

Customer Example: NYSE – Summary

• Business problem
  – Processing and analyzing system performance data
• Key criteria
  – Scalability
• Data size
  – Increasing from 1 TB/day to 2 TB/day
• Result
  – Processing time reduced from 12 hours (on SAS) to less than 5 minutes

“In our proof of concept, we saw Greenplum reach data loading speeds of over three terabytes per hour, and we know that the database can scale even further than that. Greenplum’s fast performance is critical for us.” - Steven Hirsch, Chief Data Officer, NYSE Euronext

Customer Example: NYSE – Details

• Platform
  – 20-node system plus 2 ETL nodes, for 100 TB of data capacity
• Hardware
  – SunFire x4500 -> x4540
• ETL SLAs + in-database analytics example:
  – Every minute during the trading day they load a large amount of trade latency data and then run 'alerting' queries (about 5 of them) to monitor for suspicious/unusual activity.

• If anything is found, the proper people are notified in real time.

• This is interesting because the large data load and the analytical queries must all finish in under one minute so that the next iteration of the cycle can begin.

• Other databases were not able to meet that time requirement.

5/20/2011 54 Driving the Future of Data Warehousing and Analytics

Questions?