Big Data Landscape for Databases
Bob Baran, Senior Sales Engineer
[email protected]
May 12, 2015

Typical Database Workloads
(workloads are listed from operational to analytical)

OLTP Applications
• Typical databases: MySQL, Oracle
• Use cases: ERP, CRM, Supply Chain
• Workload strengths: Real-time updates; ACID transactions; High concurrency of small reads/writes; Range queries

Real-Time Web, Mobile, and IoT Applications
• Typical databases: MongoDB, Cassandra, MySQL, Oracle
• Use cases: Web, mobile, social; IoT
• Workload strengths: Real-time updates; High ingest rates; High concurrency of small reads/writes; Range queries

Real-Time, Operational Reporting
• Typical databases: MySQL, Oracle
• Use cases: Operational Datastores; Crystal Reports
• Workload strengths: Real-time updates; Canned, parameterized reports; Range queries

Ad-Hoc Analytics
• Typical databases: Greenplum, Paraccel, Netezza
• Use cases: Exploratory Analytics; Data Mining
• Workload strengths: Complex queries requiring full table scans; Append only

Enterprise Data Warehouses
• Typical databases: Teradata, Oracle, Sybase IQ
• Use cases: Enterprise Reporting
• Workload strengths: Parameterized reports against historical data

Recent History of RDBMSs
▪ RDBMS Definition
  ▪ Relational with joins
  ▪ ACID transactions
  ▪ Secondary indexes
  ▪ Typically row-oriented
  ▪ Operational and/or analytical workloads
▪ By early 2000s
  ▪ Limited innovation
  ▪ Looked like Oracle and Teradata won…

Hadoop Shakes Up Batch Analytics
▪ Data processing framework
  ▪ Cheap distributed file system
  ▪ Brute force, batch processing through MapReduce
▪ Great for batch analytics
  ▪ Great place to dump data to look at later

NoSQL Shakes Up Operational DBs
▪ NoSQL wave
  ▪ Companies like Google, Amazon and LinkedIn needed greater scalability & schema flexibility
  ▪ New databases developed by developers, not database people
  ▪ Provided scale-out, but lost SQL
▪ Worked well at web startups because:
  ▪ In some cases, use cases did not need ACID
  ▪ Willing to handle exceptions at app level

Convoluted Evolution of Databases
[Figure: database generations plotted by Scalability (vertical axis, Scale Up to Scale Out) against Functionality (horizontal axis): Indexed Files (ISAM), 1960s; Hierarchical/Network Databases, 1970s; Traditional RDBMSs, 1980s-2000s; Hadoop, 2005; NoSQL Databases, 2010; Scale-out SQL Databases, 2013]

Mainstream user changes
▪ Driven by web, social, mobile, and Internet of Things
▪ Major increases in scale – 30% annual data growth
▪ Significant requirements for semi-structured data
  ▪ Though relatively little unstructured
▪ Technology adoption continuum
  ▪ "What is it?": Scale-out SQL DBs for operational apps
  ▪ "Should I use it?": NoSQL for web apps; Hadoop for analytics
  ▪ "Why wouldn't I use it?": Cloud technologies

Schema on Ingest vs. Schema on Read
[Diagram: Schema on Ingest sits at the Data Stream; Schema on Read sits at the Application]
Schema on Ingest
• Structured data should always remain structured
• Add schema if data used regularly
Schema on Read
• Schema on Read if you only use data a few times a year
▪ Even "schemaless" MongoDB requires "schema"
  - 10 Things You Should Know About Running MongoDB At Scale
    • By Asya Kamsky, Principal Solutions Architect at MongoDB
    • Item #1 – "have a good schema and indexing strategy"

Scale-out is the future of databases
How do I scale?
▪ Scale Up
▪ Scale Out
  ▪ NoSQL
  ▪ NewSQL
  ▪ SQL-on-Hadoop (Hadoop RDBMS; Analytic Engines)
  ▪ MPP

NoSQL
Pros
▪ Easy scale-out
▪ Flexible schema
▪ Easier web development with hierarchical data structures (MongoDB)
▪ Cross-data center replication (Cassandra)
Cons
▪ No SQL – requires retraining and app rewrites
▪ No joins – i.e., no cross row/document dependencies
▪ No reliable updates through transactions across rows/tables
▪ Eventual consistency (Cassandra)
▪ Not designed to do aggregations required for analytics

NewSQL
Pros
▪ Easy scale-out
▪ ANSI SQL – eliminates retraining and app rewrites
▪ Reliable updates through ACID transactions
▪ RDBMS functionality
▪ Strong cross-data center replication (NuoDB)
Cons
▪ Proprietary scale-out, unproven into petabytes
▪ Must manage another distributed infrastructure beyond Hadoop
▪ Cannot leverage Hadoop ecosystem of tools

NewSQL – In-Memory
Pros
▪ Easy scale-out
▪ High performance because everything in memory
▪ ACID transactions within nodes
Cons
▪ Memory 10-20x more expensive
▪ Limited SQL
▪ Limited cross-node transactions
▪ Proprietary scale-out, unproven into petabytes
▪ Must manage another distributed infrastructure beyond Hadoop
▪ Cannot leverage Hadoop ecosystem

Operational RDBMS on Hadoop
Pros
▪ Easy scale-out
▪ Scale-out infrastructure proven into petabytes
▪ ANSI SQL – eliminates retraining and app rewrites
▪ Reliable updates through ACID transactions
▪ Leverages Hadoop distributed infrastructure and tool ecosystem
Cons
▪ Full table scans slower than MPP DBs, but faster than traditional RDBMSs
▪ Existing HDFS data must be re-loaded through SQL interface

MPP Analytical Databases
Pros
▪ Easy scale-out
▪ Very fast performance for full table scans
▪ Highly parallelized, shared-nothing architectures
▪ May have columnar storage (Vertica)
▪ No maintenance of indexes (Netezza)
Cons
▪ Poor concurrency models prevent support of real-time apps
▪ Poor performance for range queries
▪ Need to redistribute all data to add nodes (hash partitioning)
▪ May require specialized hardware (Netezza)
▪ Proprietary scale-out – cannot leverage Hadoop ecosystem of tools

SQL-on-Hadoop – Analytical Engines
Pros
▪ Easy scale-out
▪ Scale-out proven into petabytes
▪ Leverages Hadoop distributed infrastructure
▪ Can leverage Hadoop ecosystem of tools
Cons
▪ Relatively immature, especially compared to MPP DBs
▪ Limited SQL
▪ Poor concurrency models prevent support of real-time apps
▪ No reliable updates through transactions
▪ Intermediate results must fit in memory (Presto)

Future: Hybrid In-Memory Architectures
Memory Cache with Disk
- Unsophisticated memory management
Pure In-Memory
- Very expensive
Hybrid In-Memory
- Flexible, cost-effective
- Controlled by optimizer
- In-memory materialized views?
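
The MPP Analytical Databases slide lists "need to redistribute all data to add nodes (hash partitioning)" as a drawback. The sketch below illustrates why, assuming the simplest scheme, where a row's node is its key hash modulo the node count; actual MPP products may partition differently. The function names and row keys are hypothetical. With a uniform hash, a row keeps its node only when the hash leaves the same remainder under both node counts, so growing from 4 to 5 nodes should relocate roughly four fifths of the rows.

```python
import hashlib


def node_for(key: str, num_nodes: int) -> int:
    """Assign a key to a node: hash the key, then take the remainder (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    """Fraction of keys whose node assignment changes when the cluster is resized."""
    moved = sum(
        1 for k in keys if node_for(k, old_nodes) != node_for(k, new_nodes)
    )
    return moved / len(keys)


# Hypothetical row keys, for illustration only.
keys = [f"row-{i}" for i in range(10_000)]

moved = fraction_moved(keys, old_nodes=4, new_nodes=5)
print(f"{moved:.0%} of rows change nodes when growing from 4 to 5 nodes")
```

Adding a single node therefore touches most of the stored data under this scheme, which is the cost the slide points at.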

Summary – Future of Databases
▪ Predicted Trends
  ▪ Scale-out dominates databases
  ▪ Developers stop worrying about data size and develop new data-driven apps
  ▪ Hybrid in-memory architecture becomes mainstream
▪ Predicted Winners
  ▪ Hadoop becomes de facto distributed file system
  ▪ NoSQL used for simple web apps
  ▪ Scale-out SQL RDBMSs replace traditional RDBMSs

Questions?
Bob Baran, Senior Sales Engineer
[email protected]
May 12, 2015

Powering Real-Time Apps on Hadoop
Bob Baran, Senior Sales Engineer
[email protected]
May 12, 2015

Who Are We?
THE ONLY HADOOP RDBMS
Power operational applications on Hadoop
▪ Affordable, Scale-Out – Commodity hardware
▪ Elastic – Easy to expand or scale back
▪ Transactional – Real-time updates & ACID Transactions
▪ ANSI SQL – Leverage existing SQL code, tools, & skills
▪ Flexible – Support operational and analytical workloads
10x Better Price/Perf

What People are Saying…
Recognized as a key innovator in databases
Quotes
▪ "Scaling out on Splice Machine presented some major benefits over Oracle ...automatic balancing between clusters...avoiding the costly licensing issues."
▪ "An alternative to today's RDBMSes, Splice Machine effectively combines traditional relational database technology with the scale-out capabilities of Hadoop."
▪ "The unique claim of … Splice Machine is that it can run transactional applications as well as support analytics on top of Hadoop."
Awards

Advisory Board
Advisory Board includes luminaries in databases and technology
▪ Mike Franklin
  ▪ Computer Science Chair, UC Berkeley
  ▪ Director, UC Berkeley AmpLab
  ▪ Founder of Apache Spark
▪ Roger Bamford
  ▪ Former Principal Architect at Oracle
  ▪ Father of Oracle RAC
▪ Marie-Anne Neimat
  ▪ Co-Founder, Times-Ten Database
  ▪ Former VP, Database Eng. at Oracle
▪ Ken Rudin
  ▪ Head of Analytics at Facebook
  ▪ Former GM of Oracle Data Warehousing

Combines the Best of Both Worlds
Hadoop
▪ Scale-out on commodity servers
▪ Proven to 100s of petabytes
▪ Efficiently handle sparse data
▪ Extensive ecosystem
RDBMS
▪ ANSI SQL
▪ Real-time, concurrent updates
▪ ACID transactions
▪ ODBC/JDBC support

Focused on OLTP and Real-Time Workloads

OLTP Applications
• Typical databases: MySQL, Oracle
• Use cases: ERP, CRM, Supply Chain
• Workload strengths: Real-time updates; ACID transactions; High concurrency of small reads/writes; Range queries

Real-Time Web, Mobile, and IoT Applications
• Typical databases: MySQL, Oracle, MongoDB, Cassandra
• Use cases: Web, mobile, social; IoT
• Workload strengths: Real-time updates; High ingest rates; High concurrency of small reads/writes; Range queries

Real-Time, Operational Reporting
• Typical databases: MySQL, Oracle
• Use cases: Operational Datastores; Crystal Reports
• Workload strengths: Real-time updates; Canned, parameterized reports; Range queries

Ad-Hoc Analytics
• Typical databases: Greenplum, Paraccel, Netezza
• Use cases: Exploratory Analytics; Data Mining
• Workload strengths: Complex queries requiring full table scans; Append only

Enterprise Data Warehouses
• Typical databases: Teradata, Oracle, Sybase IQ
• Use cases: Enterprise Reporting
• Workload strengths: Parameterized reports against historical data

OLTP Campaign Management: Harte-Hanks
Overview
▪ Digital marketing services provider
▪ Unified Customer Profile
▪ Real-time campaign management
▪ OLTP environment with BI reports
Challenges
▪ Oracle RAC too expensive to scale
▪ Queries too slow – even up to ½ hour
▪ Getting worse – expect 30-50% data growth
▪ Looked for 9 months for a cost-effective solution
Solution Diagram
[Diagram labels: Cross-Channel Campaigns; Real-Time Personalization; Real-Time Actions]
Initial Results
▪ 10-20x price/perf with no application, BI or ETL rewrites
▪ ¼ cost with commodity scale out
▪ 3-7x faster through parallelized queries

Reference Architecture: Operational Data Lake
Offload real-time reporting and analytics from expensive OLTP and DW systems
[Diagram labels: OLTP Systems (ERP, CRM, Supply Chain); Stream or Batch Updates; Operational Data Lake; ETL; Data Warehouse; Executive Business Reports; Ad Hoc]