SQL Engines on Hadoop
Market Report Paper by Bloor
Author: Philip Howard
Publish date: December 2017

Executive summary

Hadoop is used for a lot of different purposes and one major subset of the overall Hadoop market is to run SQL against Hadoop. This might seem contrary to Hadoop’s NoSQL roots, but the truth is that there are lots of existing investments in SQL applications that companies want to preserve; all the leading business intelligence and analytics platforms run using SQL; and SQL skills, capabilities and developers are readily available, which is often not the case for other languages.

However, the market for SQL engines on Hadoop is not mono-cultural. There are multiple use cases for deploying SQL on Hadoop and there are more than twenty different SQL on Hadoop platforms. Mapping the latter to the former is not a trivial task, as different offerings are optimised for some purposes but not others.

The key differentiators between products are the use cases they support, their performance and the level of SQL they offer. While all of these are discussed in detail in this paper, it is worth briefly explaining that SQL support has two aspects: the version supported (ANSI standard 1992, 1999, 2003, 2011 and so on) plus the robustness of the engine at supporting SQL queries running with multiple concurrent threads and at scale.

Figure 1 illustrates an abbreviated version of the results of our research. This shows various leading vendors, and our estimates of their products’ positioning relative to performance and SQL support. Use cases are shown by the colour of each bubble but, for practical reasons, this means that no vendor/product is shown for more than two use cases, which is why we describe Figure 1 as abbreviated. Thus, for example, we are using “EDW” as shorthand for products that support both transactional look-ups and complex analytics, which are otherwise individual use cases. Also, it excludes vendors targeting OLAP, as the leaders in this market – Jethro Data and Kyvos Insights – have distinct approaches that are not easily compared.

[Figure 1: bubble chart. Axes: SQL (2 to 5) and PERFORMANCE (2 to 5). Key: HTAP, EDW, ML, Lambda. Products shown: IBM, Presto, Kognitio, Pivotal, Spark, MapR, Esgyn, Hortonworks, Actian, Cloudera, Splice Machine.]

Figure 1 – Use cases by performance and SQL support. Use cases include Hybrid Transactional and Analytic Processing (HTAP), a merger of the transactional look-ups and complex analytics (EDW: enterprise data warehouse), combined batch and real-time/streaming analytics (Lambda architectures), and machine learning (ML). OLAP and some other use cases are omitted.

Use cases

We have identified six different use cases for SQL on Hadoop. Some of these overlap one another and there will also be instances where a user wants more than one of these use cases running on the same cluster. However, we believe that the examples detailed provide the bedrock for making decisions about potential solutions.

The main use cases we have identified, in no particular order, are:

1. Transactional look-ups. This will often be combined with other use cases.
2. Hybrid transactional analytic processing (HTAP).
3. Complex queries against large datasets. Typically involving many users. We might describe this as “traditional data warehousing” and, certainly, there are vendors aiming to replace enterprise data warehouses (EDW) via this use case. Often combined with transactional look-ups.
4. Online analytic processing (OLAP). May be either multi-dimensional OLAP (MOLAP) or relational OLAP (ROLAP).
5. To support machine (and deep) learning.
6. A “collapsed” lambda (or kappa) architecture designed to support both batch and real-time (streaming) analytics. Will often be combined with either or both of OLAP and machine learning.

There are several other use cases where you might want to use SQL on Hadoop but, often enough, Hadoop on its own will suffice. These use cases include extract, load and transform (ELT) and archival, as well as (ad hoc) data preparation. The last of these was identified as a use case by one of the vendors, although none of the suppliers we have spoken to – including the identifier – has claimed to target it. The same applies to data discovery and similar use cases, where you would probably be better off relying on an information/data catalogue running on your data lake. One vendor also suggested a use case as an operational data store.

© 2017 Bloor

Offerings

Products in this market tend to fall into one of six categories and in the following lists we have highlighted those products we examine in more detail in this report. The groupings consist of:

• Pure-play open source projects. This category includes Hive, HBase, Tajo, Phoenix, Ignite and Spark. See also the OLAP-based projects below. All of these are Apache projects. Of the less well-known offerings, Phoenix supports on-line transaction processing (OLTP) running against HBase; Ignite is an in-memory computing platform that is commercially supported (and was originally developed) by GridGain. It is typically used as a Hadoop accelerator and/or to provide immediate consistency. Tajo is a big data warehouse. There have been no new releases of Tajo for 18 months, so we suspect that it is defunct.
• Vendor supported open-source projects. This group includes Drill (supported by MapR), Presto (Teradata/Starburst Data), HAWQ (Pivotal) and Trafodion (Esgyn). All of these, again, are Apache projects. Also in this category are Impala (Cloudera) and Hive + LLAP (Hortonworks; LLAP stands for “live long and process” and was previously known as Stinger). Note that Drill does not have to run on Hadoop.
• Traditional data warehousing products that have been used as the basis for SQL on Hadoop platforms. These include IBM Db2 (Big SQL), Oracle, Vertica, Pivotal HDB (HAWQ: effectively a port of Greenplum), Kognitio (which is free-to-use) and Actian VectorH. VectorH is the odd one out here because Actian Vector is a symmetric multi-processing (SMP) solution that has been developed into a massively parallel processing (MPP) environment. All the other products were MPP-based originally.
• Other MPP-based solutions. This category consists of Transwarp and Esgyn. The latter is a descendant of Tandem NonStop, HP Neoview and other HPE-based warehousing developments.
• Specialist offerings. Mostly these are targeted at OLAP environments. In this category are Apache Kylin (MOLAP) and Apache Lens (ROLAP) as well as Kyvos Insights and Jethro Data. Splice Machine is also in this category but has rather broader capabilities (see later). AtScale will compete with products in this category but is a “BI on Hadoop” engine rather than a SQL on Hadoop platform: as such it is not discussed further here.
• Others that are often referred to as SQL on Hadoop engines, but which are not. Included in this category are Splout SQL, which is really about data serving, and Concurrent Lingual, which is used for application development. Druid, which started life as an MDX engine (and which now has limited SQL support), is another data serving product with OLAP capabilities. Apache Calcite is a general-purpose SQL optimiser but not an engine per se. None of the products in this group are discussed in this report.

In the vendor/product section of this report we include short descriptions of many, though not all, of the proprietary products (open source or otherwise), with the exception of Oracle, Vertica and Transwarp, none of which responded to our requests for information. While the omission of Oracle and Vertica is no great loss (a straight line can be drawn across from their traditional products), we would have liked to include details about Transwarp.

Performance benchmarks

A great many vendors in this space have conducted and published benchmarks. Some of these have been validated by third parties, some of them have been conducted by third parties, but the majority have not involved any independent authorities. Although TPC (Transaction Processing Performance Council) tests have typically been the basis for these benchmarks, none of them have been authenticated by TPC. The individual product descriptions that follow outline the results of the various benchmarks that have been performed by different vendors. We will therefore confine ourselves here to general comments.

The first point that we would like to note is that TPC-DS (Decision

To conclude this section – while not all products have been benchmarked and some have been benchmarked against different standards – it is clear that Impala, LLAP, Hive, Spark and so on perform significantly worse than products from vendors with a history in database technology. Moreover, it is much more likely that companies in the latter category will be able to support all of your queries and run them successfully: the level of SQL support from the pure-play, Cloudera or Hortonworks products tends to be limited.

While on the subject of SQL support, it is worth commenting that the level of support for ANSI standard SQL varies widely.