A Bloor Market Report Paper
SQL Engines on Hadoop

Author: Philip Howard
Publish date: December 2017
© 2017 Bloor

"It is clear that Impala, LLAP, Hive, Spark and so on, perform significantly worse than products from vendors with a history in database technology."

Executive summary

Hadoop is used for a lot of different purposes and one major subset of the overall Hadoop market is to run SQL against Hadoop. This might seem contrary to Hadoop's NoSQL roots, but the truth is that there are lots of existing investments in SQL applications that companies want to preserve; all the leading business intelligence and analytics platforms run using SQL; and SQL skills, capabilities and developers are readily available, which is often not the case for other languages.

However, the market for SQL engines on Hadoop is not mono-cultural. There are multiple use cases for deploying SQL on Hadoop and there are more than twenty different SQL on Hadoop platforms. Mapping the latter to the former is not a trivial task, as different offerings are optimised for some purposes but not others.

The key differentiators between products are the use cases they support, their performance and the level of SQL they offer. While all of these are discussed in detail in this paper, it is worth briefly explaining that SQL support has two aspects: the version supported (ANSI standard 1992, 1999, 2003, 2011 and so on) plus the robustness of the engine at supporting SQL queries running with multiple concurrent threads and at scale.

Figure 1 illustrates an abbreviated version of the results of our research. This shows various leading vendors, and our estimates of their products' positioning relative to performance and SQL support. Use cases are shown by the colour of each bubble but for practical reasons this means that no vendor/product is shown for more than two use cases, which is why we describe Figure 1 as abbreviated. Thus, for example, we are using "EDW" as shorthand for products that support both transactional look-ups and complex analytics, which are otherwise individual use cases.
Also, it excludes vendors targeting OLAP, as the leaders in this market – Jethro Data and Kyvos Insights – have distinct approaches that are not easily compared.
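The "version supported" aspect of SQL support is concrete: features such as window functions only entered the ANSI standard with SQL:2003, and they are exactly the kind of syntax that engines with last-century ANSI support reject. A minimal, hypothetical sketch of the distinction, using SQLite (3.25 or later) purely as a stand-in engine and an invented table:

```python
import sqlite3

# Hypothetical data; SQLite stands in for any engine with post-SQL-92 support.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 200), ("west", 50)])

# RANK() OVER (...) is a SQL:2003-era window function; an engine limited to
# SQL-92 would reject it, forcing an awkward self-join instead.
rows = conn.execute("""
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
    ORDER BY region, rnk
""").fetchall()
print(rows)  # [('east', 200, 1), ('east', 100, 2), ('west', 50, 1)]
```

An engine's benchmark coverage tends to track this: TPC-DS queries exercise this newer syntax, which is why (as discussed later) TPC-DS results say something about SQL support as well as speed.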

Figure 1 – Use cases by performance and SQL support. Use cases include hybrid transactional and analytic processing (HTAP), a merger of transactional look-ups and complex analytics (EDW: enterprise data warehouse), combined batch and real-time/streaming analytics (Lambda architectures), and machine learning (ML). OLAP and some other use cases are omitted.

[Figure 1: a bubble chart plotting vendors – IBM, Presto, Kognitio, Pivotal, Spark, MapR, Esgyn, Hortonworks, Actian, Cloudera and Splice Machine – against performance (x-axis) and SQL support (y-axis), with bubble colour keyed to use case (HTAP, EDW, ML, Lambda).]

Use cases

We have identified six different use cases for SQL on Hadoop. Some of these overlap one another and there will also be instances where a user wants more than one of these use cases running on the same cluster. However, we believe that the examples detailed provide the bedrock for making decisions about potential solutions.

The main use cases we have identified, in no particular order, are:

1. Transactional look-ups. This will often be combined with other use cases.

2. Hybrid transactional analytic processing (HTAP).

3. Complex queries against large datasets. Typically involving many users. We might describe this as "traditional data warehousing" and, certainly, there are vendors aiming to replace enterprise data warehouses (EDW) via this use case. Often combined with transactional look-ups.

4. Online analytic processing (OLAP). May be either multi-dimensional OLAP (MOLAP) or relational OLAP (ROLAP).

5. To support machine (and deep) learning.

6. A "collapsed" lambda (or kappa) architecture designed to support both batch and real-time (streaming) analytics. Will often be combined with either or both of OLAP and machine learning.

There are several other use cases where you might want to use SQL on Hadoop but, often enough, Hadoop on its own will be enough. These use cases include extract, load and transform (ELT) and archival, as well as (ad hoc) data preparation. The last of these was identified as a use case by one of the vendors, although none of the suppliers we have spoken to – including the identifier – have claimed to target it. The same applies to data discovery and similar use cases, where you would probably be better off relying on an information/data catalogue running on your data lake. One vendor also suggested a use case as an operational data store.
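The "collapsed" lambda architecture in use case 6 can be sketched in a few lines. A conventional lambda architecture keeps a precomputed batch view and an incremental speed layer in separate systems and merges them at query time; a collapsed engine performs the same merge inside a single platform. A hypothetical illustration (all names and figures invented):

```python
# Illustrative lambda pattern: a batch layer over historical data plus a
# speed layer over recent events, merged when a query arrives. A "collapsed"
# lambda engine does this merge internally on one platform rather than
# across two separate systems.
from collections import Counter

batch_view = Counter({"page_a": 9_000, "page_b": 4_500})  # recomputed nightly
speed_layer = Counter()                                   # updated per event

def ingest(event):
    """Speed layer: incremental update on every streaming event."""
    speed_layer[event] += 1

def query(page):
    """Serving layer: merge the batch view with the real-time delta."""
    return batch_view[page] + speed_layer[page]

for e in ["page_a", "page_a", "page_b"]:
    ingest(e)
print(query("page_a"))  # 9002
```

The operational pain the paper alludes to is everything this sketch hides: keeping two codebases, two clusters and two failure modes consistent, which is what single-platform ("collapsed") offerings aim to remove.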

Offerings

Products in this market tend to fall into one of six categories and in the following list we have highlighted those products we examine in more detail in this report. The groupings consist of:

• Pure-play open source projects. This category includes Hive, HBase, Tajo, Phoenix, Ignite and Spark. See also the OLAP-focused Apache projects, including Apache Lens (ROLAP), below. All of these are Apache projects. Of the less well-known offerings, Phoenix supports on-line transaction processing (OLTP) running against HBase; Ignite is an in-memory computing platform that is commercially supported (and was originally developed) by GridGain and is typically used either as a Hadoop accelerator and/or to provide immediate consistency; and Tajo is a big data warehouse. There have been no new releases of Tajo for 18 months, so we suspect that it is defunct.

• Vendor supported open-source projects. This group includes Drill (supported by MapR), Presto (Teradata/Starburst Data), HAWQ (Pivotal) and Trafodion (Esgyn). All of these, again, are Apache projects. Also in this category are Impala (Cloudera) and Hive + LLAP (Hortonworks – live long and process – previously known as Stinger). Note that Drill does not have to run on Hadoop.

• Traditional data warehousing products that have been used as the basis for SQL on Hadoop platforms. These include IBM Db2 (Big SQL), Oracle, Vertica, Pivotal HDB (HAWQ: effectively a port of Greenplum), Kognitio (which is free-to-use) and Actian VectorH. VectorH is the odd one out here, because Actian Vector is a symmetric multi-processing (SMP) solution that has been developed into a massively parallel processing (MPP) environment. All the other products were MPP-based originally.

• Other MPP-based solutions. This category consists of Transwarp and Esgyn. The latter is a descendant of Tandem NonStop, HP Neoview and other HPE-based warehousing developments.

• Specialist offerings. Mostly these are targeted at OLAP environments. In this category are the OLAP-focused projects mentioned above, as well as Kyvos Insights and Jethro Data. Splice Machine is also in this category but has rather broader capabilities (see later). AtScale will compete with products in this category but is a "BI on Hadoop" engine rather than a SQL on Hadoop platform: as such it is not discussed further here.

• Others that are often referred to as SQL on Hadoop engines, but which are not. Included in this category are Splout SQL, which is really about data serving, and Concurrent Lingual, which is used for application development. Druid, which started life as an MDX engine (and which now has limited SQL support), is another data serving product with OLAP capabilities. There is also a well-known general-purpose SQL optimiser in this space, but it is not an engine per se. None of the products in this group are discussed in this report.

In the vendor/product section of this report we include short descriptions of many, though not all, of the proprietary products (open source or otherwise), with the exception of Oracle, Vertica and Transwarp, none of which responded to our requests for information. While the omission of Oracle and Vertica is no great loss (a straight line can be drawn across from their traditional products), we would have liked to include details about Transwarp.

Performance benchmarks

A great many vendors in this space have conducted and published benchmarks. Some of these have been validated by third parties, some of them have been conducted by third parties, but the majority have not involved any independent authorities. Although TPC (Transaction Processing Performance Council) tests have typically been the basis for these benchmarks, none of them have been authenticated by TPC. The individual product descriptions that follow outline the results of the various benchmarks that have been performed by different vendors. We will therefore confine ourselves here to general comments.

The first point that we would like to note is that TPC-DS (Decision Support) tests are not just an indicator of performance but also of SQL support. TPC-H, on the other hand, is based on SQL-92, which is hardly up-to-date. We are disappointed, therefore, that Actian is focused on TPC-H and not TPC-DS.

The second point is that many tests are done using relatively small datasets and a single processing thread, when what you really want is multiple users running against large sets of data. IBM, for example, has demonstrated that while Spark is perfectly capable of running all TPC-DS queries at small scale, it breaks down as you scale up.

Thirdly, some vendors, notably Hortonworks and Cloudera, have been guilty of publishing partial results: for example, selecting just 15 of the 99 TPC-DS tests (no doubt the best ones) to report on.

To conclude this section – while not all products have been benchmarked, and some have been benchmarked against different standards – it is clear that Impala, LLAP, Hive, Spark and so on, perform significantly worse than products from vendors with a history in database technology. Moreover, it is much more likely that companies in the latter category will be able to support all of your queries and run them successfully: the level of SQL support from the pure-play, Cloudera or Hortonworks products tends to be limited.

While on the subject of SQL support, it is worth commenting that the level of support for ANSI standard SQL varies widely. IBM – not just in Big SQL and Db2, but across its product range – is much the most advanced vendor in this respect. Conversely, there are a number of products whose ANSI support dates back to the last century.

Product suitability

While performance may be a major determinant in buying decisions, it is only relevant when comparing apples with apples, and the products covered in this paper constitute an entire fruit bowl. In this section we therefore match products to use cases.

1. Transactional look-ups. This will often be combined with use case 3 (see above). Various products, often in conjunction with HBase, are suitable here. Notable contenders would be IBM Big SQL (with HBase), Splice Machine (which incorporates HBase into its Lambda architecture), Esgyn and Actian VectorH.

2. Hybrid transactional analytic processing (HTAP). This is a major focus area for both Esgyn and MapR Drill. Splice Machine is also a suitable contender here, though its emphasis is slightly different (more on leveraging transactional data for predictive analytics than embedding analytics into operational applications). The InterSystems IRIS Data Platform also competes here, though it is not based on Hadoop (but is a clustered solution), as do others.

3. Complex queries against large datasets. Typically involving many users. We might describe this as "traditional data warehousing" and, certainly, there are vendors aiming to replace enterprise data warehouses (EDW) via this use case, often combined with transactional look-ups. All the "ported" data warehouses play in this space, as do Actian VectorH, Esgyn and Splice Machine. Kognitio comes out well in the benchmark studies we have investigated.

4. Online analytic processing (OLAP). May be either multi-dimensional OLAP (MOLAP) or relational OLAP (ROLAP). The vendor-based products in this area are much stronger than any open source offerings. Kyvos Insights, Jethro Data and Splice Machine are the vendors to consider.

5. To support machine (and deep) learning. Pivotal, IBM and Splice Machine are the companies most active in this area, but where IBM relies on Spark MLlib, Pivotal is a major contributor to the Apache MADlib project. Splice Machine ships with MLlib.

6. Lambda architectures. Both MapR and Splice Machine are in the business of "collapsing" lambda architectures, to support batch and streaming analytics on a single platform. In the case of Splice Machine, a Spark processing engine is embedded into the platform. In this context it is worth commenting that independent benchmarking has found that tight integration with Spark results in an 11x performance improvement compared to simply connecting to Spark. We would expect an embedded engine to do even better.

There are two other use cases worth commenting on. Esgyn has identified operational data stores as a target use case. More interestingly, MapR Drill supports queries against semi-structured data such as JSON, as well as structured data, and has extended its SQL support to allow this. Competitors to MapR for this sort of functionality tend to come from other environments: the InterSystems IRIS data platform, for example, encompasses the same capabilities and extends to unstructured data.
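The idea behind SQL over semi-structured data, as described for MapR Drill, can be illustrated without Drill itself. In the hypothetical sketch below, SQLite's JSON1 functions stand in for an engine extracting fields from ragged JSON documents at query time; the documents and field names are invented:

```python
import json
import sqlite3

# Not MapR Drill - just an illustration of running SQL over semi-structured
# JSON documents, using SQLite's JSON1 functions as a stand-in engine.
docs = [{"user": "ann", "clicks": 3, "tags": ["a", "b"]},
        {"user": "bob", "clicks": 7}]          # note: ragged, schema varies

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (doc TEXT)")
conn.executemany("INSERT INTO events VALUES (?)",
                 [(json.dumps(d),) for d in docs])

# Fields are extracted at query time ("schema on read") rather than being
# declared up front as fixed columns.
rows = conn.execute("""
    SELECT json_extract(doc, '$.user')   AS user,
           json_extract(doc, '$.clicks') AS clicks
    FROM events
    WHERE json_extract(doc, '$.clicks') > 5
""").fetchall()
print(rows)  # [('bob', 7)]
```

The point of the sketch is the query-time extraction: no column needs to exist in every document, which is what distinguishes this style of SQL support from querying a fixed relational schema.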

Conclusion

SQL on Hadoop is all about horses for courses and, in this paper, we have discussed both the horses and the courses. Table 1 highlights our results. Readers should recognise that you can do OLAP, for example, with any EDW product, but the likes of Kyvos and Jethro will typically provide better performance, hence our recommendations. We have also suggested some SQL but not Hadoop-based vendors that you might like to consider as alternatives to the SQL on Hadoop products, though these are not intended to represent an exhaustive list. Specifically, we have concentrated on scale-out clustered solutions and have omitted products such as IBM Informix or SAP HANA, both of which target HTAP (for example), because they employ architectures that are a long way removed from Hadoop.

Table 1 – Recommendations by use case

Use case        Recommended                                   Others           Non-Hadoop
Transactional   IBM Big SQL, Actian VectorH, Splice Machine   Esgyn
HTAP            Esgyn, MapR Drill                             Splice Machine   InterSystems
Complex         Kognitio, IBM Big SQL                         All EDW
OLAP            Jethro Data, Kyvos                            Splice Machine   AtScale, Druid
M/L             Pivotal, IBM Big SQL                          All EDW
Lambda          Splice Machine, MapR Drill
Mixed data      MapR Drill, Presto                            Some EDW         InterSystems

Product sheets

Actian VectorH

Actian VectorH (Vector for Hadoop) is Actian's Vector SQL product ported to the Hadoop platform. VectorH, like Vector, is an in-memory, columnar database. However, Vector uses a symmetric multi-processing (SMP) architecture that scales up rather than out; VectorH, on the other hand, is, by necessity, a massively parallel processing (MPP) solution that uses Hadoop for clustering purposes. So, this represents more than just a port from one environment to another. It is typically implemented on either a Cloudera or Hortonworks platform, with MapR supported upon request.

While the principal components of the product are based on the Vector engine, it also leverages the query planner and optimiser from what used to be known as Ingres but is now called Actian X. While there is access to Parquet and ORC files, which are treated as external tables, Actian has developed a proprietary storage mechanism on top of the Hadoop distributed file system (HDFS). Apache YARN is supported and available security is both row- and role-based. As its name suggests, one of its major differentiators is the vectorised processing that it supports. It uses positional delta trees (in memory) that minimise the impact of updates while performing query operations. Both Spark and Scala are supported.

Actian has performed benchmark testing comparing the performance of VectorH with Impala, Hive, HAWQ and Spark. VectorH typically outperforms the others by an order of magnitude. However, there are a couple of caveats to be made about these figures. Firstly, the benchmarking was performed during the first half of 2016 and is therefore somewhat out of date. Secondly, these were TPC-H tests rather than the more common TPC-DS, which are usually regarded as more suitable for SQL on Hadoop engines. The latter are more onerous (there are 22 TPC-H queries but 99 in the TPC-DS standard) and do more to test the SQL syntax supported by the various engines.

Strengths
• While the performance figures quoted by Actian are out-of-date and not based on TPC-DS, the results are, nonetheless, impressive. This suggests that VectorH is a serious contender for traditional enterprise data warehousing (EDW) workloads. Indeed, drawing a line through other benchmarks – conducted both by Actian and independently – would suggest that VectorH has probably got the performance edge against all of its rivals.

Threats
• Performance isn't everything and at least some competitive vendors are offering specialised capabilities that go beyond supporting "traditional EDW workloads".
• Actian is targeting the general-purpose EDW market running on Hadoop. However, this is a crowded space, with many traditional EDW vendors also addressing this market. The company has not traditionally been known as a player in the high end of this market, where MPP-based solutions are, so it has a clear marketing issue in getting its presence known.

Recommended for
Transactional look-ups, HTAP. Also appropriate for mixed workload environments that include complex analytics.

Actian, 2300 Geng Rd., Suite 150, Palo Alto, CA 94303
www.actian.com
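The positional-delta-tree idea credited to Vector/VectorH can be sketched as a delta overlay: updates land in a small in-memory structure keyed by row position, and scans merge that delta with the read-optimised column. This is a deliberate simplification, not Actian's implementation – a real positional delta tree also handles inserts and deletes, with positional shifting:

```python
# Illustrative delta-overlay sketch of the idea behind positional delta
# trees: the compressed column stays immutable and read-optimised, while
# updates are absorbed by a small in-memory delta merged at scan time.
column = [10, 20, 30, 40]   # immutable column data (stand-in for compressed storage)
delta = {}                  # row position -> new value (stand-in for the PDT)

def update(pos, value):
    """Cheap update: touches only the in-memory delta, not the column."""
    delta[pos] = value

def scan():
    """Query-time merge of the immutable column with pending deltas."""
    return [delta.get(i, v) for i, v in enumerate(column)]

update(2, 99)
print(scan())  # [10, 20, 99, 40]
```

The design choice this illustrates is why updates have little impact on query performance: reads pay only a small merge cost, and the columnar data never has to be rewritten in place.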

EsgynDB

Esgyn Corporation, the developer of EsgynDB, is based in the United States and China. The company was founded by ex-HP staff who had worked on successive generations of that company's database technology, starting with Tandem NonStop SQL, via HP's enterprise data warehouses, through to EsgynDB, which represents the 4th generation of development. EsgynDB is a SQL on Hadoop engine focused on HTAP (hybrid transactional and analytic processing), supporting mixed workloads and multi-structured datasets. It is a commercial implementation of the Apache Trafodion database, to which Esgyn is the prime contributor. The implication of supporting HTAP is that EsgynDB is suitable for use across a range of environments, including acting as an operational data store or as an enterprise data warehouse, but its primary differentiator is its support for HTAP. The product is available both in the cloud and on premises. It is also, in conjunction with Supermicro, available as an appliance. EsgynDB has been deployed in a variety of areas, including Internet of Things (IoT), Banking and Insurance, Telecom, Manufacturing, Internet Security and Smart Cities.

EsgynDB is a massively parallel processing (MPP) based database that typically leverages Trafodion tables but which also supports Hive, HBase, Parquet, ORC and Avro formats. The product supports extensible user defined functions (UDFs) that allow external environments to be treated as tables. Currently supported external environments include JanusGraph (an open source graph database under the Linux Foundation). In the case of JanusGraph, you can embed code into your UDF so that you can join graph data with data managed directly by EsgynDB. The product also integrates with and leverages technologies such as Hibernate, Spark and various streaming engines.

In terms of other features, the product supports a wealth of capabilities, including secondary indexes, the ability to assign parallel resources as required (part of its workload management), the sorts of high availability that you would expect from a product with a Tandem heritage, load balancing and so on. There is a distributed query manager to support transaction processing and the product is fully ACID compliant. SQL support is based on ANSI 2003, with extensions that support Oracle PL/SQL and Teradata functions, to encourage users to move off those platforms and onto EsgynDB.

From a performance perspective, Esgyn has conducted benchmarks using both TPC-C and TPC-DS. In the case of the former, the product has demonstrated linear scalability for transaction processing between 6 and 13 node deployments, with the latter supporting just over 345,000 transactions per minute. In terms of query performance, EsgynDB runs all 99 TPC-DS queries (which is by no means always true of other SQL on Hadoop engines) with an average performance advantage of 2.7 times versus Impala version 2.2 and 5 times versus Hive 1.2.1. We should add by way of a caveat that benchmarks are moving feasts, as vendors introduce new capabilities, but of course this applies as much to Esgyn as to Cloudera or Hortonworks.

Strengths
• Given its background and history we have no doubt that EsgynDB is robust and richly featured. Performance benchmarks are one thing but the true test is concurrency, and here we would expect Esgyn to have a significant advantage. It is, for example, one of the few SQL on Hadoop engines to provide spill-to-disk capabilities when there is memory pressure.
• Almost all proprietary vendors within the SQL on Hadoop space concentrate on one or another aspect of business intelligence or analytics. While they might have transaction support, this is not an area they focus on. EsgynDB has a major advantage in its support for HTAP.

Threats
• Esgyn is targeting other environments as well as Hadoop. For transaction processing on Hadoop it has similar advantages as for HTAP. However, for purely analytic workloads, it faces much more competition, from both open source and proprietary vendors. The company also claims benefits for use as an operational data store, though we are not sure that there is a significant market for this capability.

Recommended for
HTAP. Also appropriate for mixed workload and other traditional environments.

Esgyn, 691 South Milpitas Boulevard, Suite 117, Milpitas, CA 95035
www.esgyn.com
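The "external environment as a table" UDF idea described for EsgynDB can be illustrated with a toy table function. The names below are hypothetical and this is not Esgyn's actual API; the sketch only shows the pattern of yielding rows from an outside system and joining them against locally managed data:

```python
# Illustrative sketch of a user-defined table function: an external system
# (here a pretend graph database) is exposed as an iterable of rows that the
# engine can join against its own tables. Names are invented for the example.
def external_graph_neighbours():
    """Stand-in for a UDF that pulls edges from an external graph store."""
    yield ("ann", "bob")
    yield ("bob", "carol")

# Locally managed data (stand-in for a table the engine owns directly).
local_accounts = {"ann": 100, "bob": 250, "carol": 40}

# Join the UDF's output with local data, as an engine would in SQL.
joined = [(src, dst, local_accounts[dst])
          for src, dst in external_graph_neighbours()]
print(joined)  # [('ann', 'bob', 250), ('bob', 'carol', 40)]
```

The useful property is that the external source never has to be copied into the warehouse first: rows are produced on demand and joined in place.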

IBM Big SQL

Big SQL is IBM's SQL on Hadoop engine. It used to be accompanied by BigInsights and other "Big" IBM products, but the company's recently announced partnership with Hortonworks means that these other products have now been, or are being, replaced.

Big SQL runs in conjunction with Apache Hive and, optionally (if you want transactional look-ups), Apache HBase. In the latter case, the alternative would typically be to use Phoenix along with HBase, but this would mean separate connections for Hive and HBase, whereas Big SQL does this with one connection, as well as allowing transparent joins across Hive and HBase. Similarly, Big SQL provides a unified security architecture, as opposed to two distinct security models. And finally, it means ANSI standard (2011) SQL support from IBM, as opposed to SQL that does not comply with any ANSI standard if you are using Phoenix.

Apart from that, Big SQL is based on Db2 and leverages Db2 capabilities such as the Db2 optimiser and the Db2 federation capabilities. The product does not just support processing via Spark – many platforms do that – but directly integrates with Spark, which will be useful if, for example, you want to extend your warehouse environment to support machine learning, or if your data lake is to support streaming analytics.

IBM has previously run benchmarks against Hive (on its own), Impala, HAWQ and Spark. These were based on TPC-DS, both on a single stream of queries and for four concurrent queries, in both cases running against a 10 TB database. These were not official results (in fact, no one has posted official TPC-DS results) but "based on" TPC-DS. The most interesting outcome is a claimed 3.5 times performance improvement compared to Apache HAWQ. A more recent benchmark test comparing Big SQL and Spark 2.1, with a 100 TB database, found that Big SQL was 3.2 times faster and used only a third of the resources required by Spark SQL. Moreover, although Spark can execute all TPC-DS queries at smaller scale, it was unable to do so at this level, completing only 83 of the 99 queries.

Strengths
• Big SQL has the most advanced ANSI standard SQL implementation of any product in this market. Most other vendors, when asked, are vague about what they support or mutter about SQL 2003 (if you're lucky). There is also substantial support (over 95%) for Oracle PL/SQL: a capability inherited from Db2, as well as the SQL used by Netezza (IBM).
• Unlike many (not all) providers of SQL on Hadoop engines, IBM is quite clear about its products' strengths and it is firmly focused on complex analytics serving multiple users (which you might think of as traditional data warehousing), which may be extended to include transactional/fast look-up environments when Big SQL is deployed in conjunction with HBase.
• The Db2 heritage for Big SQL means that we would have no qualms about recommending it for high availability, resilience, security, load balancing and all those other capabilities that you would expect from a product with the maturity and longevity of Db2. Big SQL provides both row- and column-level security. This is unlike most competitors, which typically support one or the other but not both.

Threats
• In principle, there should be no particular threats to Big SQL. However, we have the impression that, outside the company's user base, companies do not really think of IBM as being in the Hadoop space. This also applies – despite ample evidence to the contrary – when it comes to open source projects. This is a perceptual issue rather than anything else, but it is real nevertheless.

Recommended for
Transactional look-ups, complex analytics, mixed workload environments, machine learning.

IBM, 1 New Orchard Road, Armonk, New York 10504-1722, USA
www.ibm.com

Jethro Data

Jethro is a SQL on Hadoop engine specialising in what we might call "traditional business intelligence". That is to say, online analytic processing (OLAP) with, necessarily, the ability to drill down through your OLAP cubes to fine level detail, and to do this at scale and with high performance. It is this last point – performance – that is most significant. To get the sort of performance you would like, Jethro offers two primary capabilities: auto-cubes and indexing, where the optimiser will use the former for aggregated queries and the latter for more granular queries or, where that is appropriate, both. Existing applications and queries built using MicroStrategy, Tableau, Qlik and so on, should run against Jethro without change.

In Jethro, all columns and measures (things like "price" are measures) are indexed. In order to ensure performance, indexes are never locked, and the company takes an append-only approach to avoid the need for locking when there is an update to the database. When you add new data, Jethro creates incremental indexes on this data, which are then appended to the original indexes. There is a background index merge process as well as capabilities to identify and handle duplicates.

The way that auto-cubes work is that the first time you get a query from Tableau (say), Jethro creates the relevant cube using indexes. Thereafter, Jethro serves that query from the auto-cube it has created. Moreover, it will also serve variants of that query from the auto-cube, which it does by treating any filters (for example, queries by country) as external to – but associated with – the auto-cube. There are similar processes involved for adding incremental data to a cube as for indexes.

Jethro Data has recently partnered with Hortonworks to extend that company's Enterprise Data Warehouse solution.

We are not aware of any comparative benchmarks comparing Jethro with other SQL on Hadoop engines. However, the company does have two live benchmarks that you can run for yourself, at tableau.jethrodata.com (log in demo/demo) and http://jethrodata.qlik.com/ (no login needed). These run in the cloud with 1TB of raw data: a fact table containing approximately 2.9 billion rows, and six dimensions.

Strengths
• The auto-cubing feature means that you do not have to define your OLAP cubes in advance. Moreover, you don't have to re-define them if they change.
• There are relatively few vendors offering OLAP on Hadoop. Most of the suppliers in this space are massively parallel implementations that focus on mixed workloads: in effect, an enterprise data warehouse on Hadoop. These products are unlikely to match the price/performance of Jethro. Of the few other products that do offer OLAP capabilities, these tend not to offer the sort of indexing capabilities that Jethro does, meaning that drill-down performance will suffer. Also, cubes tend to get very large.
• The partnership with Hortonworks should help to drive adoption.

Threats
• OLAP is old-fashioned and not up with the currently most hyped analytics technologies, such as predictive analytics and machine learning. This can mean that doing the basic things well can get overlooked in favour of the latest flavour of the month. As a company that focuses on doing something traditional as well as it possibly can, Jethro Data is in danger of being overlooked because other technologies may be perceived to be "sexier". We don't agree with this view, but we have to recognise that some people are swayed by what's fashionable.

Recommended for
OLAP.

Jethro, 157 Columbus Avenue, Suite 535, New York, NY 10023
www.jethro.io
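The auto-cube behaviour described above – build a cube the first time an aggregate query arrives, then serve that query and its filtered variants from the cube – can be sketched as a memoised aggregate whose cache key deliberately excludes filters. This is a hypothetical simplification, not Jethro's implementation, and all names and figures are invented:

```python
# Illustrative auto-cube sketch: the first query materialises a small cube;
# later queries, including filtered variants of the same query shape, are
# answered from that cube rather than from the raw rows.
from collections import defaultdict

raw_rows = [("UK", "2017", 10), ("UK", "2016", 7), ("US", "2017", 4)]
cubes = {}  # query shape -> materialised cube

def total_by_year(country=None):
    shape = ("sum_amount", "by_country_year")   # filters excluded from the key
    if shape not in cubes:                      # first query: build the cube
        cube = defaultdict(int)
        for c, y, amount in raw_rows:
            cube[(c, y)] += amount
        cubes[shape] = cube
    out = defaultdict(int)                      # variants reuse the cube,
    for (c, y), v in cubes[shape].items():      # applying filters afterwards
        if country in (None, c):
            out[y] += v
    return dict(out)

print(total_by_year())                # {'2017': 14, '2016': 7}
print(total_by_year(country="UK"))    # {'2017': 10, '2016': 7}
```

Keeping the filter out of the cache key is the essential trick: one materialisation serves the unfiltered query and every per-country variant of it.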

Kognitio
Product Sheet

Kognitio was founded in 1987 as a data warehousing vendor. At that time, the company was known as WhiteCross Systems. In the early 90s the company introduced what we would today call a database appliance, running on proprietary hardware. Key features are that, from the outset, the database employed a massively parallel processing architecture and that it was in-memory. Indeed, given the recent hype about in-memory databases, Kognitio could reasonably claim to be the progenitor of this market.

The history of the Kognitio product has been one of gradually moving away from its proprietary roots as the industry has caught up with its requirements. For example, in the mid-90s the company adopted standard industry chipsets, in the mid-2000s it moved to blade computing and away from its own hardware and, most recently (in 2016), the company ported Kognitio onto Hadoop (which involved changing the storage model so that it would work with Apache YARN), though the stand-alone version of the product is still available. And, moreover, the company has made the database free-to-use (with, optionally, paid-for support).

Kognitio targets complex queries against large datasets. For example, one of its clients has 10,000 Tableau dashboards running against a 9 PB database on a Hadoop cluster (all updating within seven seconds) and over a hundred individual analytic queries (some of which are complex). Needless to say, given the company's longevity, Kognitio has a sophisticated optimiser, workload management, high availability, load balancing and so forth. It supports parallel processing for any supported language, such as R or Python, as well as SQL. Support is provided for JSON, ORC and Parquet. There is a query streaming capability that caters for situations where you don't have enough memory.

From a performance perspective the company ran a series of benchmarks (validated by Enterprise Management Associates) comparing Kognitio 8.1.50, Impala 2.6.0 and Spark 2.0. In each case the same 12 node cluster was used, running the TPC-DS query set for both a single query stream and ten concurrent query streams. In both cases the data volume was set at 1TB. For the single query stream, Kognitio was the only product to complete all 99 queries and it was fastest on 92 of them (Impala was fastest for six queries, Spark for one). For the ten query streams test, Kognitio was fastest for "long running" queries; Impala failed on more than a quarter of all the queries (mostly because it does not support the appropriate SQL syntax) and Spark did not complete fifteen of the queries. Of the 95 queries completed, Kognitio was fastest on all but eight of these. As an example of comparative performance, for the single stream queries that were completed by all vendors (70), Kognitio completed in slightly more than 13 minutes, Impala took longer than 50 minutes, and we gave up counting for Spark when it had taken longer than Impala for just the first thirteen of the seventy queries.

Strengths
• Because Kognitio has always been an in-memory database, it has been expensive. However, that is no longer the case, as memory prices have come down. Moreover, with the company adopting a free-to-use licensing model, Kognitio should be much better placed than it has been historically.

Threats
• Like many other (but not all) vendors in the enterprise data warehousing (EDW) space, Kognitio has ported its product onto Hadoop. This means that the company will be up against many of the same suppliers that it has competed against historically. Unfortunately, these vendors are all larger and better known than Kognitio is and, while this shouldn't be a deciding factor, it often is.

Recommended for
Mixed workload environments that combine transactional look-ups with complex analytics.

Kognitio
3a Waterside Park, Cookham Road
Bracknell RG12 1RB, United Kingdom
www.kognitio.com
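Kognitio has not published the internals of its query streaming capability, but the generic technique it alludes to – completing an aggregation whose working set exceeds memory by spilling partial results to disk and merging them afterwards – can be sketched as follows. All names here are ours and purely illustrative:

```python
import os
import pickle
import tempfile
from collections import defaultdict


def streamed_group_sum(pairs, max_keys_in_memory=2):
    """Aggregate (key, value) pairs whose distinct keys may not fit in
    memory: spill the hash table to disk whenever it exceeds its budget,
    then merge the spilled partials at the end."""
    spills, table = [], defaultdict(int)
    for key, value in pairs:
        table[key] += value
        if len(table) > max_keys_in_memory:
            with tempfile.NamedTemporaryFile(delete=False) as f:
                pickle.dump(dict(table), f)
                spills.append(f.name)
            table = defaultdict(int)
    result = defaultdict(int, table)
    for name in spills:  # merge phase: re-read and combine partial results
        with open(name, "rb") as f:
            for key, value in pickle.load(f).items():
                result[key] += value
        os.unlink(name)
    return dict(result)


print(streamed_group_sum([("a", 1), ("b", 2), ("c", 3), ("a", 4)]))
# {'a': 5, 'b': 2, 'c': 3}
```

The contrast the report draws is with purely in-memory engines that simply fail, rather than spill or stream, once the working set no longer fits.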

Kyvos Insights
Product Sheet

Kyvos Insights, now in version 4.0, is a Hadoop deployment that specialises in multi-dimensional on-line analytic processing (MOLAP) at scale. It supports both MDX and SQL and is available both on-premises and in the cloud. The product works with various business intelligence tools (in principle, any of them) and the company has partnerships with Tableau and Qlik.

The way that Kyvos Insights works is that it builds all possible dimensions for your cube and then distributes these across your Hadoop cluster. Data is pre-aggregated, but this is based on generic principles rather than expected queries. We prefer this: it means that you don't have to rebuild your cubes if you start to get unexpected queries. The product supports incremental updates so that you don't need to rebuild your cubes, just update them. Similarly, the product supports slowly changing dimensions, as well as sliding windows (queries within a time frame). The engine can work in conjunction with YARN and integrates with products such as Sentry and Ranger, as well as LDAP, Kerberos and Active Directory. Active-active load balancing and high availability are provided.

From a performance perspective Kyvos has run benchmarks (based on TPC-DS) using a star schema (the product also supports other schemas, such as snowflakes) with 100 billion rows and a cardinality of 30 million. To make things more equitable the company turned off its caching capabilities, which retain data from previous queries to improve the performance of subsequent ones, so these tests were run "cold". Thirteen queries were run with 1, 10 and 50 concurrent user requests. To cut a long story short: Kyvos was typically two orders of magnitude faster than Impala for 1 and 10 requests. For 50 requests Impala timed out (more than 30 minutes) on nine of the thirteen queries. Performance was also tested against Spark – but this was Spark 1.x – which performed even worse than Impala.

Strengths
• Kyvos Insights specialises in OLAP. There are open source OLAP projects (Apache Kylin [MOLAP] and Apache Lens [ROLAP]) but, as far as we know, there is no commercial support for either of these engines. There are also companies that have built OLAP engines on top of Impala, but these aren't true SQL engines and, in any case, they are dependent on the performance of Impala which, as we have seen, is not impressive.
• The scalability offered by Kyvos is surprising. It has users (not all the same ones) ingesting more than two billion rows per day, with more than 200 billion fact rows, a cardinality of as much as 1.2 billion, more than 300 dimensions and attributes, and a maximum cube size of 20 terabytes. Those are impressive figures.

Threats
• Providers of SQL on Hadoop engines fall into three categories: open source offerings (including those supported by Cloudera, Hortonworks and so forth), data warehouse products ported by big beasts like IBM and Oracle, and specialist providers such as Kyvos. These companies in the middle are in danger of getting squeezed on both sides, though Kyvos has the advantage that it has identified a niche that the major vendors are not targeting.

Recommended for
OLAP, especially when cubes have extreme attributes.

Kyvos Insights
720 University Avenue, Suite 130
Los Gatos, CA 95032
www.kyvosinsights.com

MapR
Product Sheet

Of the big three providers of commercial Hadoop support, MapR has usually – in our opinion – been the most technically advanced. The company's converged data platform has been extended to support a wide variety of workloads.

Strengths
• MapR Drill is the leading (though not the only) open source product in this market to be focused on HTAP.
• The support for semi-structured as well as structured (relational) data is a significant benefit.

Threats
• Like other open source SQL on Hadoop engines, MapR is facing increased competitive pressure from proprietary vendors, and the SQL optimisers that these suppliers can provide will mean that Drill will typically be out-performed when used only with structured data.

Recommended for
HTAP, mixed data analytics, "collapsed" lambda architectures.

MapR
350 Holger Way
San Jose, CA 95134
www.mapr.com
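Returning to Kyvos for a moment: the build-all-dimensions approach described above – pre-aggregating a measure for every combination of dimensions so that subsequent queries become lookups rather than scans – can be sketched as follows. This is a generic MOLAP illustration with hypothetical names; Kyvos's actual cube format and distribution mechanism are not public here:

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate `measure` for every subset of `dimensions`.

    The empty subset () holds the grand total; single-dimension subsets
    hold per-dimension roll-ups; and so on up to the full cube.
    """
    cube = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            agg = defaultdict(float)
            for row in rows:
                agg[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(agg)
    return cube


rows = [
    {"country": "US", "year": 2017, "sales": 10.0},
    {"country": "US", "year": 2016, "sales": 5.0},
    {"country": "FR", "year": 2017, "sales": 7.0},
]
cube = build_cube(rows, ["country", "year"], "sales")
print(cube[("country",)][("US",)])   # 15.0
print(cube[()][()])                  # 22.0 (grand total)
```

A real engine would, of course, prune or compress combinations and distribute the aggregates across the cluster rather than hold everything in one dictionary; the sketch only shows why arbitrary (even unexpected) group-by queries become lookups.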

MapR Drill

Apache Drill is an open source SQL engine that will run on a variety of NoSQL platforms, including Hadoop. It will also run on top of document databases and natively supports JSON, so it is arguable that it is a SQL+ engine, since it supports SQL queries against semi-structured and self-describing datasets (also including CSV and Parquet formats), as well as structured data as tables using either Apache HBase or MapR-DB.

Apache Drill is commercially supported by MapR. And, in that context, it runs on the MapR Converged Data Platform, which includes MapR-FS (POSIX compliant for high performance file storage read/write), MapR-DB (a multi-model NoSQL database), and MapR Streams. This architecture has several important implications. Firstly, it effectively means that MapR can provide a "collapsed" lambda architecture. That is, it will support both batch and real-time (streaming) analytics on the same platform, in an integrated fashion. Secondly, the support for POSIX, combined with traditional analytic capabilities (not least, support for secondary indexes), means that the environment should be well-suited to hybrid transactional and analytic processing (HTAP). Thirdly, there is federation capability built in, so that you can query across from Drill to other environments.

MapR stresses that Drill relies on the Converged Data Platform for a lot of its functionality: consistency, performance, resilience, scalability and so on. However, there is one downside to this, which is that there is no SQL optimiser in Drill. There is an optimiser in MapR-DB – relied on by Drill – but this is not a SQL optimiser. The implication of this is that things like join strategies may not be optimal.

Its SQL support is ANSI 2003 compliant, which is more advanced than many of its open source rivals but not as up-to-date as some of its proprietary competitors. There are extensions to SQL that have been implemented to support semi-structured data. For reference, it is worth noting that Apache Drill is based on Google Dremel.

Pivotal HDB
Product Sheet

Strengths
• Given that Greenplum was first introduced over a decade ago, we would expect HDB to have all the sorts of high availability, security, load balancing and so forth (including ACID compliance) that you would expect from a product with that sort of longevity.
• The MADlib capabilities offered by Pivotal are a major advantage if HDB is deployed for use by data scientists. While there are plenty of vendors that have – in effect – ported a proprietary data warehousing technology onto Hadoop, there are very few that can offer anything comparable to MADlib.

Threats
• While MADlib gives Pivotal a distinct advantage when complex analytics (what we used to call data mining) are required, it is otherwise one of a number of vendors that have ported proprietary data warehousing products onto Hadoop. Nor is it the only supplier to have open-sourced its product. Its competitors are therefore the same as they have always been plus, of course, open-source developments not based on proprietary engines.
Recommended for
Mixed workload environments, machine learning.

Pivotal
875 Howard Street, Fifth Floor
San Francisco, CA 94103, USA
www.pivotal.io

Pivotal HDB is a SQL on Hadoop engine that started life as a massively parallel implementation of Pivotal Greenplum running on Hadoop. It is also a part of the Pivotal Suite, which includes the Greenplum data warehouse and the GemFire in-memory data grid. HDB has been open-sourced as Apache HAWQ and GemFire as Apache Geode.

In addition, and this is a major differentiator for Pivotal, the company also supports Apache MADlib – again, an in-house development (in conjunction with work done at several universities) which Pivotal has contributed to Apache. This provides machine learning capabilities using SQL and runs against PostgreSQL-based databases, which both Greenplum and HDB (HAWQ) are. MADlib processes are parallelised (where relevant) and run within the SQL engine. Around 40 different functions are currently available within MADlib and you can call these from R, Python and Java programs.

Major features of Pivotal HDB include dynamic pipelining, in-memory query processing, an HDFS metadata cache, automatic elastic query execution (where you spin up more resources, as required), dynamic cluster expansion/shrink capabilities when the product is deployed in the cloud, data federation capabilities that allow query processing across external environments, and the use of the Greenplum Orca query optimiser, amongst others.

The product integrates with HCatalog (for interoperability with Hive-based data), Ranger (for security), and both Apache YARN (for resource management) and Apache Ambari (for administration). Pivotal is a significant contributor to the Apache Ambari project and it has extended YARN by providing "Resource Qs" that provide multi-tenancy capabilities as well as workload prioritisation. The product's SQL support is for ANSI standard 2003.

Pivotal has not published any recent benchmark figures. There are some performance figures available from 2014, when its results suggested that it was six times faster than Impala. However, we would not necessarily rely on these figures three years later. Perhaps more interesting was that Pivotal HDB was able to successfully run all TPC-DS queries back in 2014, even when none of the three competitors it benchmarked itself against could complete even a third of those queries. This ability to support all TPC-DS queries has been confirmed by third party benchmarks.
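MADlib's functions run, parallelised, inside the SQL engine. The general pattern that makes this possible in an MPP database is two-phase aggregation: each segment computes a small algebraic summary of only its own rows, and a coordinator merges the summaries. A hedged sketch of the idea (the function names are ours, not MADlib's):

```python
def partial_stats(partition):
    # Phase 1: each segment scans only its local rows and returns a
    # small algebraic summary (count, sum, sum of squares).
    n = len(partition)
    s = float(sum(partition))
    ss = float(sum(x * x for x in partition))
    return n, s, ss


def combine_stats(partials):
    # Phase 2: the coordinator merges the summaries. Because the
    # aggregate is algebraic, the merged result is identical to a
    # single-node computation over all the rows.
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance


partitions = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]  # rows spread over 3 segments
print(combine_stats([partial_stats(p) for p in partitions]))  # mean 3.5, variance ~2.9167
```

Only the three-number summaries move between nodes, never the rows themselves, which is why functions structured this way scale with the cluster.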

Splice Machine
Product Sheet

Splice Machine is a SQL on Hadoop engine that provides a "collapsed" lambda architecture. That is, it provides a single environment that combines traditional batch-based analytics with support for streaming analytics. Conventionally, a lambda architecture requires three different processing engines (batch, streaming and serving) but Splice Machine has "spliced" these together into a single environment: each node in your cluster includes both Apache HBase and Spark engines with underlying HDFS storage (both Apache ORC and Parquet are supported). A cost-based optimiser in each node routes queries appropriately, and this leverages the secondary indexes that Splice Machine supports.

The intention is to support the complete gamut of analytic capability from transactional look-ups (the product is ACID compliant with multi-version concurrency control), through on-line analytic processing (OLAP), to complex and streaming analytics and machine learning (the product ships with Apache MLlib). The product runs both on premises and in the cloud (Amazon S3) and, in the latter case, there are significant "database as a service" capabilities. Splice Machine supports Zeppelin notebooks and you can use traditional analytic tools as well as Scala, Python, R and SQL. One further significant feature is support for what the company calls "timeline tables", which effectively provide time series capabilities.

This last feature is important with respect to Splice Machine's target market, which the company describes as On-line Predictive Processing (OLPP). This can best be described as an inverted form of hybrid transactional analytic processing (HTAP), where HTAP basically does some OLAP type processing and then embeds that into operational data, while OLPP takes transactional data and embeds that into analytics – obviously, based on the name, with the primary intention of supporting processes such as predictive (and prescriptive) analytics.

SQL support is ANSI standard and there is also support for Oracle's PL/SQL extensions. The company has not published any competitive benchmarks but there are some impressive customer performance figures.

Strengths
• There is increasing interest in both lambda and kappa architectures, but they are extremely complex to implement in a do-it-yourself fashion and require very significant skills. Being able to "collapse" these architectures into a single platform is a major advantage for Splice Machine.
• Having Spark internally within Splice Machine should provide significant performance advantages. One independent benchmark of Spark performance found that it performed eleven times better when tightly integrated with a host database rather than just connected to it. We would expect Splice Machine's approach to perform even better.

Threats
• Splice Machine has an impressive range of capabilities. However, given this breadth, there is a danger that the company could lose focus and take too much of a scatter gun approach. We therefore think that it is right for the company to focus on a particular market segment – in this case predictive processing – but it needs to retain that focus going forward.

Recommended for
"Collapsed" lambda architecture, HTAP, machine learning, predictive analytics. Also appropriate for OLAP.

Splice Machine
612 Howard Street, #300
San Francisco, CA 94105
www.splicemachine.com
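The "collapsed" lambda architecture described above can be illustrated in miniature. In a classic lambda design the batch view, the streaming tail and the serving merge are three separate systems; collapsing them means one engine maintains all three. The class and method names below are ours, not Splice Machine APIs:

```python
class CollapsedLambda:
    """Toy model of batch, streaming and serving layers in one engine."""

    def __init__(self):
        self.batch_view = {}  # precomputed aggregates over historical data
        self.recent = []      # streamed events not yet folded into the view

    def ingest(self, key, value):
        # Streaming layer: new events land in the real-time tail.
        self.recent.append((key, value))

    def compact(self):
        # Batch layer: periodically fold streamed events into the view.
        for key, value in self.recent:
            self.batch_view[key] = self.batch_view.get(key, 0) + value
        self.recent = []

    def query(self, key):
        # Serving layer: merge the batch view with the real-time tail,
        # so answers are current even between batch runs.
        live = sum(v for k, v in self.recent if k == key)
        return self.batch_view.get(key, 0) + live


store = CollapsedLambda()
store.ingest("clicks", 3)
store.compact()
store.ingest("clicks", 2)     # arrives after the last batch run
print(store.query("clicks"))  # 5
```

In a do-it-yourself lambda stack each of these three methods would be a separate cluster to deploy, secure and keep consistent, which is the complexity the report says such architectures carry.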

Starburst Presto
Product Sheet

Presto is an open source product. Starburst Data – a spin-off from Teradata – provides commercial support for it and is a major contributor to the Presto project, as are companies such as Facebook, Twitter and Uber. It represents Teradata's – albeit now available via Starburst – SQL on Hadoop offering. This is a markedly different approach from rivals such as IBM, HPE, Oracle, Pivotal and so forth, all of which have ported their relational engines onto Hadoop.

Presto forms a part of the Teradata ecosystem. In particular, it is supported by Teradata QueryGrid, which means that queries can span not just Presto but also Teradata and Teradata Aster implementations. Moreover, Presto is provided free of charge as a part of the Teradata Appliance for Hadoop. It is also worth mentioning Teradata's Think Big acquisition, so the company has significant Hadoop implementation and consulting skills.

Presto runs on both Cloudera and Hortonworks Hadoop distributions but not MapR (partly because MapR is promoting Drill and partly because of the money that MapR wants for certification). The architecture is based around a co-ordinator with worker nodes, plus full in-memory capabilities and vectorised columnar processing. There are strong security capabilities, with support for LDAP and Kerberos. A major upcoming feature is a new cost-based optimiser that is the result of a collaboration between Teradata and Facebook. It has been designed specifically for Presto, as opposed to the Apache Calcite project, which is more of a generic optimiser. Another major feature, previously contributed by Teradata, is spill-to-disk, which is designed to support query processing when you run out of memory. There are a number of other in-memory engines which grind to a halt if you run out of memory. Also planned for a future release are new workload management capabilities with resource groups. There is also an intention to introduce more batch-based capabilities so that you can just run Presto, without requiring Presto to run in conjunction with Hive.

Radiant Advisors ran some benchmarks for Teradata, the results of which were published in Q2 2016. That makes the figures rather out of date. At that time Impala was outperforming Presto more often than the other way around. We would expect, though we cannot confirm, that this situation is now reversed. What we can say is that at that time Presto was capable of running more of the TPC-DS queries than Impala. We understand that Presto will now run all TPC-H and TPC-DS benchmarks, but tests run by other vendors suggest that this is still not the case for Impala.

Strengths
• Teradata – and therefore Starburst – knows what it is doing when it comes to databases that support analytics. We therefore have no qualms about features such as high availability, load balancing, workload management, query optimisation and so on.

Threats
• Teradata (and now Starburst) has taken a different path compared to its major competitors: opting for an Apache-based open source approach, as opposed to porting a proprietary solution onto Hadoop. The downside of this is that some features you would expect in an enterprise data warehouse (EDW) are not there in Presto yet, though we are happy that they will be in the future. Moreover, at least some of its traditional competitors have made their products free to download and use.

Recommended for
Traditional data warehousing environments where there are mixed workloads, including both transactional look-ups and complex analytics.

Starburst
Boston, MA, USA
www.starburstdata.com

About the author

PHILIP HOWARD
Research Director / Information Management

Philip started in the computer industry way back in 1973 and has variously worked as a systems analyst, programmer and salesperson, as well as in marketing and product management, for a variety of companies including GEC Marconi, GPT, Philips Data Systems, Raytheon and NCR. After a quarter of a century of not being his own boss Philip set up his own company in 1992 and his first client was Bloor Research (then ButlerBloor), with Philip working for the company as an associate analyst. His relationship with Bloor Research has continued since that time and he is now Research Director, focused on Information Management.

Information management includes anything that refers to the management, movement, governance and storage of data, as well as access to and analysis of that data. It involves diverse technologies that include (but are not limited to) databases and data warehousing, data integration, data quality, master data management, data governance, data migration, metadata management, and data preparation and analytics.

In addition to the numerous reports Philip has written on behalf of Bloor Research, Philip also contributes regularly to IT-Director.com and IT-Analysis.com and was previously editor of both Application Development News and Operating System News on behalf of Cambridge Market Intelligence (CMI). He has also contributed to various magazines and written a number of reports published by companies such as CMI and The Financial Times. Philip speaks regularly at conferences and other events throughout Europe and North America.

Away from work, Philip's primary leisure activities are canal boats, skiing, playing Bridge (at which he is a Life Master), and dining out.

Bloor overview

Technology is enabling rapid business evolution. The opportunities are immense but if you do not adapt then you will not survive. So in the age of Mutable business Evolution is Essential to your success.

We'll show you the future and help you deliver it. Bloor brings fresh technological thinking to help you navigate complex business situations, converting challenges into new opportunities for real growth, profitability and impact.

We provide actionable strategic insight through our innovative independent technology research, advisory and consulting services. We assist companies throughout their transformation journeys to stay relevant, bringing fresh thinking to complex business situations and turning challenges into new opportunities for real growth and profitability.

For over 25 years, Bloor has assisted companies to intelligently evolve: by embracing technology to adjust their strategies and achieve the best possible outcomes. At Bloor, we will help you challenge assumptions to consistently improve and succeed.

Copyright and disclaimer

This document is copyright © 2018 Bloor. No part of this publication may be reproduced by any method whatsoever without the prior consent of Bloor Research. Due to the nature of this material, numerous hardware and software products have been mentioned by name. In the majority, if not all, of the cases, these product names are claimed as trademarks by the companies that manufacture the products. It is not Bloor Research's intent to claim these names or trademarks as our own. Likewise, company logos, graphics or screen shots have been reproduced with the consent of the owner and are subject to that owner's copyright. Whilst every care has been taken in the preparation of this document to ensure that the information is correct, the publishers cannot accept responsibility for any errors or omissions.

A Bloor Market Report Paper
© 2017 Bloor

Bloor Research International Ltd
20–22 Wenlock Road
LONDON N1 7GU
United Kingdom

Tel: +44 (0)20 7043 9750
Web: www.Bloor.eu
email: [email protected]