Cognizant 20-20 Insights

In-Memory Computing: Powering Enterprise High-Performance Computing

To succeed in today’s digital era, organizations must bring the next wave of hyperscale computing into mainstream business by considering in-memory computing technologies that not only bolster their large-scale data processing capabilities but also accelerate the transformation of raw information into applied knowledge.

Executive Summary

Traditional high performance computing (HPC)/supercomputing, analytics and mainstream real-time/batch computing are quickly converging. Mainstream workloads are crossing over into the high performance computing arena, demanding faster analytics/batching, resource-intensive computations and algorithms. To succeed in today’s accelerating digital world, enterprises must collect and analyze mind-boggling amounts of data, in real time, and at ever-faster speeds that most legacy enterprise HPC technologies and systems were not originally designed to accommodate.

In our view, organizations need to embark on what we call Enterprise HPC 2.0. This term refers to the ecosystem that uses the latest commodity-hardware-based hyperscale grid technologies such as in-memory computing (IMC), compute and data grid technologies, streaming analytics, graph analytics, etc. These work in conjunction with infrastructure advancements such as solid state drive (SSD)-enabled technology, GPGPU acceleration, general purpose InfiniBand interconnect technology, etc. that enable IT organizations to fast-track enterprise computing to better serve the ever-growing data needs of the business.

Significant enthusiasm is building around the IMC paradigm for large-scale data analysis. Historically, in-memory grid technologies were primarily data-focused and used by organizations for distributed caching patterns to achieve low latency reads of critical transactional data. However, IMC technology is progressively emerging as a key empowering agent for enterprises seeking to accelerate their real-time decision-making ability and agility by enabling Web-scale data processing, a capability necessary for staying relevant and competitive in today’s digital era.

IMC’s impact is typically felt where organizations are creating new and more innovative ways of working. A dramatic reduction in memory hardware costs also favors the growth of IMC technologies. However, several factors continue to slow adoption at the enterprise, such as a fragmented technology and vendor landscape, a lack of commonly agreed upon industry standards, scarcity of skills and still-emerging industry best practices.

Given that the technology remains in its adolescence, the selection of the right IMC technology is critical to any strategic digital business transformation decision. Soaring enterprise workloads and the use cases that make use of in-memory processing are informing key decisions around IMC technology platform selection.

A blind jump into the IMC technology valley will not yield durable value. It requires clear and effective analysis and understanding of workloads and business priorities, with a goal to increase scalable performance and competitive benefits for the business. This entails skilled experts to perform a focused evaluation. Furthermore, the multitude of new and emerging products makes it extremely challenging to select the right product and approach.

However daunting this decision may seem, it is of utmost importance for organizations to use IMC technology to help address their ever-mounting high-performance and low-latency processing needs across the enterprise.

This white paper summarizes the features and benefits of using IMC for large-scale data-set aggregations using multiple popular IMC approaches. The paper presents results from an internal study in which we created an evaluation scenario to compare various IMC approaches/technology architectures. The study results establish that simple migration to an IMC technology yields performance levels 13 times greater for a given batch workload previously implemented using a disk-based architecture. This paper not only highlights the importance of embracing the IMC agenda for enterprise workloads but also offers a formal methodology for choosing the most appropriate IMC platform to fit given business needs.

cognizant 20-20 insights | november 2015

In-Memory Computing: A Market Check

Effective use of IMC technology along with a clear strategy for adoption can help enterprises reap multiple benefits. Figure 1 lists some of the key use cases across specific industries. While this is just an indication, the possibilities are abundant and are not limited to the specified list.

In-Memory Computing (Enterprise HPC 2.0)

Telecom
■ Real-time ads placements.
■ Real-time sentiment analysis.

Insurance
■ Faster claim processing & modeling.
■ Faster actuarial science.

Manufacturing
■ Inventory management.
■ Predictive analytics to avoid unplanned downtime.

Retail
■ Real-time in-store analytics.
■ Fast real-time loyalty offers.

Healthcare
■ Faster medical imaging processing.
■ Genome analysis.

Banking & Financial Services
■ Fraud detection.
■ Real-time trading decisions.
■ Faster reporting.

Figure 1

There have been rapid innovations in the IMC space recently to enable faster computation and processing speeds. These include Hadoop MapReduce, a batch processing framework that has added support for an in-memory file system called Tachyon. In addition, IBM has added Apache Spark, an IMC system, to its z Systems to bring analytics to mainframes. Also, SQL Server 2016 Community Technology Preview 2 adds IMC power. This has led to the availability of a plethora of IMC technology-based products. However, these products can be classified into various segments, based on their inherent architecture and technological approaches. Moreover, not every IMC system is applicable to every type of enterprise workload. It is therefore imperative to have a clear understanding of the pros and cons of each of these system types in order to effectively select and utilize IMC systems and reap the business benefits.

IMC technology has evolved from its earliest avatar (distributed caching) to today’s integrated in-memory platform that provides storage, compute and transactional services for large-scale data sets. These systems fall under the pure-play IMC technologies category. The “alternate IMC” segment applies to products such as Apache Spark, which, in our view, does not represent all-encompassing in-memory technology in the strict sense since it does not provide a platform for storing large-scale data. However, it provides a processing platform for large-scale in-memory computing, is said to provide performance up to 100 times faster for certain applications [1] and is being endorsed by IBM [2] and Amazon Web Services [3].

Figure 2 illustrates the evolution of IMC technology, some of the popular products under each segment and the typical workloads for which they are best used.

Given the rapid pace of innovation, the IMC product landscape requires the latest skills and a thorough understanding of a specific IMC system’s architectural underpinnings to validate its fit and effective use for a given enterprise workload. Furthermore, with the multiple options available, enterprises can find it difficult to make the best choice and use of an IMC technology to satisfy their high performance computing needs.

To address these challenges, we at the Cognizant Hyperscale Computing (HPC) Lab have launched a structured methodology to help enterprises realize value from the next wave of hyperscale computing using Enterprise HPC 2.0, which leverages in-memory computing grids.

IMC Technology’s Progression

Pure Play IMC

Distributed Caches: A cache that partitions its data among all cluster nodes.
■ Memcached
■ Ehcache
■ Infinispan (JBoss)
■ Pivotal GemFire
Typical workload: Distributed key/value cache for low latency access.

In-Memory Data Grid (IMDG): A data fabric across a large cluster of servers for distributed in-memory storage and management of large data sets.
■ Pivotal GemFire XD
■ Oracle Coherence
■ GigaSpaces XAP
Typical workload: For real-time big data initiatives, handling HPC payloads along the lines of MapReduce, MPP with partial SQL support.

In-Memory Database (IMDB): An RDBMS that stores data in memory instead of on disk.
■ SAP HANA
■ Oracle Exalytics
■ Exadata
■ MS SQL2014
Typical workload: In-memory high speed alternative for existing disk-based RDBMS with full SQL support, with no change to application.

In-Memory Data Fabric (IMDF): A next-gen platform that integrates IMDG with IMCG and provides additional features like CEP, streaming, etc.
■ Apache Ignite (GridGain)
Typical workload: For a single integrated platform for real-time big data management and computing, handling new HPC payloads such as streaming, CEP.

Alternate IMC

In-Memory Compute Grid (IMCG): A platform for computing and transacting on large-scale data sets in parallel.
■ Apache Spark
Typical workload: For in-memory computation and processing of data stored in disks.

Figure 2
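Figure 2 describes a distributed cache as one that partitions its data among all cluster nodes. The following is a minimal sketch of that idea only, assuming simple hash-modulo placement; the products named in Figure 2 use their own partitioning and rebalancing schemes, which this paper does not describe. The `PartitionedCache` class, the node names and the sample keys are hypothetical, and plain dictionaries stand in for remote servers.

```python
import hashlib


class PartitionedCache:
    """Toy key/value cache that spreads keys across a fixed set of 'nodes'.

    Each node is a local dict; a real data grid would place these on
    separate servers. Hash-modulo placement is an illustrative assumption.
    """

    def __init__(self, node_names):
        if not node_names:
            raise ValueError("at least one node is required")
        self.node_names = list(node_names)
        self.nodes = {name: {} for name in self.node_names}

    def node_for(self, key):
        # A stable hash keeps placement identical across runs
        # (Python's built-in hash() is salted for strings).
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.node_names)
        return self.node_names[index]

    def put(self, key, value):
        self.nodes[self.node_for(key)][key] = value

    def get(self, key, default=None):
        return self.nodes[self.node_for(key)].get(key, default)


cache = PartitionedCache(["node-1", "node-2", "node-3"])
for customer_id in range(1000):
    cache.put(f"customer:{customer_id}", {"id": customer_id})

# Number of entries that landed on each node.
sizes = {name: len(store) for name, store in cache.nodes.items()}
print(sizes)
```

Because every read and write touches only the node that owns the key, lookups stay local to one partition, which is the property that makes such caches attractive for low latency access.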

IMC Technology Selection Process

IMC Assessment Methodology
Stages: Establishment (Stage I), Refinement (Stage II)
Steps: 1 Discovery, 2 Analysis, 3 Recommendation, 4 Planning

Figure 3

IMC Value Creation: Methodology

A clear process, as well as a framework, is required to establish the business goals and successfully determine the best-fit IMC technology. This is vital to garner the utmost value from an IMC-led transformation. Figure 3 depicts our process for establishing and identifying the right IMC product for the business.

Step 1: Discovery

The business use cases and the workloads to be implemented via IMC technology play a crucial role in the selection of the products. So first the workload is chosen and key goals for implementation are defined.

For this white paper, we studied a retail customer analytics workload previously processed on a modern scalable batch model using Apache Pig, a Hadoop MapReduce-based technology, which has a disk-based architecture. The nature of the technology used for this implementation restricted the solution to an offline, batch-based system. To be better prepared to handle the disruptive nature of consumer behavior, where latency implies loss of business, we preemptively wanted an alternative solution to support faster and/or near-real-time performance and support for the customer’s customers. We devised an internal study to transform the batch workload using multiple IMC technologies and successfully applied appropriate IMC technology to make it faster.

Next, we defined the key use cases that the workload requires, which become the input for the IMC system evaluation matrix. For quick development of the use case and benchmarking, we wanted the following core features to be readily and easily supported by the product, apart from the in-memory caching features normally available with such products:

• Bulk data loading.
• SQL support for easy and fast retrieval of data with conditions.
• SQL support for joining multiple data sets based on criteria.
• Support for creating new tables/data sets dynamically on the fly with data from other tables/data sets.
• Support for stored procedures/user-defined functions/MapReduce to handle very specific aggregations.
• In-memory distributed computation capabilities.

Step 2: Analysis

Second, we needed to ascertain the segment of IMC technology that would best suit the workload and identify a potential list of IMC systems from the category that readily support the evaluation criteria for specific use cases. This list is carefully chosen after deliberation with the enterprise’s business and architect stakeholders.

We then performed deep-dive fit and architectural analysis on the selected list and determined the best-fit match based on the aforementioned evaluation criteria. From the output of this analysis, the final list of IMC systems that closely fit the requirements was determined. Further proof-of-concept, proof-of-technology and benchmarking were performed on the final list of IMC systems to validate, establish and recommend the best-fit IMC system for a given workload.
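The core features listed under Step 1 can be made concrete with a small stand-in. The sketch below uses Python’s built-in `sqlite3` module with an in-memory database purely for illustration; it is not one of the IMC products evaluated in this study and it is not distributed. The table names, columns, sample rows and the `spend_band` function are all hypothetical. It walks through bulk loading, conditional retrieval, a join with aggregation, creating a new data set from existing ones, and a user-defined function.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM; sqlite3 is only a stand-in here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT)")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")

# Bulk data loading (hypothetical sample rows).
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "loyalty"), (2, "loyalty"), (3, "guest")],
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, 20.0), (1, 35.5), (2, 10.0), (3, 99.0)],
)

# SQL retrieval with conditions.
large = conn.execute(
    "SELECT customer_id, amount FROM purchases WHERE amount > 30 ORDER BY amount"
).fetchall()


# User-defined function that can be called from inside a query.
def spend_band(total):
    return "high" if total >= 50 else "low"


conn.create_function("spend_band", 1, spend_band)

# Join plus aggregation, materialized as a new data set created on the fly.
conn.execute(
    """
    CREATE TABLE customer_spend AS
    SELECT c.customer_id, c.segment, SUM(p.amount) AS total_spend
    FROM customers AS c
    JOIN purchases AS p ON p.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
    """
)

summary = conn.execute(
    "SELECT customer_id, segment, total_spend, spend_band(total_spend) "
    "FROM customer_spend ORDER BY customer_id"
).fetchall()

print(large)
print(summary)
```

A candidate IMC product can be checked against the same list by attempting the equivalent of each step with its own loader, query language and extension mechanism.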

And so, in our case, we selected an initial list of potential IMC products from the IMDG, IMDF and alternate IMC segments, as we needed the capabilities like that of MapReduce to handle specific aggregations demanded by the chosen workload. Distributed caching systems lack these features and an IMDB system like SAP HANA that primarily supports SQL workloads was not the right fit in this case.

Figure 4 lists the IMC systems selected. As an internal study, we chose a list of products rated as top vendors and leaders in the given segment by various leading analysts from a good mix of commercial and open-source products.

Fitment Analysis

Next, we performed a comprehensive product comparison and weighted scoring and ranking model on 20 different attributes and dimensions based on the specific list of features that were most essential for quick development and benchmarking of the use case, as listed in Figure 5. This methodology helped us to quickly shortlist one data grid system each from the commercial and open source categories for our final evaluation. In-memory data grids offer many other useful features. IMC vendors have developed unique selling propositions for their products that need to be compared, analyzed and leveraged on a case-by-case basis.
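The weighted scoring and ranking step can be sketched as follows. The three category weights are the ones listed in Figure 5; everything else is an assumption made for illustration. In particular, the study scored individual attributes, whereas this sketch rates each category as a whole on a 0.0 to 1.0 scale, and the product names and ratings are placeholders, not the study’s actual scores.

```python
# Category weights from Figure 5 (percent of the total score).
WEIGHTS = {
    "Features": 60,
    "System Environment": 25,
    "Dev Environment Setup": 15,
}


def weighted_score(ratings):
    """Combine per-category ratings (0.0 to 1.0) into a score out of 100."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(WEIGHTS[category] * ratings[category] for category in WEIGHTS)


def rank(products):
    """Return (name, score) pairs sorted from best to worst."""
    scored = [(name, weighted_score(ratings)) for name, ratings in products.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical ratings for placeholder products; NOT the study's scores.
candidates = {
    "Product A": {"Features": 1.0, "System Environment": 1.0, "Dev Environment Setup": 0.5},
    "Product B": {"Features": 0.5, "System Environment": 1.0, "Dev Environment Setup": 1.0},
    "Product C": {"Features": 0.75, "System Environment": 0.5, "Dev Environment Setup": 0.5},
}

ranking = rank(candidates)
for name, score in ranking:
    print(f"{name}: {score:.1f}")
```

Because Features carries 60 percent of the weight, a product that is strong on tooling but weak on features (Product B here) ranks below one with the opposite profile.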

Establishing the Short List

Pure-Play IMC Technology

Commercial
■ Pivotal GemFire XD
■ Oracle Coherence
■ GigaSpaces XAP

Open Source
■ Apache Ignite
■ Infinispan (from JBoss)
■ Hazelcast

Alternate IMC Technology

Others
■ Apache Spark

Figure 4

Scoring the Requirements

Category | Weightage Percent | Criteria
Features | 60% | Bulk Data Loading, SQL Queries Support, Stored Procedures Support, Dynamic Data Set Creation, Txn Support, UDF Support, SQL Joins, Sub Queries, JDBC Driver, Caching Patterns (Side Cache, In-line Cache), Replication, Guaranteed Delivery, Change Data Capture, Cloud Integration
System Environment | 25% | Application Server (Tomcat/Jetty) Integration, Administration Consoles Availability, Monitoring/Management Consoles Availability, HA & Fault Tolerance, Deployment & Configuration Setup Speed
Dev Environment Setup | 15% | Programming Language Support (.Net/Java), Client SDKs/APIs Support, Spring Data Support

Figure 5

The final considerations were based on the score ratings depicted in the following two product comparison scoring charts. Figure 6 shows a comparison between three commercial data grids and a comparison between three open-source data grids selected from the previous step, as depicted in Figure 4.

The Comparative Matrix

[Two bar charts: Open Source IMC Product Comparison (Apache Ignite, Hazelcast, JBoss Infinispan) and Commercial IMC Product Comparison (GigaSpaces XAP, Oracle Coherence, Pivotal GemFire XD). Each chart plots a product’s score for the Dev Environment Setup, System Environment and Features categories.]

Figure 6

Analysis Results

For the final benchmark and evaluation, we chose Apache Spark as the first product, for its reputation as the next best IMC technology to replace the Hadoop MapReduce framework. From the scoring process, from the commercial category we selected Pivotal GemFire XD (the community version of GemFire is now available as Apache Geode); the third product, chosen from the open source category, was Apache Ignite. Both of these products scored the highest as the potential best-fit technology to meet our needs (i.e., the other compared products did not support straightforward SQL joins or subqueries).

We followed this with a detailed proof-of-concept (PoC) and proof-of-technology (PoT) approach and compared the various aspects of the architectures of the three IMC systems selected. We then considered their features, differences and relevance for supporting the large-scale data aggregation required by the use case and validated this with a benchmarking process.

Performance Benchmarking

An identical computing cluster consisting of three nodes was provisioned using the Cognizant Hyperscale Application Platform, which allows for fast setup and deployment and provides monitoring facilities to gather the benchmark results. The system detail of each node and the IMC software details are shown in Figure 7.

The three systems were then configured with the default cluster settings to determine the as-is performance of the IMC systems compared with traditional Hadoop MapReduce (MR) using Apache Pig on YARN 2.4.0. For all three systems, the only setting change we performed was to increase the IMC system process’s memory parameters (JVM) such that the total cluster heap memory size was 250 GB for the in-memory data cache.

Node Details

Disk Space (TB) | RAM (GB) | CPU Cores | CPU Clock Speed | Operating System
2 | 128 | 32 | 2.6 | CentOS release 6.5

IMC System | Version
Apache Spark | 1.3.1
Apache Ignite | 1.2.0-incubating
Pivotal GemFire XD | 1.4.1

Figure 7

Benchmark Task

Our study was to compare a batch workload, which performed a good mix of various computations to create new data sets, with computed fields based on aggregations performed in previous steps. The original data was persisted in four different structured data sets with relational integrity between them based on certain attributes/fields. The study was done on 50 GB of data with 500 million records using the traditional MR mode and compared with the twin approaches: using the alternate IMC product Apache Spark and using IMDG NewSQL products.

Benchmark Execution

We executed each task three times for each IMC system and reported the average of the trials. Each system executed the benchmark tasks separately to ensure exclusive access to the cluster’s resources. During the tests, it was found that Apache Ignite, unlike the other three systems, did not provide out-of-the-box support for bulk ingestion of data from CSV files and was unable to handle ingestion beyond 1 GB of data in a stable manner with its default cluster environment settings. This prevented us from testing the system for task executions.

Results

Figure 8 depicts the overall performance numbers of the three systems benchmarked under different task scenarios.

It is important to note that performance tuning was not considered in our study. For optimal performance of each system, the system configuration parameters must be tweaked based on data size, workload types, hardware capacities, resource utilizations, etc. The metrics shown in Figure 8 would therefore change based on the system tuning and optimization techniques used. However, we expect only the execution times to be faster and the relative performance rating of these systems to be equivalent when measured against each other.

Step 3: Recommendation

Third, after creating PoCs and performance-related benchmarks, we can easily derive, validate and recommend the best-fit IMC system for any given workload. We can also consider where these technologies would potentially give the most durable benefit for enterprise workloads by performing such detailed analysis of their architectural aspects.

For the current workload, we established key findings for each IMC system, as shown in Figure 9. The results provide evidence that using IMC technology accelerates computational performance, which enterprises can harness after due diligence and consideration. IMC technology can considerably improve the overall processing times, from data loading to execution. For the given use case and data load, processing times improved 13-fold by simply replacing the MapReduce-based batch system with an IMC technology. We found that Apache Spark was best suited for this particular scenario.
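The protocol described under Benchmark Execution (run each task three times and report the average) can be sketched as follows. The `run_trials` helper and the `dummy_aggregation` workload are hypothetical stand-ins; the harness actually used in the study is not described in this paper.

```python
import statistics
import time


def run_trials(task, trials=3):
    """Run `task` several times and return (mean_seconds, all_timings)."""
    timings = []
    for _ in range(trials):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), timings


def dummy_aggregation():
    # Placeholder workload: bucket 200,000 integers into ten groups and sum them.
    totals = [0] * 10
    for value in range(200_000):
        totals[value % 10] += value
    return totals


mean_seconds, timings = run_trials(dummy_aggregation, trials=3)
print(f"mean of {len(timings)} trials: {mean_seconds:.3f} s")
```

Keeping the individual timings alongside the mean makes it easy to spot an outlier run, for example one affected by a cold cache, before the average is reported.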

Performance Comparison

Workload Operations Mix (percent of workload)
Aggregations/Computations | Data Set Joins | Data Set Select/Create | Data Set Filters
50% | 30% | 10% | 10%

Data Set Metrics
Input Data Size | Input Records Count (4 datasets) | Output Data Size | Output Records Count (1 denormalized view)
50G | 500 mil | 150G | 300 mil

Performance Metrics
Pre-IMC Total Execution Time | Post-IMC Execution Time | Performance Improvement by Apache Spark
13hrs 15min | 1hr 6sec | 13x

[Bar chart: Data Loading Times (50GB); y-axis: Time Taken (minutes); x-axis: Apache Pig, Pivotal GemFireXD, Apache Spark.]
[Bar chart: Task Execution Times (50GB); y-axis: Time Taken (hours); x-axis: Apache Pig, Pivotal GemFireXD, Apache Spark.]

Figure 8
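As a quick check of the headline figure, the 13x improvement follows directly from the two execution times reported in Figure 8. The sketch below only redoes that arithmetic; the variable names are ours, and it assumes the pre-IMC time is the Apache Pig baseline and that "1hr 6sec" means one hour and six seconds.

```python
from datetime import timedelta

# Execution times as reported in Figure 8.
pre_imc = timedelta(hours=13, minutes=15)   # disk-based batch baseline
post_imc = timedelta(hours=1, seconds=6)    # after moving to Apache Spark

speedup = pre_imc / post_imc                # timedelta / timedelta gives a float
time_saved = pre_imc - post_imc

print(f"speedup: {speedup:.1f}x")
print(f"time saved per run: {time_saved}")
```

The ratio comes out slightly above 13, which matches the rounded 13x stated in the figure.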

Functional Findings

Pivotal GemFire XD
■ Ideal for low latency transactional and operational workloads.
■ Easy to implement.
■ Easy to administer and monitor.
■ Extensive SQL support.
■ Not ideal for analytical and predictive workloads in stand-alone mode.
■ Lacks support for running iterative loops based on a large number of keys from a specific collection.
■ Processing times deteriorate due to missing feature.

Apache Spark
■ Ideal for iterative data analysis, caching intermediate data for real-time querying.
■ Ideal for live stream analytics and predictive workloads also involving machine learning.
■ Easy to implement.
■ Not ideal for transactional processing in stand-alone mode.
■ Rudimentary management and monitoring consoles.
■ Lacks support for in-memory data storage.

Apache Ignite (incubating)
■ Ideal for big data analytics, fraud detection, risk analytics, customer intelligence.
■ Single integrated platform with additional capabilities such as Compute Grid, Service Grid, CEP Streaming.
■ Nascent stage and requires maturing from incubation status.
■ No out-of-the-box CSV streamer for bulk data ingestion.
■ Large data loading times suffer due to missing feature.
■ Not so easy to implement.

Figure 9

Step 4: Planning

Finally, with the knowledge and validation achieved in the previous steps, we can then successfully plan and create an effective IMC roadmap.

Key Recommendations

Our analysis establishes that IMC is the future of computing and a key enabling technology for enterprise HPC workloads that require analytical, predictive and cognitive capabilities.

As such, we recommend that:

• Although technology maturity is still uneven, most decision-makers must realize that IMC technologies and architectures are well positioned to be adopted and utilized for their mainstream businesses.
• Application development and other IT leaders must look at IMC technology to support a wide range of use cases including batch, analytics, transaction processing and event processing rather than limiting the technology to distributed caching applications.
• Organizations would benefit by shifting to IMC technology when they need to reengineer established applications to increase their performance and scalability for fast transactional data access (e.g., inventory management, financial reference data, real-time transactional data) or to offload workloads from legacy systems performing heavyweight offline calculations (e.g., pattern analysis, trade reconciliation, number crunching) or real-time stream processing (e.g., real-time analytics, continuous calculation, fraud detection, click-stream analytics).
• When opting for IMC systems from the open-source model, one way to proceed in a fail-proof manner is to conduct a PoC and a PoT to validate the system and then adopt the commercial counterpart of the same system to ensure stable system support.

Even though our study was limited to three IMC systems, we recommend that enterprises consider a broader range of products for initial evaluation. This should be based on criteria critical to the business such as available expertise, business drivers for IMC adoption, preference for IMC appliance model, cloud support, product support for post-implementation, mega-vendors, small-size vendors and newer open-source options for open integration. All of these considerations are critical to the evaluation matrix. This should be accompanied by the deep-dive-comparison scoring model approach similar to that which we followed on a list of parameters such as most significant use cases, workload patterns of use cases, short-term and long-term goals, ability to realize ROI in next three to five years, etc.

A PoC/PoT on shortlisted products would further reinforce the merits/demerits of any evaluated product. This would help the enterprise to make an informed decision to adopt a new IMC technology that creates impact for their business.

Looking Forward

Although in-memory technology has been around for many years, the latest advancements around scale-out architecture, increased automation and reduced memory costs have increased the technology’s appeal to all enterprises. IMC innovation continues unabated across the whole spectrum of IT market segments, from hardware to application infrastructure to packaged business applications. New in-memory technologies can support new and complex workloads that organizations can confidently apply to achieve competitive advantage. While we do not advise general replacement of all workloads and traditional approaches by IMC technology, our study suggests that organizations can reap a high reward with the technology if the platform is properly vetted, selected and deployed. So, if you ask us, “What technology can accelerate data processing 10 times and deliver real-time business insights and information with high performance and low latency?”, our answer would be Enterprise HPC 2.0 and in-memory computing technology.

Footnotes

1 Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion, “Shark: SQL and Rich Analytics at Scale,” June 2013.

2 http://www.firstpost.com/business/ibms-apache-spark-push-plans-put-spark-bluemix-open-tech-centre-2296260.html.

3 http://searchaws.techtarget.com/news/4500248624/Amazon-Elastic-MapReduce-moves-forward-with-Apache-Spark.

References

• “Taxonomy, Definitions and Vendor Landscape for In-Memory Computing Technologies,” Gartner report.
• “Hype Cycle for In-Memory Computing Technology, 2014,” Gartner report.
• Noel Yuhanna, “Market Overview: In-Memory Data Platforms,” Forrester report, December 26, 2014.

About the Author

Archana Rao is a Senior Technology Architect within Cognizant HyPerscale Computing Lab, a unit of the Cognizant Technology Labs business unit. She has 11-plus years of cross-industry IT experience developing and providing solutions, focusing on architecture and design of enterprise high performance computing (HPC) applications using various compute and data grid technologies such as Hadoop, Windows HPC, in-memory computing, search grids and NoSQL. Archana’s focus is on business enablement and transformation through HPC technology and architecture, where she has consulted with many clients implementing strategic technology transformation initiatives. She holds a B.E. in electrical engineering and electronics from University of Madras, Chennai. Archana can be reached at [email protected] | : @ArchanaRA0.

Acknowledgment

Special thanks to Senthil Ramaswamy Sankarasubramanian, Director, Cognizant HyPerscale Computing Lab, a unit of Cognizant Technology Labs, for his invaluable feedback during the course of writing this paper.

About Cognizant

Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world’s leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 100 development and delivery centers worldwide and approximately 218,000 employees as of June 30, 2015, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.

World Headquarters
500 Frank W. Burr Blvd.
Teaneck, NJ 07666 USA
Phone: +1 201 801 0233
Fax: +1 201 801 0243
Toll Free: +1 888 937 3277
Email: [email protected]

European Headquarters
1 Kingdom Street
Paddington Central
London W2 6BD
Phone: +44 (0) 20 7297 7600
Fax: +44 (0) 20 7121 0102
Email: [email protected]

India Operations Headquarters
#5/535, Old Mahabalipuram Road
Okkiyam Pettai, Thoraipakkam
Chennai, 600 096 India
Phone: +91 (0) 44 4209 6000
Fax: +91 (0) 44 4209 6060
Email: [email protected]

­­© Copyright 2015, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is subject to change without notice. All other trademarks mentioned herein are the property of their respective owners. TL Codex 1546