IBM Software September 2014

Big data integration and Hadoop: Best practices for minimizing risks and maximizing ROI for Hadoop initiatives

Introduction

Apache Hadoop technology is transforming the economics and dynamics of big data initiatives by supporting new processes and architectures that can help cut costs, increase revenue and create competitive advantage. An open source software project that enables the distributed processing and storage of large data sets across clusters of commodity servers, Hadoop can scale from a single server to thousands, as demands change. Primary Hadoop components include the Hadoop Distributed File System for storing large files and the Hadoop distributed parallel processing framework (known as MapReduce).

However, by itself, Hadoop infrastructure does not present a complete big data integration solution, and there are both challenges and opportunities to address before you can reap its benefits and maximize return on investment (ROI).

Big data and Hadoop projects depend on collecting, moving, transforming, cleansing, integrating, governing, exploring and analyzing massive volumes of different types of data from many different sources. Accomplishing all this requires a resilient, end-to-end information integration solution that is massively scalable and provides the infrastructure, capabilities, processes and discipline required to support Hadoop projects.

“By most accounts, 80 percent of the development effort in a big data project goes into data integration and only 20 percent goes toward data analysis.”

—Intel Corporation, “Extract, Transform, and Load Big Data with Apache Hadoop”1

The importance of big data integration for Hadoop initiatives

The rapid emergence of Hadoop is driving a paradigm shift in how organizations ingest, manage, transform, store and analyze big data. Deeper, greater insights, new products and services, and higher service levels are all possible through this technology, enabling you to reduce costs significantly and generate new revenues.

An effective big data integration solution delivers simplicity, speed, scalability, functionality and governance to produce consumable data from the Hadoop swamp. Without effective integration, you get “garbage in, garbage out”—not a good recipe for trusted data, much less accurate and complete insights or transformative results.

As the Hadoop market has evolved, leading technology analysts agree that Hadoop infrastructure by itself is not a complete or effective big data integration solution (read this report that discusses how Hadoop is not a data integration platform). To further complicate the situation, some Hadoop software vendors have saturated the market with hype, myths and misleading or contradictory information.

To cut through this misinformation and develop an adoption plan for your Hadoop big data project, you must follow a best practices approach that takes into account emerging technologies, scalability requirements, and current resources and skill levels. The challenge: create an optimized big data integration approach and architecture while avoiding implementation pitfalls.

Critical success factor: Avoid hype and distinguish fact from fiction

During these emerging stages of the Hadoop market, carefully consider everything you hear about Hadoop’s prowess. A significant gap exists between the myths and the realities of exploiting Hadoop, particularly when it comes to big data integration. There is a lot of industry hype claiming that any non-scalable extract, transform and load (ETL) tool plus Hadoop equals a high-performance, highly scalable data integration platform.

In reality, MapReduce was not designed for the high-performance processing of massive data volumes, but for finely grained fault tolerance. That discrepancy can lower overall performance and efficiency by an order of magnitude or more.

Hadoop Yet Another Resource Negotiator (YARN) takes the resource management capabilities that were in MapReduce and packages them so they can be used by other applications that need to execute dynamically across the Hadoop cluster. As a result, this approach makes it possible to implement massively scalable data integration engines as native Hadoop applications without having to incur the performance limitations of MapReduce. All enterprise technologies seeking to be scalable and efficient on Hadoop will need to adopt YARN as part of their product road map.

Before you start your integration journey, be sure you understand the performance limitations of MapReduce and how different data integration vendors address them. Learn more in the “Themis: An I/O-Efficient MapReduce” paper, which discusses this subject at length: http://bit.ly/1v2UXAT

Massive data scalability: The overarching requirement

If your big data integration solution cannot support massive data scalability, it may fall short of expectations. To realize the full business value of big data initiatives, massive data scalability is essential for big data integration on most Hadoop projects. Massive data scalability means there are no limitations on data volumes processed, processing throughput, or the number of processors and processing nodes used. You can process more data and achieve higher processing throughput simply by adding more hardware. The same application will then run without modification and with increased performance as you add hardware resources (see Figure 1).

[Figure 1 depicts a source-to-EDW pipeline (source, transform, cleanse, enrich, EDW data) running sequentially on a uniprocessor with its own CPU, memory and disk; 4-way parallel on an SMP system with shared memory; and 64-way parallel on an MPP clustered system or grid.]

Figure 1. Massive data scalability is a mandatory requirement for big data integration. In the big data era, organizations must be able to support an MPP clustered system to scale.

Requirements for supporting massive data scalability are not only linked to the emergence of the Hadoop infrastructure. Leading vendors such as IBM and Teradata, and leading data integration platforms such as IBM® InfoSphere® Information Server, have provided shared-nothing, massively parallel software platforms supporting massive data scalability for years—for nearly two decades in some cases.

Over time, these vendors have converged on four common software architecture characteristics that support massive data scalability, as shown in Figure 2.

Critical success factor: Big data integration platforms must support all three dimensions of scalability

• Linear data scalability: A hardware and software system delivers linear increases in processing throughput with linear increases in hardware resources. For example, an application delivers linear data scalability if it can process 200 GB of data in four hours running on 50 processors, 400 GB of data in four hours running on 100 processors and so on (a minimal check of this property is sketched below).
• Application scale-up: A measurement of how effectively the software achieves linear data scalability across processors within one symmetric multiprocessor (SMP) system.
• Application scale-out: A determination of how well the software achieves linear data scalability across SMP nodes in a shared-nothing architecture.
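To make the linear data scalability criterion concrete, here is a minimal Python sketch (illustrative only, not part of any IBM product) that checks whether a set of benchmark measurements stays within a tolerance of linear scaling, using the hypothetical figures from the example above.

```python
# Illustrative check for linear data scalability: throughput per processor
# should stay roughly constant as data volume and processors grow together.

def is_linearly_scalable(runs, tolerance=0.10):
    """runs: list of (gb_processed, processors, hours); tolerance: allowed
    relative drop in per-processor throughput versus the smallest run."""
    baseline = None
    for gb, procs, hours in sorted(runs, key=lambda r: r[1]):
        throughput_per_proc = gb / (procs * hours)   # GB per processor-hour
        if baseline is None:
            baseline = throughput_per_proc
        elif throughput_per_proc < baseline * (1 - tolerance):
            return False
    return True

# Hypothetical measurements matching the example in the text:
# 200 GB in 4 hours on 50 processors, 400 GB in 4 hours on 100 processors.
runs = [(200, 50, 4), (400, 100, 4), (800, 200, 4)]
print(is_linearly_scalable(runs))  # True: per-processor throughput is constant
```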

1. A shared-nothing architecture: Software is designed from the ground up to exploit a shared-nothing, massively parallel architecture by partitioning data sets across computing nodes and executing a single application with the same application logic executing against each data partition.
2. Implemented using software dataflow: Software dataflow enables full exploitation of a shared-nothing architecture by making it easy to implement and execute data pipelining and data partitioning within a node and across nodes. Software dataflow also hides the complexities of building and tuning parallel applications from users.
3. That leverages data partitioning for linear data scalability: Large data sets are partitioned across separate nodes and a single job executes the same application logic against all partitioned data.
4. Resulting in a design isolation environment: Design a data processing job once, and use it in any hardware configuration without needing to redesign and re-tune the job.

Figure 2. The four characteristics of massive data scalability.

Most commercial data integration software platforms were never designed to support massive data scalability, meaning they were not built from the ground up to exploit a shared-nothing, massively parallel architecture. They rely on shared memory multithreading instead of software dataflow.

Furthermore, some vendors do not support partitioning large data sets across nodes and running a single data integration job in parallel against separate data partitions, or the ability to design a job once and use it in any hardware configuration without needing to redesign and retune the job. These capabilities are critical to reducing costs by realizing efficiency gains. Without them, the platform won’t be able to work with big data volumes.

The InfoSphere Information Server data integration portfolio supports the four massive data scalability architectural characteristics. Learn more in the Forrester report, “Measuring The Total Economic Impact Of IBM InfoSphere Information Server” at http://ibm.co/UX1RqB
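The partitioned, run-the-same-logic-everywhere execution model described above can be pictured with a short Python sketch. This is a toy illustration under stated assumptions (hash partitioning on a key and one worker process per partition), not how InfoSphere Information Server or any particular engine is implemented.

```python
# Toy illustration of shared-nothing, partitioned execution: the same
# transformation logic runs independently against each data partition.
from multiprocessing import Pool

NUM_PARTITIONS = 4  # stands in for the number of processing nodes

def partition_key(record):
    # Hash-partition on a key so related records land in the same partition.
    return hash(record["customer_id"]) % NUM_PARTITIONS

def transform(records):
    # The single piece of application logic, applied unchanged to every partition.
    return [{**r, "amount_rounded": round(r["amount"], 2)} for r in records]

def run_job(records):
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for r in records:
        partitions[partition_key(r)].append(r)
    with Pool(NUM_PARTITIONS) as pool:          # one worker per partition
        results = pool.map(transform, partitions)
    return [row for part in results for row in part]

if __name__ == "__main__":
    data = [{"customer_id": i, "amount": 10.057 * i} for i in range(12)]
    print(len(run_job(data)))  # 12: all records processed, partition by partition
```

Scaling out in this model means raising the number of partitions and workers; the transform logic itself never changes, which is the essence of the "design once" characteristic.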

Optimizing big data integration workloads: A balanced approach

Because nearly all Hadoop big data use cases and scenarios first require big data integration, organizations must determine how to optimize these workloads across the enterprise. One of the leading use cases for Hadoop and big data integration is offloading big ETL workloads from the enterprise data warehouse (EDW) to reduce costs and improve query service-level agreements (SLAs). That use case raises the following questions:

● Should organizations offload all ETL workloads from the EDW?
● Should all big data integration workloads be pushed into Hadoop?
● What is the ongoing role for big data integration workloads in an ETL grid without a parallel relational database management system (RDBMS) and without Hadoop?

The right answer to these questions depends on an enterprise’s unique big data requirements. Organizations can choose among a parallel RDBMS, Hadoop and a scalable ETL grid for running big data integration workloads. But no matter which method they select, the information infrastructure must meet one common requirement: full support for massively scalable processing.

Some data integration operations run more efficiently inside or outside of the RDBMS engine. Likewise, not all data integration operations are well suited for the Hadoop environment. A well-designed architecture must be flexible enough to leverage the strengths of each environment in the system (see Figure 3).

Run in the ETL grid

Advantages: Exploit ETL MPP engine; exploit commodity hardware and storage; exploit grid to consolidate SMP servers; perform complex transforms (data cleansing) that can’t be pushed into the RDBMS; free up capacity on RDBMS server; process heterogeneous data sources; ETL server faster for some processes.
Disadvantages: ETL server slower for some processes (data already stored in relational tables); may require extra hardware (low-cost hardware).

Run in the database

Advantages: Exploit database MPP engine; minimize data movement; leverage database for joins/aggregations; works best when data is already clean; free up cycles on ETL server; use excess capacity on RDBMS server; database faster for some processes.
Disadvantages: Expensive hardware and storage; degradation of query SLAs; not all ETL logic can be pushed into the RDBMS (with ETL tool or hand coding); can’t exploit commodity hardware; usually requires hand coding; limitations on complex transformations; limited data cleansing; database slower for some processes.

Run in Hadoop

Advantages: Exploit MapReduce MPP engine; exploit commodity hardware and storage; free up capacity on database server; support processing of unstructured data; exploit Hadoop capabilities for persisting data (such as updating and indexing); enables low-cost archiving of history data (not stored in the database).
Disadvantages: Can require complex programming; MapReduce will usually be much slower than a parallel database or a scalable ETL tool; risk: Hadoop is still a young technology.

Figure 3. Big data integration requires a balanced approach that can leverage the strength of any environment.
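As a thought experiment on the balanced placement shown in Figure 3, the sketch below encodes a few of those trade-offs as simple heuristics. The rules, thresholds and attribute names are illustrative assumptions, not IBM guidance or product behavior.

```python
# Illustrative-only heuristic for choosing where a data integration job runs.
# The rules loosely mirror the advantages/disadvantages listed in Figure 3.

def choose_engine(job):
    """job: dict with boolean keys 'data_in_rdbms', 'needs_cleansing',
    'unstructured_sources', 'simple_sql_transform'."""
    if job["unstructured_sources"]:
        return "hadoop"          # commodity storage, unstructured data support
    if job["data_in_rdbms"] and job["simple_sql_transform"]:
        return "database"        # minimize data movement, exploit DB MPP engine
    if job["needs_cleansing"]:
        return "etl_grid"        # complex transforms that can't be pushed down
    return "etl_grid"            # default: scalable ETL engine

example = {"data_in_rdbms": True, "needs_cleansing": False,
           "unstructured_sources": False, "simple_sql_transform": True}
print(choose_engine(example))    # "database"
```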

Here are three important guidelines to follow when optimizing big data integration workloads:

1. Push big data integration processing to the data instead of pushing the data to the processing: Specify appropriate processes that can be executed in the RDBMS, in Hadoop or in the ETL grid.
2. Avoid hand coding: Hand coding is expensive and does not effectively support rapid or frequent changes. It also doesn’t support the automated collection of design and operational metadata that is critical for data governance.
3. Do not maintain separate silos of integration development for the RDBMS, Hadoop and the ETL grid: This serves no practical purpose and becomes tremendously expensive to support. You should be able to build a job once and run it in any of the three environments.

Processes best suited to Hadoop

Hadoop platforms comprise two primary components: a distributed, fault-tolerant file system called the Hadoop Distributed File System (HDFS), and a parallel processing framework called MapReduce.

The HDFS platform is very good at processing large sequential operations, where a “slice” of data read is often 64 MB or 128 MB. Generally, HDFS files are not partitioned or ordered unless the application loading the data manages this. Even if the application can partition and order the resulting data slices, there is no way to guarantee where that slice will be placed in the HDFS system. This means there is no good way to manage data collocation in this environment. Data collocation is critical because it ensures data with the same join keys winds up on the same nodes, and therefore the process is both high-performing and accurate.

While there are ways to accommodate the lack of support for data collocation, they tend to be costly—typically requiring extra processing and/or restructuring of the application. HDFS files are also immutable (read only), and processing an HDFS file is similar to running a full table scan in that most often all the data is processed. This should immediately raise a red flag for operations such as joining two very large tables, since the data will likely not be collocated on the same Hadoop node.

MapReduce Version 1 is a parallel processing framework that was not specifically designed for processing large ETL workloads with high performance. By default, data can be repartitioned or re-collocated between the map and the reduce phase of processing. To facilitate recovery, the data is landed on the node running the map operation before being shuffled and sent to the reduce operation.

MapReduce contains facilities to move smaller reference data structures to each map node for some validation and enhancement operations. The entire reference file is moved to each map node, which makes this approach appropriate only for smaller reference data structures. If you are hand coding, you must account for these processing flows, so it is best to adopt tools that generate code to push data integration logic down into MapReduce (also known as ETL pushdown).
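To make the reference-data pattern above concrete, here is a minimal Hadoop Streaming-style mapper in Python that performs a map-side (replicated) join: a small lookup file shipped to every map task enriches each incoming record. This is a hand-coded sketch of the general technique, not the code a tool such as InfoSphere DataStage would generate; the file names and field layout are assumptions for illustration.

```python
#!/usr/bin/env python
# enrich_mapper.py - map-side join for Hadoop Streaming (illustrative sketch).
# A small reference file (country_codes.tsv) is shipped to every map task,
# loaded into memory, and used to enrich each input record; no reduce phase
# or shuffle is needed because only the small side of the join is replicated.
import sys

def load_reference(path="country_codes.tsv"):
    lookup = {}
    with open(path) as f:
        for line in f:
            code, name = line.rstrip("\n").split("\t")
            lookup[code] = name
    return lookup

def main():
    ref = load_reference()
    for line in sys.stdin:                     # e.g. "cust_id<TAB>country_code"
        cust_id, code = line.rstrip("\n").split("\t")
        country = ref.get(code, "UNKNOWN")     # validation/enhancement step
        print("\t".join([cust_id, code, country]))

if __name__ == "__main__":
    main()
```

A job like this would typically be submitted with the streaming jar, distributing the lookup file to each task, along the lines of `hadoop jar hadoop-streaming.jar -files enrich_mapper.py,country_codes.tsv -mapper "python enrich_mapper.py" -reducer NONE -input /data/customers -output /data/customers_enriched` (exact jar name, options and paths depend on the distribution). Joining two very large files, by contrast, forces a reduce-side join and the repartitioning and shuffling described above.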

Using ETL pushdown processing in Hadoop (regardless of the tool doing the pushing) can create a situation where a nontrivial portion of the data integration processing must continue to run in the ETL engine and not on MapReduce. This is true for several reasons:

● More complex logic cannot be pushed into MapReduce
● MapReduce has significant performance limitations
● Data is typically stored in HDFS in a random sequential manner

All of these factors suggest that big data integration in a Hadoop environment requires three components for high-performance workload processing:

1) A Hadoop distribution
2) A shared-nothing, massively scalable ETL platform (such as the one offered by IBM InfoSphere Information Server)
3) ETL pushdown capability into MapReduce

All three components are required because a large percentage of data integration logic cannot be pushed into MapReduce without hand coding, and because MapReduce has known performance limitations.

Critical success factor: Consider data integration workload processing speeds

The InfoSphere Information Server shared-nothing, massively parallel architecture is optimized for processing large data integration workloads efficiently with high performance. IBM InfoSphere DataStage®—a part of InfoSphere Information Server that integrates data across multiple systems using a high-performance parallel framework—can process typical data integration workloads 10 to 15 times faster than MapReduce.2

InfoSphere DataStage also offers balanced optimization for the Hadoop environment. Balanced optimization generates Jaql code to run natively in the MapReduce environment. Jaql comes with an optimizer that will analyze the generated code and optimize it into a map component and a reduce component. This automates a traditionally complex development task and frees the developer from worrying about the MapReduce architecture.

InfoSphere DataStage can run directly on the Hadoop nodes rather than on a separate node in the configuration, which some vendor implementations require. This capability helps reduce network traffic when coupled with IBM General Parallel File System (GPFS™)-FPO, which provides a POSIX-compliant storage subsystem in the Hadoop environment. A POSIX file system allows ETL jobs to directly access data stored in Hadoop rather than requiring use of the HDFS interface. This environment supports moving the ETL workload into the hardware environment that Hadoop is running on—helping to move the processing to where the data is stored and leveraging the hardware for both Hadoop and ETL processing.

Resource management systems such as IBM Platform™ Symphony can also be used to manage data integration workloads both inside and outside of the Hadoop environment.

This means that although InfoSphere DataStage may not run on the exact same node as the data, it runs on the same high-speed backplane, eliminating the need to move the data out of the Hadoop environment and across slower network connections.

ETL scalability requirements for supporting Hadoop

Many Hadoop software vendors evangelize the idea that any non-scalable ETL tool with pushdown into MapReduce will provide excellent performance and application scale-out for big data integration—but this is simply not true.

Without a shared-nothing, massively scalable ETL engine such as InfoSphere DataStage, organizations will experience functional and performance limitations. More and more organizations are realizing that competing non-scalable ETL tools with pushdown into MapReduce are not capable of providing required levels of performance in Hadoop. They are working with IBM to address this issue because the IBM big data integration solution uniquely supports massive data scalability for big data integration.

Here are some of the cumulative negative effects from overreliance on ETL pushdown:

● ETL comprises a large percentage of the EDW workload. Because of the associated costs, the EDW is a very expensive platform for running ETL workloads.
● ETL workloads cause degradation in query SLAs, and eventually require you to invest in additional, expensive EDW capacity.
● Data is not cleansed prior to being dumped into the EDW and is never cleansed once in the EDW environment, promoting poor data quality.
● The organization continues to rely heavily on manual coding of SQL scripts for data transformations.
● Adding new data sources or modifying existing ETL scripts is expensive and takes a long time, limiting the ability to respond quickly to new requirements.
● Data transformations are relatively simple because more complex logic cannot be pushed into the RDBMS using an ETL tool.
● Data quality suffers.
● Critical tasks such as data profiling are not automated—and in many cases are not performed at all.
● No meaningful data governance (data stewardship, data lineage, impact analysis) is implemented, making it more difficult and expensive to respond to regulatory requirements and have confidence in critical business data.

In contrast, organizations adopting massively scalable data integration platforms that optimize big data integration workloads minimize potential negative effects, leaving them in a better position to transform their business with big data.

Best practices for big data integration

Once you’ve decided to adopt Hadoop for your big data initiatives, how do you implement big data integration projects while protecting yourself against Hadoop variability?

Working with numerous early adopters of Hadoop technology, IBM has identified five fundamental big data integration best practices. These five principles represent best-of-breed approaches for successful big data integration initiatives:

1. Avoid hand coding anywhere for any purpose
2. One data integration and governance platform for the enterprise
3. Massively scalable data integration available wherever it needs to run
4. World-class data governance across the enterprise
5. Robust administration and operations control across the enterprise

“When in doubt, use higher-level tools whenever possible.”

—“Large-Scale ETL With Hadoop,” Strata+Hadoop World 2012 presentation given by Eric Sammer, Principal Solution Architect, Cloudera4

Best practice #1: Avoid hand coding anywhere for any purpose

Over the past two decades, large organizations have recognized the many advantages of replacing hand coding with commercial data integration tools. The debate between hand coding versus data integration tooling has been settled, and many technology analysts have summarized the significant ROI advantages3 to be realized from adoption of world-class data integration software.

The first best practice is to avoid hand coding anywhere, for any aspects of big data integration. Instead, take advantage of graphical user interfaces available with commercial data integration software to support activities such as:

● Data access and movement across the enterprise
● Data integration logic
● Assembling data integration jobs from logic objects
● Assembling larger workflows
● Data governance
● Operational and administrative management

By adopting this best practice, organizations can exploit the proven productivity, cost, time to value, and robust operational and administrative control advantages of commercial data integration software while avoiding the negative impact of hand coding (see Figure 4).

[Figure 4 contrasts hand coding with data integration tools. The graphic depicts parallel reads from an HDFS file, transform/restructure steps, a join of two HDFS files, an IBM PureData™ System and creation of a new, fully parallelized HDFS file. It notes that a pre-built data integration solution can streamline the creation of data integration jobs from logic objects and can help map and manage data governance requirements across the enterprise, whereas diverse data access and integration requirements from across the enterprise spawn a complex assortment of UIs when hand coded.

Development using data integration tools: 2 days to write; graphical; self-documenting; reusable; more maintainable; improved performance.
Development using hand coding: 30 man days to write; almost 2,000 lines of code (71,000 characters); no documentation; difficult to reuse; difficult to maintain.
Result: 87 percent savings in development costs compared to hand coding. Source for hand coding and tooling results: IBM pharmaceutical customer example.]

Figure 4. Data integration software provides multiple GUIs to support various activities. These GUIs replace complex hand coding and save organizations significant amounts of development costs.

Best practice #2: One data integration and governance platform for the enterprise

Overreliance on pushing ETL into the RDBMS (due to a lack of scalable data integration software tooling) has prevented many organizations from replacing SQL script hand coding and establishing meaningful data governance across the enterprise. Nevertheless, they recognize there are huge cost savings to be had by moving large ETL workloads from the RDBMS to Hadoop. However, moving from a silo of ETL hand coding in the RDBMS to a new silo of hand coding of ETL in Hadoop only doubles down on high costs and long lead times.

Deploying a single data integration platform provides the opportunity for organizational transformation by the ability to:

● Build a job once and run it anywhere on any platform in the enterprise without modification
● Access, move and load data between a variety of sources and targets across the enterprise
● Support a variety of data integration paradigms, including batch processing; federation; change data capture; SOA enablement of data integration tasks; real-time integration with transactional integrity; and/or self-service data integration for business users

It also provides an opportunity to establish world-class data governance, including data stewardship, data lineage and cross-tool impact analysis.

Best practice #3: Massively scalable data integration available wherever it needs to run

Hadoop offers significant potential for the large-scale, distributed processing of data integration workloads at extremely low cost. However, clients need a massively scalable data integration solution to realize the potential advantages that Hadoop can deliver.

[Figure 5: Design a job once, then run and scale it anywhere. Outside the Hadoop environment — Case 1: InfoSphere Information Server parallel engine running against any traditional data source; Case 2: Push processing into a parallel database; Case 3: Move and process data in parallel between environments. Within the Hadoop environment — Case 4: Push processing into MapReduce; Case 5: InfoSphere Information Server parallel engine running against HDFS without MapReduce.]

Figure 5. Scalable big data integration must be available for any environment.
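To give a feel for the "design once, run anywhere" idea in Figure 5, the sketch below separates a job's logical definition from the engine chosen at run time. The class names and engine labels are illustrative assumptions; they do not reflect the actual InfoSphere Information Server design or APIs.

```python
# Illustrative separation of job design from execution environment.
from dataclasses import dataclass

@dataclass
class JobDesign:
    """Logical definition only: no knowledge of where it will run."""
    name: str
    source: str          # e.g. "db2://sales" or "hdfs:///raw/sales"
    target: str
    steps: tuple         # ordered transform/cleanse/enrich step names

ENGINES = ("etl_grid", "rdbms_pushdown", "mapreduce_pushdown", "hdfs_parallel")

def run(job: JobDesign, engine: str, degree_of_parallelism: int) -> str:
    # In a real platform the same design would be compiled and optimized for
    # the chosen engine; here we only report the requested execution plan.
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine}")
    return (f"{job.name}: {len(job.steps)} steps on {engine} "
            f"with {degree_of_parallelism}-way parallelism")

job = JobDesign("sales_load", "hdfs:///raw/sales", "edw.sales_fact",
                ("transform", "cleanse", "enrich"))
print(run(job, "hdfs_parallel", 64))   # same design, different engines and scale
```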

Scenarios for running the data integration workload may include:

● The parallel RDBMS
● The grid without the RDBMS or Hadoop
● In Hadoop, with or without pushdown into MapReduce
● Between the Hadoop environment and the outside environment, extracting data volumes on one side, processing and transforming the records in flight, and loading the records on the other side

To achieve success and sustainability—and to keep costs low—an effective big data integration solution must flexibly support each of these scenarios. Based on IBM experience with big data customers, InfoSphere Information Server currently is the only commercial data integration software platform that supports all of these scenarios, including pushdown of data integration logic into MapReduce.

There are many myths circulating within the industry about running ETL tools in Hadoop for big data integration. The popular wisdom seems to be that combining any non-scalable ETL tool and Hadoop provides all required massively scalable data integration processing. In reality, MapReduce suffers several limitations for processing large-scale data integration workloads:

● Not all data integration logic can be pushed into MapReduce using the ETL tool. Based on experiences with its clients, IBM estimates that about 50 percent of data integration logic cannot be pushed into MapReduce.
● Users have to engage in complex hand coding to run more complex data integration logic in Hadoop, or restrict the process to running relatively simple transformations in MapReduce.
● MapReduce has known performance limitations for processing large data integration workloads, as it was designed to support finely grained fault tolerance at the expense of high-performance processing.

Best practice #4: World-class data governance across the enterprise

Most large organizations have found it difficult, if not impossible, to establish data governance across the enterprise. There are several reasons for this. For example, business users manage data using business terminology that is familiar to them. Until recently, there has been no mechanism for defining, controlling and managing this business terminology and linking it to IT assets.

Also, neither business users nor IT staff have a high degree of confidence in their data, and may be uncertain of its origins and/or history. The technology for creating and managing data governance through capabilities such as data lineage and cross-tool impact analysis did not exist, and manual methods involve overwhelming complexity. Industry regulatory requirements only add to the complexity of managing governance. Finally, overreliance on hand coding for data integration makes it difficult to implement data governance throughout an organization.

It is essential to establish world-class data governance—with a fully governed data lifecycle for all key data assets—that includes the Hadoop environment but is not limited to it. Here are suggested steps for a comprehensive data lifecycle:

● Find: Leverage terms, labels and collections to find governed, curated data sources
● Curate: Add labels, terms, custom properties to relevant assets
● Collect: Use collections to capture assets for a specific analysis or governance effort
● Collaborate: Share collections for additional curation and governance
● Govern: Create and reference information governance policies and rules; apply data quality, masking, archiving and cleansing to data
● Offload: Copy data in one click to HDFS for analysis for warehouse augmentation
● Analyze: Analyze offloaded data
● Reuse and trust: Understand how data is being used with lineage for analysis and reports (see the lineage sketch at the end of this section)

With a comprehensive data governance initiative in place, you can build an environment that helps ensure all Hadoop data is of high quality, secure and fit for purpose. It enables business users to answer questions such as:

● Do I understand the content and meaning of this data?
● Can I measure the quality of this information?
● Where does the data in my report come from?
● What is being done to the data inside of Hadoop?
● Where was it before reaching our Hadoop data lake?

Best practice #5: Robust administration and operations control across the enterprise

Organizations adopting Hadoop for big data integration must expect robust, mainframe-class administration and operations management, including:

● An operations console interface that provides quick answers for anyone operating the data integration applications, developers and other stakeholders as they monitor the runtime environment
● Workload management to allocate resource priority to certain projects in a shared-services environment and queue workloads on a busy system
● Performance analysis for insight into resource consumption to identify bottlenecks and determine when systems may need more resources
● Building workflows that include Hadoop-based activities defined through Oozie directly in the job sequence, as well as other data integration activities

Administration management for big data integration must include:

● An integrated web-based installer for all capabilities
● High-availability configurations for meeting 24/7 requirements
● Flexible deployment options to deploy new instances or expand existing instances on expert, optimized hardware systems
● Centralized authentication, authorization and session management
● Audit logging of security-related events to promote Sarbanes-Oxley Act compliance
● Lab certification for various Hadoop distributions
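As one way to picture the lineage element of the data lifecycle above, the sketch below records which sources and which job produced a derived data set, so a report's origin can be traced back step by step. The structure is an illustrative assumption, not the metadata model of any IBM product.

```python
# Illustrative lineage record: each derived asset remembers its inputs and
# the job that produced it, enabling a simple "where did this come from?" walk.
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    produced_by: str = ""                        # job name, empty for raw sources
    inputs: list = field(default_factory=list)   # upstream Asset objects

def lineage(asset, depth=0):
    """Print the upstream chain for an asset, most derived first."""
    origin = f" <- {asset.produced_by}" if asset.produced_by else " (raw source)"
    print("  " * depth + asset.name + origin)
    for parent in asset.inputs:
        lineage(parent, depth + 1)

crm = Asset("crm_extract")
weblogs = Asset("hdfs:///raw/weblogs")
cleansed = Asset("customers_cleansed", "cleanse_job", [crm])
report_src = Asset("churn_report_table", "churn_load_job", [cleansed, weblogs])
lineage(report_src)   # walks back to the raw CRM extract and web logs
```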

Best practices for big data integration set a foundation for success

Organizations are looking to big data initiatives to help them cut costs, increase revenue and gain first-mover advantages. Hadoop technology supports new processes and architectures that enable business transformation, but certain big data challenges and opportunities must be addressed before this can happen.

IBM recommends building a big data integration architecture that is flexible enough to leverage the strengths of the RDBMS, the ETL grid and Hadoop environments. Users should be able to construct an integration workflow once and then run it in any of these three environments.

The five big data integration best practices outlined in this paper represent best-of-breed approaches that set your projects up for success. Following these guidelines can help your organization minimize risks and costs and maximize ROI for your Hadoop projects.

For more information

To learn more about the big data integration best practices and IBM integration solutions, please contact your IBM representative or IBM Business Partner, or visit: ibm.com/software/data/integration

Additionally, IBM Global Financing can help you acquire the software capabilities that your business needs in the most cost-effective and strategic way possible. We’ll partner with credit-qualified clients to customize a financing solution to suit your business and development goals, enable effective cash management, and improve your total cost of ownership. Fund your critical IT investment and propel your business forward with IBM Global Financing. For more information, visit: ibm.com/financing

© Copyright IBM Corporation 2014

IBM Corporation Software Group Route 100 Somers, NY 10589

Produced in the United States of America September 2014

IBM, the IBM logo, ibm.com, DataStage, GPFS, InfoSphere, Platform, and PureData are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

It is the user’s responsibility to evaluate and verify the operation of any other products or programs with IBM products and programs.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

The client is responsible for ensuring compliance with laws and regulations applicable to it. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the client is in compliance with any law or regulation. Actual available storage capacity may be reported for both uncompressed and compressed data and will vary and may be less than stated.

1 Intel Corporation. “Extract, Transform, and Load Big Data With Apache Hadoop.” July 2013. http://intel.ly/UX1Umk

2 Measurements produced by IBM while working on-site with a customer deployment.

3 International Technology Group. “Business Case for Enterprise Data Integration Strategy: Comparing IBM InfoSphere Information Server and Open Source Tools.” February 2013. ibm.com/common/ssi/cgi-bin/ssialias?infotype=PM&subtype=XB&htmlfid=IME14019USEN

4 “Large-Scale ETL With Hadoop,” Strata+Hadoop World 2012 presentation given by Eric Sammer, Principal Solution Architect, Cloudera. www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-large-scale-etl-with-hadoop.html

Please Recycle

IMW14791-USEN-00