IBM Software September 2014

Big data integration and Hadoop: Best practices for minimizing risks and maximizing ROI for Hadoop initiatives

Introduction

Apache Hadoop technology is transforming the economics and dynamics of big data initiatives by supporting new processes and architectures that can help cut costs, increase revenue and create competitive advantage. An open source software project that enables the distributed processing and storage of large data sets across clusters of commodity servers, Hadoop can scale from a single server to thousands, as demands change. Primary Hadoop components include the Hadoop Distributed File System for storing large files and the Hadoop distributed parallel processing framework (known as MapReduce).

However, by itself, Hadoop infrastructure does not present a complete big data integration solution, and there are both challenges and opportunities to address before you can reap its benefits and maximize return on investment (ROI).

Big data and Hadoop projects depend on collecting, moving, transforming, cleansing, integrating, governing, exploring and analyzing massive volumes of different types of data from many different sources. Accomplishing all this requires a resilient, end-to-end information integration solution that is massively scalable and provides the infrastructure, capabilities, processes and discipline required to support Hadoop projects.

“By most accounts, 80 percent of the development effort in a big data project goes into data integration and only 20 percent goes toward data analysis.”

—Intel Corporation, “Extract, Transform, and Load Big Data with Apache Hadoop”1

The importance of big data integration for Hadoop initiatives

The rapid emergence of Hadoop is driving a paradigm shift in how organizations ingest, manage, transform, store and analyze big data. Deeper, greater insights, new products and services, and higher service levels are all possible through this technology, enabling you to reduce costs significantly and generate new revenues.

An effective big data integration solution delivers simplicity, speed, scalability, functionality and governance to produce consumable data from the Hadoop swamp. Without effective integration, you get “garbage in, garbage out”—not a good recipe for trusted data, much less accurate and complete insights or transformative results.

As the Hadoop market has evolved, leading technology analysts agree that Hadoop infrastructure by itself is not a complete or effective big data integration solution (read this report that discusses how Hadoop is not a data integration platform). To further complicate the situation, some Hadoop software vendors have saturated the market with hype, myths and misleading or contradictory information.

To cut through this misinformation and develop an adoption plan for your Hadoop big data project, you must follow a best practices approach that takes into account emerging technologies, scalability requirements, and current resources and skill levels. The challenge: create an optimized big data integration approach and architecture while avoiding implementation pitfalls.

Critical success factor: Avoid hype and distinguish fact from fiction

During these emerging stages of the Hadoop market, carefully consider everything you hear about Hadoop’s prowess. A significant gap exists between the myths and the realities of exploiting Hadoop, particularly when it comes to big data integration. There is a lot of industry hype claiming that any non-scalable extract, transform and load (ETL) tool plus Hadoop equals a high-performance, highly scalable data integration platform.

In reality, MapReduce was not designed for the high-performance processing of massive data volumes, but for finely grained fault tolerance. That discrepancy can lower overall performance and efficiency by an order of magnitude or more.

Hadoop Yet Another Resource Negotiator (YARN) takes the resource management capabilities that were in MapReduce and packages them so they can be used by other applications that need to execute dynamically across the Hadoop cluster. As a result, this approach makes it possible to implement massively scalable data integration engines as native Hadoop applications without having to incur the performance limitations of MapReduce. All enterprise technologies seeking to be scalable and efficient on Hadoop will need to adopt YARN as part of their product road map.

Before you start your integration journey, be sure you understand the performance limitations of MapReduce and how different data integration vendors address them. Learn more in the “Themis: An I/O-Efficient MapReduce” paper, which discusses this subject at length: http://bit.ly/1v2UXAT

Massive data scalability: The overarching requirement

If your big data integration solution cannot support massive data scalability, it may fall short of expectations. To realize the full business value of big data initiatives, massive data scalability is essential for big data integration on most Hadoop projects. Massive data scalability means there are no limitations on data volumes processed, processing throughput, or the number of processors and processing nodes used. You can process more data and achieve higher processing throughput simply by adding more hardware. The same application will then run without modification and with increased performance as you add hardware resources (see Figure 1).

[Figure 1 depicts a source-to-EDW pipeline (source, transform, cleanse, enrich, EDW data) running sequentially on a uniprocessor with its own CPU, memory and disk; 4-way parallel on an SMP system with shared memory; and 64-way parallel on an MPP clustered system or grid.]

Figure 1. Massive data scalability is a mandatory requirement for big data integration. In the big data era, organizations must be able to support an MPP clustered system to scale.

Requirements for supporting massive data scalability are not only linked to the emergence of the Hadoop infrastructure. Leading vendors such as IBM and Teradata, and leading data integration platforms such as IBM® InfoSphere® Information Server, have provided shared-nothing, massively parallel software platforms supporting massive data scalability for years—for nearly two decades in some cases.

Over time, these vendors have converged on four common software architecture characteristics that support massive data scalability, as shown in Figure 2.

Critical success factor: Big data integration platforms must support all three dimensions of scalability

• Linear data scalability: A hardware and software system delivers linear increases in processing throughput with linear increases in hardware resources. For example, an application delivers linear data scalability if it can process 200 GB of data in four hours running on 50 processors, 400 GB of data in four hours running on 100 processors and so on (a minimal check of this property is sketched below).
• Application scale-up: A measurement of how effectively the software achieves linear data scalability across processors within one symmetric multiprocessor (SMP) system.
• Application scale-out: A determination of how well the software achieves linear data scalability across SMP nodes in a shared-nothing architecture.
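To make the linear data scalability criterion concrete, here is a minimal Python sketch (illustrative only, not part of any IBM product) that checks whether a set of benchmark measurements stays within a tolerance of linear scaling, using the hypothetical figures from the example above.

```python
# Illustrative check for linear data scalability: throughput per processor
# should stay roughly constant as data volume and processors grow together.

def is_linearly_scalable(runs, tolerance=0.10):
    """runs: list of (gb_processed, processors, hours); tolerance: allowed
    relative drop in per-processor throughput versus the smallest run."""
    baseline = None
    for gb, procs, hours in sorted(runs, key=lambda r: r[1]):
        throughput_per_proc = gb / (procs * hours)   # GB per processor-hour
        if baseline is None:
            baseline = throughput_per_proc
        elif throughput_per_proc < baseline * (1 - tolerance):
            return False
    return True

# Hypothetical measurements matching the example in the text:
# 200 GB in 4 hours on 50 processors, 400 GB in 4 hours on 100 processors.
runs = [(200, 50, 4), (400, 100, 4), (800, 200, 4)]
print(is_linearly_scalable(runs))  # True: per-processor throughput is constant
```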

1. A shared-nothing architecture: Software is designed from the ground up to exploit a shared-nothing, massively parallel architecture by partitioning data sets across computing nodes and executing a single application with the same application logic executing against each data partition.
2. Implemented using software dataflow: Software dataflow enables full exploitation of a shared-nothing architecture by making it easy to implement and execute data pipelining and data partitioning within a node and across nodes. Software dataflow also hides the complexities of building and tuning parallel applications from users.
3. That leverages data partitioning for linear data scalability: Large data sets are partitioned across separate nodes and a single job executes the same application logic against all partitioned data.
4. Resulting in a design isolation environment: Design a data processing job once, and use it in any hardware configuration without needing to redesign and re-tune the job.

Figure 2. The four characteristics of massive data scalability.

Most commercial data integration software platforms were never designed to support massive data scalability, meaning they were not built from the ground up to exploit a shared-nothing, massively parallel architecture. They rely on shared memory multithreading instead of software dataflow.

Furthermore, some vendors do not support partitioning large data sets across nodes and running a single data integration job in parallel against separate data partitions, or the ability to design a job once and use it in any hardware configuration without needing to redesign and retune the job. These capabilities are critical to reducing costs by realizing efficiency gains. Without them, the platform won’t be able to work with big data volumes.

The InfoSphere Information Server data integration portfolio supports the four massive data scalability architectural characteristics. Learn more in the Forrester report, “Measuring The Total Economic Impact Of IBM InfoSphere Information Server” at http://ibm.co/UX1RqB
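The partitioned, run-the-same-logic-everywhere execution model described above can be pictured with a short Python sketch. This is a toy illustration under stated assumptions (hash partitioning on a key and one worker process per partition), not how InfoSphere Information Server or any particular engine is implemented.

```python
# Toy illustration of shared-nothing, partitioned execution: the same
# transformation logic runs independently against each data partition.
from multiprocessing import Pool

NUM_PARTITIONS = 4  # stands in for the number of processing nodes

def partition_key(record):
    # Hash-partition on a key so related records land in the same partition.
    return hash(record["customer_id"]) % NUM_PARTITIONS

def transform(records):
    # The single piece of application logic, applied unchanged to every partition.
    return [{**r, "amount_rounded": round(r["amount"], 2)} for r in records]

def run_job(records):
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for r in records:
        partitions[partition_key(r)].append(r)
    with Pool(NUM_PARTITIONS) as pool:          # one worker per partition
        results = pool.map(transform, partitions)
    return [row for part in results for row in part]

if __name__ == "__main__":
    data = [{"customer_id": i, "amount": 10.057 * i} for i in range(12)]
    print(len(run_job(data)))  # 12: all records processed, partition by partition
```

Scaling out in this model means raising the number of partitions and workers; the transform logic itself never changes, which is the essence of the "design once" characteristic.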

Optimizing big data integration workloads: A balanced approach

Because nearly all Hadoop big data use cases and scenarios first require big data integration, organizations must determine how to optimize these workloads across the enterprise. One of the leading use cases for Hadoop and big data integration is offloading big ETL workloads from the enterprise data warehouse (EDW) to reduce costs and improve query service-level agreements (SLAs). That use case raises the following questions:

● Should organizations offload all ETL workloads from the EDW?
● Should all big data integration workloads be pushed into Hadoop?
● What is the ongoing role for big data integration workloads in an ETL grid without a parallel relational database management system (RDBMS) and without Hadoop?

The right answer to these questions depends on an enterprise’s unique big data requirements. Organizations can choose among a parallel RDBMS, Hadoop and a scalable ETL grid for running big data integration workloads. But no matter which method they select, the information infrastructure must meet one common requirement: full support for massively scalable processing.

Some data integration operations run more efficiently inside or outside of the RDBMS engine. Likewise, not all data integration operations are well suited for the Hadoop environment. A well-designed architecture must be flexible enough to leverage the strengths of each environment in the system (see Figure 3).

Run in the ETL grid

Advantages: Exploit ETL MPP engine; exploit commodity hardware and storage; exploit grid to consolidate SMP servers; perform complex transforms (data cleansing) that can’t be pushed into the RDBMS; free up capacity on RDBMS server; process heterogeneous data sources; ETL server faster for some processes.
Disadvantages: ETL server slower for some processes (data already stored in relational tables); may require extra hardware (low-cost hardware).

Run in the database

Advantages: Exploit database MPP engine; minimize data movement; leverage database for joins/aggregations; works best when data is already clean; free up cycles on ETL server; use excess capacity on RDBMS server; database faster for some processes.
Disadvantages: Expensive hardware and storage; degradation of query SLAs; not all ETL logic can be pushed into the RDBMS (with ETL tool or hand coding); can’t exploit commodity hardware; usually requires hand coding; limitations on complex transformations; limited data cleansing; database slower for some processes.

Run in Hadoop

Advantages: Exploit MapReduce MPP engine; exploit commodity hardware and storage; free up capacity on database server; support processing of unstructured data; exploit Hadoop capabilities for persisting data (such as updating and indexing); enables low-cost archiving of history data (not stored in the database).
Disadvantages: Can require complex programming; MapReduce will usually be much slower than a parallel database or a scalable ETL tool; risk: Hadoop is still a young technology.

Figure 3. Big data integration requires a balanced approach that can leverage the strength of any environment.
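As a thought experiment on the balanced placement shown in Figure 3, the sketch below encodes a few of those trade-offs as simple heuristics. The rules, thresholds and attribute names are illustrative assumptions, not IBM guidance or product behavior.

```python
# Illustrative-only heuristic for choosing where a data integration job runs.
# The rules loosely mirror the advantages/disadvantages listed in Figure 3.

def choose_engine(job):
    """job: dict with boolean keys 'data_in_rdbms', 'needs_cleansing',
    'unstructured_sources', 'simple_sql_transform'."""
    if job["unstructured_sources"]:
        return "hadoop"          # commodity storage, unstructured data support
    if job["data_in_rdbms"] and job["simple_sql_transform"]:
        return "database"        # minimize data movement, exploit DB MPP engine
    if job["needs_cleansing"]:
        return "etl_grid"        # complex transforms that can't be pushed down
    return "etl_grid"            # default: scalable ETL engine

example = {"data_in_rdbms": True, "needs_cleansing": False,
           "unstructured_sources": False, "simple_sql_transform": True}
print(choose_engine(example))    # "database"
```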

Here are three important guidelines to follow when optimizing big data integration workloads:

1. Push big data integration processing to the data instead of pushing the data to the processing: Specify appropriate processes that can be executed in the RDBMS, in Hadoop or in the ETL grid.
2. Avoid hand coding: Hand coding is expensive and does not effectively support rapid or frequent changes. It also doesn’t support the automated collection of design and operational metadata that is critical for data governance.
3. Do not maintain separate silos of integration development for the RDBMS, Hadoop and the ETL grid: This serves no practical purpose and becomes tremendously expensive to support. You should be able to build a job once and run it in any of the three environments.

Processes best suited to Hadoop

Hadoop platforms comprise two primary components: a distributed, fault-tolerant file system called the Hadoop Distributed File System (HDFS), and a parallel processing framework called MapReduce.

The HDFS platform is very good at processing large sequential operations, where a “slice” of data read is often 64 MB or 128 MB. Generally, HDFS files are not partitioned or ordered unless the application loading the data manages this. Even if the application can partition and order the resulting data slices, there is no way to guarantee where that slice will be placed in the HDFS system. This means there is no good way to manage data collocation in this environment. Data collocation is critical because it ensures data with the same join keys winds up on the same nodes, and therefore the process is both high-performing and accurate.

While there are ways to accommodate the lack of support for data collocation, they tend to be costly—typically requiring extra processing and/or restructuring of the application. HDFS files are also immutable (read only), and processing an HDFS file is similar to running a full table scan in that most often all the data is processed. This should immediately raise a red flag for operations such as joining two very large tables, since the data will likely not be collocated on the same Hadoop node.

MapReduce Version 1 is a parallel processing framework that was not specifically designed for processing large ETL workloads with high performance. By default, data can be repartitioned or re-collocated between the map and the reduce phase of processing. To facilitate recovery, the data is landed on the node running the map operation before being shuffled and sent to the reduce operation.

MapReduce contains facilities to move smaller reference data structures to each map node for some validation and enhancement operations. The entire reference file is moved to each map node, which makes this approach appropriate only for smaller reference data structures. If you are hand coding, you must account for these processing flows, so it is best to adopt tools that generate code to push data integration logic down into MapReduce (also known as ETL pushdown).
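To make the reference-data pattern above concrete, here is a minimal Hadoop Streaming-style mapper in Python that performs a map-side (replicated) join: a small lookup file shipped to every map task enriches each incoming record. This is a hand-coded sketch of the general technique, not the code a tool such as InfoSphere DataStage would generate; the file names and field layout are assumptions for illustration.

```python
#!/usr/bin/env python
# enrich_mapper.py - map-side join for Hadoop Streaming (illustrative sketch).
# A small reference file (country_codes.tsv) is shipped to every map task,
# loaded into memory, and used to enrich each input record; no reduce phase
# or shuffle is needed because only the small side of the join is replicated.
import sys

def load_reference(path="country_codes.tsv"):
    lookup = {}
    with open(path) as f:
        for line in f:
            code, name = line.rstrip("\n").split("\t")
            lookup[code] = name
    return lookup

def main():
    ref = load_reference()
    for line in sys.stdin:                     # e.g. "cust_id<TAB>country_code"
        cust_id, code = line.rstrip("\n").split("\t")
        country = ref.get(code, "UNKNOWN")     # validation/enhancement step
        print("\t".join([cust_id, code, country]))

if __name__ == "__main__":
    main()
```

A job like this would typically be submitted with the streaming jar, distributing the lookup file to each task, along the lines of `hadoop jar hadoop-streaming.jar -files enrich_mapper.py,country_codes.tsv -mapper "python enrich_mapper.py" -reducer NONE -input /data/customers -output /data/customers_enriched` (exact jar name, options and paths depend on the distribution). Joining two very large files, by contrast, forces a reduce-side join and the repartitioning and shuffling described above.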

Using ETL pushdown processing in Hadoop (regardless of the tool doing the pushing) can create a situation where a nontrivial portion of the data integration processing must continue to run in the ETL engine and not on MapReduce. This is true for several reasons:

● More complex logic cannot be pushed into MapReduce
● MapReduce has significant performance limitations
● Data is typically stored in HDFS in a random sequential manner

All of these factors suggest that big data integration in a Hadoop environment requires three components for high-performance workload processing:

1) A Hadoop distribution
2) A shared-nothing, massively scalable ETL platform (such as the one offered by IBM InfoSphere Information Server)
3) ETL pushdown capability into MapReduce

All three components are required because a large percentage of data integration logic cannot be pushed into MapReduce without hand coding, and because MapReduce has known performance limitations.

Critical success factor: Consider data integration workload processing speeds

The InfoSphere Information Server shared-nothing, massively parallel architecture is optimized for processing large data integration workloads efficiently with high performance. IBM InfoSphere DataStage®—a part of InfoSphere Information Server that integrates data across multiple systems using a high-performance parallel framework—can process typical data integration workloads 10 to 15 times faster than MapReduce.2

InfoSphere DataStage also offers balanced optimization for the Hadoop environment. Balanced optimization generates Jaql code to run natively in the MapReduce environment. Jaql comes with an optimizer that will analyze the generated code and optimize it into a map component and a reduce component. This automates a traditionally complex development task and frees the developer from worrying about the MapReduce architecture.

InfoSphere DataStage can run directly on the Hadoop nodes rather than on a separate node in the configuration, which some vendor implementations require. This capability helps reduce network traffic when coupled with IBM General Parallel File System (GPFS™)-FPO, which provides a POSIX-compliant storage subsystem in the Hadoop environment. A POSIX file system allows ETL jobs to directly access data stored in Hadoop rather than requiring use of the HDFS interface. This environment supports moving the ETL workload into the hardware environment that Hadoop is running on—helping to move the processing to where the data is stored and leveraging the hardware for both Hadoop and ETL processing.

Resource management systems such as IBM Platform™ Symphony can also be used to manage data integration workloads both inside and outside of the Hadoop environment.

This means that although InfoSphere DataStage may not run on the exact same node as the data, it runs on the same high-speed backplane, eliminating the need to move the data out of the Hadoop environment and across slower network connections.

ETL scalability requirements for supporting Hadoop

Many Hadoop software vendors evangelize the idea that any non-scalable ETL tool with pushdown into MapReduce will provide excellent performance and application scale-out for big data integration—but this is simply not true.

Without a shared-nothing, massively scalable ETL engine such as InfoSphere DataStage, organizations will experience functional and performance limitations. More and more organizations are realizing that competing non-scalable ETL tools with pushdown into MapReduce are not capable of providing required levels of performance in Hadoop. They are working with IBM to address this issue because the IBM big data integration solution uniquely supports massive data scalability for big data integration.

Here are some of the cumulative negative effects from overreliance on ETL pushdown:

● ETL comprises a large percentage of the EDW workload. Because of the associated costs, the EDW is a very expensive platform for running ETL workloads.
● ETL workloads cause degradation in query SLAs, and eventually require you to invest in additional, expensive EDW capacity.
● Data is not cleansed prior to being dumped into the EDW and is never cleansed once in the EDW environment, promoting poor data quality.
● The organization continues to rely heavily on manual coding of SQL scripts for data transformations.
● Adding new data sources or modifying existing ETL scripts is expensive and takes a long time, limiting the ability to respond quickly to new requirements.
● Data transformations are relatively simple because more complex logic cannot be pushed into the RDBMS using an ETL tool.
● Data quality suffers.
● Critical tasks such as data profiling are not automated—and in many cases are not performed at all.
● No meaningful data governance (data stewardship, data lineage, impact analysis) is implemented, making it more difficult and expensive to respond to regulatory requirements and have confidence in critical business data.

In contrast, organizations adopting massively scalable data integration platforms that optimize big data integration workloads minimize potential negative effects, leaving them in a better position to transform their business with big data.

Best practices for big data integration

Once you’ve decided to adopt Hadoop for your big data initiatives, how do you implement big data integration projects while protecting yourself against Hadoop variability?

Working with numerous early adopters of Hadoop technology, IBM has identified five fundamental big data integration best practices. These five principles represent best-of-breed approaches for successful big data integration initiatives:

1. Avoid hand coding anywhere for any purpose
2. One data integration and governance platform for the enterprise
3. Massively scalable data integration available wherever it needs to run
4. World-class data governance across the enterprise
5. Robust administration and operations control across the enterprise

“When in doubt, use higher-level tools whenever possible.”

—“Large-Scale ETL With Hadoop,” Strata+Hadoop World 2012 presentation given by Eric Sammer, Principal Solution Architect, Cloudera4

Best practice #1: Avoid hand coding anywhere for any purpose

Over the past two decades, large organizations have recognized the many advantages of replacing hand coding with commercial data integration tools. The debate between hand coding versus data integration tooling has been settled, and many technology analysts have summarized the significant ROI advantages3 to be realized from adoption of world-class data integration software.

The first best practice is to avoid hand coding anywhere, for any aspects of big data integration. Instead, take advantage of graphical user interfaces available with commercial data integration software to support activities such as:

● Data access and movement across the enterprise
● Data integration logic
● Assembling data integration jobs from logic objects
● Assembling larger workflows
● Data governance
● Operational and administrative management

By adopting this best practice, organizations can exploit the proven productivity, cost, time to value, and robust operational and administrative control advantages of commercial data integration software while avoiding the negative impact of hand coding (see Figure 4).

[Figure 4 contrasts hand coding with data integration tools. The graphic depicts parallel reads from an HDFS file, transform/restructure steps, a join of two HDFS files, an IBM PureData™ System and creation of a new, fully parallelized HDFS file. It notes that a pre-built data integration solution can streamline the creation of data integration jobs from logic objects and can help map and manage data governance requirements across the enterprise, whereas diverse data access and integration requirements from across the enterprise spawn a complex assortment of UIs when hand coded.

Development using data integration tools: 2 days to write; graphical; self-documenting; reusable; more maintainable; improved performance.
Development using hand coding: 30 man days to write; almost 2,000 lines of code (71,000 characters); no documentation; difficult to reuse; difficult to maintain.
Result: 87 percent savings in development costs compared to hand coding. Source for hand coding and tooling results: IBM pharmaceutical customer example.]

Figure 4. Data integration software provides multiple GUIs to support various activities. These GUIs replace complex hand coding and save organizations significant amounts of development costs.

Best practice #2: One data integration and governance platform for the enterprise

Overreliance on pushing ETL into the RDBMS (due to a lack of scalable data integration software tooling) has prevented many organizations from replacing SQL script hand coding and establishing meaningful data governance across the enterprise. Nevertheless, they recognize there are huge cost savings to be had by moving large ETL workloads from the RDBMS to Hadoop. However, moving from a silo of ETL hand coding in the RDBMS to a new silo of hand coding of ETL in Hadoop only doubles down on high costs and long lead times.

Deploying a single data integration platform provides the opportunity for organizational transformation by the ability to:

● Build a job once and run it anywhere on any platform in the enterprise without modification
● Access, move and load data between a variety of sources and targets across the enterprise
● Support a variety of data integration paradigms, including batch processing; federation; change data capture; SOA enablement of data integration tasks; real-time integration with transactional integrity; and/or self-service data integration for business users

It also provides an opportunity to establish world-class data governance, including data stewardship, data lineage and cross-tool impact analysis.

Best practice #3: Massively scalable data integration available wherever it needs to run

Hadoop offers significant potential for the large-scale, distributed processing of data integration workloads at extremely low cost. However, clients need a massively scalable data integration solution to realize the potential advantages that Hadoop can deliver.

[Figure 5: Design a job once, then run and scale it anywhere. Outside the Hadoop environment — Case 1: InfoSphere Information Server parallel engine running against any traditional data source; Case 2: Push processing into a parallel database; Case 3: Move and process data in parallel between environments. Within the Hadoop environment — Case 4: Push processing into MapReduce; Case 5: InfoSphere Information Server parallel engine running against HDFS without MapReduce.]

Figure 5. Scalable big data integration must be available for any environment.
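To give a feel for the "design once, run anywhere" idea in Figure 5, the sketch below separates a job's logical definition from the engine chosen at run time. The class names and engine labels are illustrative assumptions; they do not reflect the actual InfoSphere Information Server design or APIs.

```python
# Illustrative separation of job design from execution environment.
from dataclasses import dataclass

@dataclass
class JobDesign:
    """Logical definition only: no knowledge of where it will run."""
    name: str
    source: str          # e.g. "db2://sales" or "hdfs:///raw/sales"
    target: str
    steps: tuple         # ordered transform/cleanse/enrich step names

ENGINES = ("etl_grid", "rdbms_pushdown", "mapreduce_pushdown", "hdfs_parallel")

def run(job: JobDesign, engine: str, degree_of_parallelism: int) -> str:
    # In a real platform the same design would be compiled and optimized for
    # the chosen engine; here we only report the requested execution plan.
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine}")
    return (f"{job.name}: {len(job.steps)} steps on {engine} "
            f"with {degree_of_parallelism}-way parallelism")

job = JobDesign("sales_load", "hdfs:///raw/sales", "edw.sales_fact",
                ("transform", "cleanse", "enrich"))
print(run(job, "hdfs_parallel", 64))   # same design, different engines and scale
```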

Scenarios for running the data integration workload may include:

● The parallel RDBMS
● The grid without the RDBMS or Hadoop
● In Hadoop, with or without pushdown into MapReduce
● Between the Hadoop environment and the outside environment, extracting data volumes on one side, processing and transforming the records in flight, and loading the records on the other side

To achieve success and sustainability—and to keep costs low—an effective big data integration solution must flexibly support each of these scenarios. Based on IBM experience with big data customers, InfoSphere Information Server currently is the only commercial data integration software platform that supports all of these scenarios, including pushdown of data integration logic into MapReduce.

There are many myths circulating within the industry about running ETL tools in Hadoop for big data integration. The popular wisdom seems to be that combining any non-scalable ETL tool and Hadoop provides all required massively scalable data integration processing. In reality, MapReduce suffers several limitations for processing large-scale data integration workloads:

● Not all data integration logic can be pushed into MapReduce using the ETL tool. Based on experiences with its clients, IBM estimates that about 50 percent of data integration logic cannot be pushed into MapReduce.
● Users have to engage in complex hand coding to run more complex data integration logic in Hadoop, or restrict the process to running relatively simple transformations in MapReduce.
● MapReduce has known performance limitations for processing large data integration workloads, as it was designed to support finely grained fault tolerance at the expense of high-performance processing.

Best practice #4: World-class data governance across the enterprise

Most large organizations have found it difficult, if not impossible, to establish data governance across the enterprise. There are several reasons for this. For example, business users manage data using business terminology that is familiar to them. Until recently, there has been no mechanism for defining, controlling and managing this business terminology and linking it to IT assets.

Also, neither business users nor IT staff have a high degree of confidence in their data, and may be uncertain of its origins and/or history. The technology for creating and managing data governance through capabilities such as data lineage and cross-tool impact analysis did not exist, and manual methods involve overwhelming complexity. Industry regulatory requirements only add to the complexity of managing governance. Finally, overreliance on hand coding for data integration makes it difficult to implement data governance throughout an organization.

It is essential to establish world-class data governance—with a fully governed data lifecycle for all key data assets—that includes the Hadoop environment but is not limited to it. Here are suggested steps for a comprehensive data lifecycle:

● Find: Leverage terms, labels and collections to find governed, curated data sources
● Curate: Add labels, terms, custom properties to relevant assets
● Collect: Use collections to capture assets for a specific analysis or governance effort
● Collaborate: Share collections for additional curation and governance
● Govern: Create and reference information governance policies and rules; apply data quality, masking, archiving and cleansing to data
● Offload: Copy data in one click to HDFS for analysis for warehouse augmentation
● Analyze: Analyze offloaded data
● Reuse and trust: Understand how data is being used with lineage for analysis and reports (see the lineage sketch at the end of this section)

With a comprehensive data governance initiative in place, you can build an environment that helps ensure all Hadoop data is of high quality, secure and fit for purpose. It enables business users to answer questions such as:

● Do I understand the content and meaning of this data?
● Can I measure the quality of this information?
● Where does the data in my report come from?
● What is being done to the data inside of Hadoop?
● Where was it before reaching our Hadoop data lake?

Best practice #5: Robust administration and operations control across the enterprise

Organizations adopting Hadoop for big data integration must expect robust, mainframe-class administration and operations management, including:

● An operations console interface that provides quick answers for anyone operating the data integration applications, developers and other stakeholders as they monitor the runtime environment
● Workload management to allocate resource priority to certain projects in a shared-services environment and queue workloads on a busy system
● Performance analysis for insight into resource consumption to identify bottlenecks and determine when systems may need more resources
● Building workflows that include Hadoop-based activities defined through Oozie directly in the job sequence, as well as other data integration activities

Administration management for big data integration must include:

● An integrated web-based installer for all capabilities
● High-availability configurations for meeting 24/7 requirements
● Flexible deployment options to deploy new instances or expand existing instances on expert, optimized hardware systems
● Centralized authentication, authorization and session management
● Audit logging of security-related events to promote Sarbanes-Oxley Act compliance
● Lab certification for various Hadoop distributions
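As one way to picture the lineage element of the data lifecycle above, the sketch below records which sources and which job produced a derived data set, so a report's origin can be traced back step by step. The structure is an illustrative assumption, not the metadata model of any IBM product.

```python
# Illustrative lineage record: each derived asset remembers its inputs and
# the job that produced it, enabling a simple "where did this come from?" walk.
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    produced_by: str = ""                        # job name, empty for raw sources
    inputs: list = field(default_factory=list)   # upstream Asset objects

def lineage(asset, depth=0):
    """Print the upstream chain for an asset, most derived first."""
    origin = f" <- {asset.produced_by}" if asset.produced_by else " (raw source)"
    print("  " * depth + asset.name + origin)
    for parent in asset.inputs:
        lineage(parent, depth + 1)

crm = Asset("crm_extract")
weblogs = Asset("hdfs:///raw/weblogs")
cleansed = Asset("customers_cleansed", "cleanse_job", [crm])
report_src = Asset("churn_report_table", "churn_load_job", [cleansed, weblogs])
lineage(report_src)   # walks back to the raw CRM extract and web logs
```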

Best practices for big data integration set a foundation for success

Organizations are looking to big data initiatives to help them cut costs, increase revenue and gain first-mover advantages. Hadoop technology supports new processes and architectures that enable business transformation, but certain big data challenges and opportunities must be addressed before this can happen.

IBM recommends building a big data integration architecture that is flexible enough to leverage the strengths of the RDBMS, the ETL grid and Hadoop environments. Users should be able to construct an integration workflow once and then run it in any of these three environments.

The five big data integration best practices outlined in this paper represent best-of-breed approaches that set your projects up for success. Following these guidelines can help your organization minimize risks and costs and maximize ROI for your Hadoop projects.

For more information

To learn more about the big data integration best practices and IBM integration solutions, please contact your IBM representative or IBM Business Partner, or visit: ibm.com/software/data/integration

Additionally, IBM Global Financing can help you acquire the software capabilities that your business needs in the most cost-effective and strategic way possible. We’ll partner with credit-qualified clients to customize a financing solution to suit your business and development goals, enable effective cash management, and improve your total cost of ownership. Fund your critical IT investment and propel your business forward with IBM Global Financing. For more information, visit: ibm.com/financing

© Copyright IBM Corporation 2014

IBM Corporation Software Group Route 100 Somers, NY 10589

Produced in the United States of America September 2014

IBM, the IBM logo, ibm.com, DataStage, GPFS, InfoSphere, Platform, and PureData are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

It is the user’s responsibility to evaluate and verify the operation of any other products or programs with IBM products and programs.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

The client is responsible for ensuring compliance with laws and regulations applicable to it. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the client is in compliance with any law or regulation. Actual available storage capacity may be reported for both uncompressed and compressed data and will vary and may be less than stated.

1 Intel Corporation. “Extract, Transform, and Load Big Data With Apache Hadoop.” July 2013. http://intel.ly/UX1Umk

2 Measurements produced by IBM while working on-site with a customer deployment.

3 International Technology Group. “Business Case for Enterprise Data Integration Strategy: Comparing IBM InfoSphere Information Server and Open Source Tools.” February 2013. ibm.com/common/ssi/cgi-bin/ssialias?infotype=PM&subtype=XB&htmlfid=IME14019USEN

4 “Large-Scale ETL With Hadoop,” Strata+Hadoop World 2012 presentation given by Eric Sammer, Principal Solution Architect, Cloudera. www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-large-scale-etl-with-hadoop.html

Please Recycle

IMW14791-USEN-00