February 2013

Planning Guide
Getting Started with Big Data
Steps IT Managers Can Take to Move Forward with Apache Hadoop* Software

Why You Should Read This Document

This planning guide provides valuable information and practical steps for IT managers who want to plan and implement big data analytics initiatives and get started with Apache Hadoop* software, including:

• The IT landscape for big data and the challenges and opportunities associated with this disruptive force
• Introduction to Hadoop* software, the emerging standard for gaining insight from big data, including processing and analytic tools (Apache Hadoop MapReduce, Apache HBase* software)
• Guidance on how to get the most out of Hadoop software with a focus on areas where Intel can help, including infrastructure technology, optimizing, and tuning

• Five basic “next steps” and a checklist to help IT managers move forward with planning and implementing their own Hadoop project

Contents

• The IT Landscape for Big Data Analytics
• What Big Data Analytics Is (and Isn’t)
• Emerging Technologies for Managing Big Data
• Deploying Hadoop in Your Data Center
• Five Steps and a Checklist: Get Started with Your Big Data Analytics Project
• Intel Resources for Learning More

Intel IT Center Planning Guide | Big Data

The IT Landscape for Big Data Analytics

The buzz about big data analytics is growing louder. Today, every organization across the globe is faced with an unprecedented growth in data. Imagine this: The digital universe of data was expected to expand to 2.7 zettabytes (ZB) by the end of 2012. Then it’s predicted to double every two years, reaching 8 ZB of data by 2015.1 It’s hard to conceptualize this quantity of information, but here’s one way: If the U.S. Library of Congress holds 462 terabytes (TB) of digital data, then 8 ZB is equivalent to almost 18 million Libraries of Congress.2 That’s really big data.

Big data analytics represents a significant challenge for IT organizations—and yet according to an Intel survey of 200 IT managers, 84 percent are already analyzing unstructured data, and 44 percent of those that aren’t expect to do so by 2014.5 The potential for big data is irresistible.

The Value of Big Data

What exactly is “big data,” and where is it coming from? Big data refers to huge data sets that are orders of magnitude larger (volume); more diverse, including structured, semistructured, and unstructured data (variety); and arriving faster (velocity) than you or your organization has had to deal with before. This flood of data is generated by connected devices—from PCs and smart phones to sensors such as RFID readers and traffic cams. Plus, it’s heterogeneous and comes in many formats, including text, document, image, video, and more.

The three Vs characterize what big data is all about, and also help define the major issues that IT needs to address:

• Volume. The massive scale and growth of unstructured data outstrips traditional storage and analytical solutions.
• Variety. Traditional data management processes can’t cope with the heterogeneity of big data—or “shadow” or “dark data,” such as access traces and Web search histories.
• Velocity. Data is generated in real time, with demands for usable information to be served up immediately.
The real value of big data is in the insights it produces when analyzed—finding patterns, deriving meaning, making decisions, and ultimately responding to the world with intelligence.

A Mountain of Data

What about the 8 ZB of data projected for 2015? Nearly 15 billion connected devices—including 3 billion Internet users plus machine-to-machine connections—will contribute to this ocean of data.3 Big data is measured in terabytes, petabytes, and even exabytes. Put it all in perspective with this handy conversion chart.

1 Kilobyte (KB) = 1,000 Bytes
1 Megabyte (MB) = 1,000,000 Bytes
1 Gigabyte (GB) = 1,000,000,000 Bytes
1 Terabyte (TB) = 1,000,000,000,000 Bytes
1 Petabyte (PB) = 1,000,000,000,000,000 Bytes
1 Exabyte (EB) = 1,000,000,000,000,000,000 Bytes
1 Zettabyte (ZB) = 1,000,000,000,000,000,000,000 Bytes
1 Yottabyte (YB) = 1,000,000,000,000,000,000,000,000 Bytes

Using Big Data Analytics to Win

Big data is a disruptive force, presenting opportunities as well as challenges to IT organizations. A study by the McKinsey Global Institute established that data is as important to organizations as labor and capital.4 The study concluded that if organizations can effectively capture, analyze, visualize, and apply big data insights to their business goals, they can differentiate themselves from their competitors and outperform them in terms of operational efficiency and the bottom line.
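The “almost 18 million Libraries of Congress” comparison can be checked against the conversion chart with a few lines of Python (a quick sketch using the decimal units above; the 462 TB figure is the guide’s own estimate):

```python
# Verify the "almost 18 million Libraries of Congress" comparison
# using the decimal byte units from the conversion chart.
TB = 10**12  # 1 terabyte
ZB = 10**21  # 1 zettabyte

library_of_congress = 462 * TB   # digital holdings, per the guide
digital_universe_2015 = 8 * ZB   # projected data volume for 2015

libraries = digital_universe_2015 / library_of_congress
print(f"{libraries:,.0f} Libraries of Congress")  # about 17.3 million
```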

What Big Data Analytics Is (and Isn’t)

Big data analytics is clearly a game changer, enabling organizations to gain insights from new sources of data that haven’t been mined in the past. Here’s more about what big data analytics is … and isn’t.

Big Data Analytics Is …

• A technology-enabled strategy for gaining richer, deeper insights into customers, partners, and the business—and ultimately gaining competitive advantage.
• Working with data sets whose size and variety is beyond the ability of typical database software to capture, store, manage, and analyze.
• Processing a steady stream of real-time data in order to make time-sensitive decisions faster than ever before.
• Distributed in nature. Analytics processing goes to where the data is for greater speed and efficiency.
• A new paradigm in which IT collaborates with business users and “data scientists” to identify and implement analytics that will increase operational efficiency and solve new business problems.
• Moving decision making down in the organization and empowering people to make better, faster decisions in real time.

Big Data Analytics Isn’t …

• Just about technology. At the business level, it’s about how to exploit the vastly enhanced sources of data to gain insight.
• Only about volume. It’s also about variety and velocity. But perhaps most important, it’s about value derived from the data.
• Generated or used only by huge online companies like Google or Amazon anymore. While Internet companies may have pioneered the use of big data at web scale, applications touch every industry.
• About “one-size-fits-all” traditional relational databases built on shared disk and memory architecture. Big data analytics uses a grid of computing resources for massively parallel processing (MPP).
• Meant to replace relational databases or the data warehouse. Structured data continues to be critically important to companies. However, traditional systems may not be suitable for the new sources and contexts of big data.

The Purpose of This Guide

The remainder of this guide will describe emerging technologies for managing and analyzing big data, with a focus on getting started with the Apache Hadoop* open-source software framework, which enables distributed processing of large data sets across clusters of computers. We’ll also provide five practical steps you can take to begin planning your own big data analytics project using this technology.

Your New Best Friend: The Data Scientist

A new kind of professional is helping organizations make sense of the massive streams of digital information: the data scientist. Data scientists are responsible for modeling complex business problems, discovering business insights, and identifying opportunities. They bring to the job:

• Skills for integrating and preparing large, varied data sets
• Advanced analytics and modeling skills to reveal and understand hidden relationships
• Business knowledge to apply context
• Communication skills to present results

Data science is an emerging field. Demand is high, and finding skilled personnel is one of the major challenges associated with big data analytics. A data scientist may reside in IT or the business—but either way, he or she is your new best friend and collaborator for planning and implementing big data analytics projects.

Emerging Technologies for Managing Big Data

For organizations to realize the full potential of big data, they must find a new approach to capturing, storing, and analyzing data. Traditional tools and infrastructure aren’t as efficient working with larger and more varied data sets coming in at high velocity. New technologies are emerging to make big data analytics scalable and cost-effective. One new approach uses the power of a distributed grid of computing resources and “shared nothing” architecture, distributed processing frameworks, and nonrelational databases to redefine the way data is managed and analyzed.

Shared Nothing Architecture for Massively Scalable Systems

The new shared nothing architecture can scale with the huge volumes, variety, and speed requirements of big data by distributing the work across dozens, hundreds, or even thousands of commodity servers that process the data in parallel. First implemented by large community research projects such as SETI@home and online services such as Google* and Amazon*, each node is independent and stateless, so that shared nothing architecture scales easily—simply add another node—enabling systems to handle growing processing loads.
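The “simply add another node” property can be illustrated with a toy hash-partitioning sketch in Python. This is an illustration only, not Hadoop code; the node counts and key names are hypothetical:

```python
# Toy illustration of shared nothing partitioning: each record is
# routed to exactly one node by hashing its key, so nodes share no
# state and adding nodes simply spreads the same workload wider.
from collections import defaultdict
from zlib import crc32


def route(records, num_nodes):
    """Assign each (key, value) record to a node; nodes share no state."""
    shards = defaultdict(list)
    for key, value in records:
        shards[crc32(key.encode()) % num_nodes].append((key, value))
    return shards


records = [(f"sensor-{i}", i) for i in range(10_000)]
for n in (4, 8):  # "simply add another node"
    shards = route(records, n)
    print(n, "nodes, largest shard:", max(len(s) for s in shards.values()))
```

Because no shard depends on any other, doubling the node count halves each node’s share of the work without any cross-node coordination.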

[Figure: The convergence of three architectural evolutions. Hardware architecture: single processor, symmetric multiprocessing (SMP), multicore computing, massively parallel processing (MPP), distributed grid. Data architecture: local flat files, hierarchical, relational, partitioned/shared nothing, complex-flexible/nonrelational. Application architecture: sequential, multitasking/multithreaded, parallel algorithms, parallel layers, distributed frameworks.]

Shared nothing architecture is possible because of the convergence of advances in hardware, data management, and analytic applications technologies. Source: “Data rEvolution.” CSC Leading Edge Forum (2011).

Processing is pushed out to the nodes where the data resides. This is completely different from a traditional approach, which retrieves data for processing at a central point. Ultimately, the data must be reintegrated to deliver meaningful results. Distributed processing software frameworks make the computing grid work by managing and pushing the data across machines, sending instructions to the networked servers to work in parallel, collecting individual results, and then reassembling them for the payoff.

Distributed Processing Frameworks and the Emergence of Apache Hadoop

Hadoop* is evolving as the best new approach to big data analytics. An outgrowth of the Apache Nutch* open-source Web search project,6 Hadoop is a software framework that provides a simple programming model to enable distributed processing of large data sets on clusters of computers. The framework easily scales on hardware such as servers based on Intel® Xeon® processors.

Hadoop software is a complete open-source framework for big data analytics. It includes a distributed file system, a parallel processing framework called Apache Hadoop MapReduce, and several components that support the ingestion of data, coordination of workflows, management of jobs, and monitoring of the cluster. Hadoop is more cost-effective at handling large unstructured data sets than traditional approaches.

Hadoop offers several key advantages for big data analytics, including:

• Store any data in its native format. Because data does not require translation to a specific schema, no information is lost.
• Scale for big data. Hadoop is already proven to scale by companies like Facebook and Yahoo!, which run enormous implementations.
• Deliver new insights. Big data analytics is uncovering hidden relationships that have been difficult, time consuming, and expensive—or even impossible—to address using traditional data mining approaches.
• Reduce costs. Hadoop open-source software runs on standard servers and has a lower cost per terabyte for storage and processing. Storage can be added incrementally as needed, and hardware can be added or swapped in or out of a cluster.
• Higher availability. Hadoop recovers from hardware, software, and system failures by providing fault tolerance through replication of data and failover across compute nodes.
• Lower risk. The Hadoop community is active and diverse, with developers and users from many industries across the globe. Hadoop is a technology that will continue to advance.

Big Data and the Cloud

Big data requires clusters of servers to support the tools that process large volumes, high velocity, and varied formats of big data. Clouds are already deployed on pools of servers and can scale up or down as needed for big data. As a result, cloud computing offers a cost-effective way to support big data technologies and the advanced analytics applications that can drive business value.

Find out more about how big data can work in the cloud in Big Data in the Cloud: Converging Technologies at intel.com/content/www/us/en/big-data/big-data-cloud-technologies-brief.html.
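The scatter-process-reassemble pattern that distributed frameworks manage can be mimicked on a single machine with Python’s multiprocessing module; this is a local stand-in for a cluster, not Hadoop itself:

```python
# Mimic a distributed processing framework on one machine:
# split the data, process chunks in parallel workers, then
# reassemble the partial results into a final answer.
from multiprocessing import Pool


def partial_sum(chunk):
    return sum(chunk)


if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # "push" work to 4 workers
    with Pool(4) as pool:
        partials = pool.map(partial_sum, chunks)  # parallel processing
    total = sum(partials)                     # reassemble for the payoff
    print(total)  # 499999500000
```

A real framework adds what this sketch omits: moving the chunks to where the data already lives, monitoring workers, and reexecuting failed tasks.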

Apache Hadoop* Stack

[Figure: The Apache Hadoop* software stack. Ambari (provisioning, managing, and monitoring Hadoop clusters); Mahout* (machine learning); R (statistics); Pig* (data flow); Hive* (data warehouse); Oozie (workflow); a data collector for relational databases; Hadoop MapReduce (distributed processing framework); HBase* (distributed table store); HDFS* (Hadoop Distributed File System); ZooKeeper* (coordination); Flume* | Chukwa* (log data collectors).]

The Hadoop* software stack includes a number of components.

Does Hadoop* Software Replace My Existing Database Systems?

Hadoop* software is a massively scalable storage and data processing system—it’s not a database. In fact, it supplements your existing systems by handling data that’s typically a problem for them. Hadoop can simultaneously absorb and store any type of data from a variety of sources, aggregate and process it in arbitrary ways, and deliver it wherever it’s needed—which could be serving up real-time transactional data or providing interactive business intelligence via your existing systems.

What about Apache Hadoop MapReduce?

MapReduce is the software programming framework in the Hadoop stack that simplifies processing of big data sets and gives programmers a common method for defining and orchestrating complex processing tasks across clusters of computers. MapReduce applications work like this: The map task splits a data set into independent chunks to be processed in parallel. The map outputs are sorted and then submitted to the reduce task. Both input and output are stored in the Apache* Hadoop Distributed File System (HDFS*) or other storage such as Amazon S3, part of Amazon Web Services. Typically the data is processed and stored on the same node, making it more efficient to schedule tasks where data already resides and resulting in high aggregate bandwidth across the cluster. MapReduce simplifies the application programmer’s work by taking care of scheduling jobs, monitoring activity, and reexecuting failed tasks.
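The map and reduce phases described above can be sketched in pure Python as a word count, the canonical MapReduce example. This illustrates the programming model only; it is not the Hadoop API:

```python
# A word count expressed in MapReduce style: map emits (word, 1)
# pairs, the pairs are sorted and grouped by key (the "shuffle"),
# and reduce sums the counts for each word.
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_phase(pairs):
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


lines = ["big data big insight", "data in data out"]
shuffled = sorted(map_phase(lines))   # sort map outputs by key
print(dict(reduce_phase(shuffled)))
# {'big': 2, 'data': 3, 'in': 1, 'insight': 1, 'out': 1}
```

In Hadoop the same two functions run on many nodes at once, with the framework handling the sort, the shuffle across the network, and failure recovery.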

Apache Hadoop* at a Glance

Apache Hadoop* is a community-led effort that includes three key development subprojects as well as other related initiatives.

Key Development Subprojects

Apache* Hadoop Distributed File System (HDFS*): The primary storage system that uses multiple replicas of data blocks, distributes them on nodes throughout a cluster, and provides high-throughput access to application data

Apache Hadoop MapReduce: A programming model and software framework for applications that perform distributed processing of large data sets on compute clusters

Apache Hadoop Common: Utilities that support the Hadoop framework, including FileSystem (an abstract base class for a generic file system), remote-procedure call (RPC), and serialization libraries

Other Related Hadoop Projects

Apache Avro* A data serialization system

Apache Cassandra* A scalable, multimaster database with no single point of failure

Apache Chukwa* A data collection system for monitoring large distributed systems built on HDFS and MapReduce; includes a toolkit for displaying, monitoring, and analyzing results

Apache HBase* A scalable, distributed database that supports structured data storage for large tables; used for random, real-time read/write access to big data

Apache Hive* A data warehouse infrastructure that provides data summarization, ad hoc querying, and the analysis of large data sets in Hadoop-compatible file systems

Apache Mahout* A scalable machine learning and data mining library with implementations of a wide range of algorithms, including clustering, classification, collaborative filtering, and frequent-pattern mining

Apache Pig* A high-level data-flow language and execution framework for expressing parallel data analytics

Apache ZooKeeper* A high-performance, centralized coordination service that maintains configuration information and naming and provides distributed synchronization and group services for distributed applications

Source: Apache Hadoop, hadoop.apache.org.

Hear from Apache Hadoop* Experts

One way to learn about Apache Hadoop* software and its components is to hear directly from experts deeply engaged in the open-source community and its development work. Listen to podcast interviews with community leaders for Apache Hadoop MapReduce, Apache* HDFS*, Apache HCatalog, and other components, describing how each works, where it fits in the Hadoop* stack, and plans for continued development. PDFs accompany each podcast.

Hadoop* Adoption

As more and more enterprises recognize the value and advantages associated with big data insights, adoption of Hadoop software is growing. The Hadoop open-source technology stack includes an open-source implementation of MapReduce, HDFS, and the Apache HBase* distributed database that supports large, structured data tables.

After six years of refinements, Apache released the first full production version of Apache Hadoop 1.0 software in January 2012. Among the certified features supported in this version are HBase*, Kerberos security enhancements, and a representational state transfer (RESTful) API to access HDFS.7

Hadoop software can be downloaded from one of the Apache download sites. Because Hadoop software is an open-source, volunteer project, the Hadoop wiki provides information about getting help from the community as well as links to tutorials and user documentation for implementing, troubleshooting, and setting up a cluster.

Accelerating Big Data Analytics Standards

The Open Data Center Alliance (ODCA), an independent IT consortium comprised of global IT leaders from more than 300 companies, recently announced the formation of the Data Services Workgroup to document the most urgent requirements facing IT for data management. The workgroup will focus initially on creating usage model requirements that address security, manageability, and interoperability of emerging big data frameworks with traditional data management and data warehouse solutions. Based on the usage models, workgroup members will develop reference architectures and proofs of concept for commercial distribution with independent software vendors and OEM partners to test deployments and establish solutions for the enterprise market. The alliance will also collaborate with the open-source community to drive benchmarking suites. As technical advisor to ODCA, Intel will play a major role in the development of standards and best practices for big data analytics.

The Hadoop Ecosystem

The Hadoop ecosystem is a complex landscape of vendors and solutions that includes established players and several newcomers. Several vendors offer their own Hadoop distribution, packaging the basic stack with other Hadoop software projects such as Apache Hive*, Apache Pig*, and Apache Chukwa*. Some of these distributions can integrate with data warehouses, databases, and other data management products so that data can move between Hadoop clusters and other environments to expand the pool of data to process or query. Other vendors provide Hadoop management software that simplifies administration and troubleshooting. A third group delivers products that help developers write Hadoop applications, provide search capabilities, or analyze data without using MapReduce. These products sit on top of platform software and include abstraction layers that marry a Structured Query Language (SQL) data warehouse to a Hadoop cluster as well as real-time processing and analytics. Finally, there’s growing interest in offering subscription services via the cloud.

Apache Hadoop* Goes Commercial

Apache Hadoop*-related offerings are in market in several categories. The following vendors are a sample of the growing Hadoop* ecosystem.

Category Vendor/Offering

Integrated Hadoop systems • EMC* Greenplum* • HP* Big Data Solutions • IBM* InfoSphere* • Microsoft* Big Data Solution • Oracle* Big Data Appliance

Hadoop applications and analytical databases with Hadoop connectivity • Datameer* Analytics Solution • Hadapt* Adaptive Analytic Platform* • HP Vertica* Analytics Platform • Karmasphere* Analyst • ParAccel* Analytic Platform • * Data Integration • Splunk* Enterprise* • Teradata* Aster* Solution

Hadoop distributions • Cloudera’s Distribution Including Apache Hadoop (CDH) • EMC Greenplum HD • Hortonworks • IBM InfoSphere BigInsights • Intel® Distribution for Apache Hadoop Software • MapR* M5 Edition • Microsoft Big Data Solution • Platform Computing* MapReduce

Cloud-based solutions • Amazon* Web Services • Google* BigQuery

See Big Data Vendor Spotlights for some of the Intel partners who offer big data solutions.

Note: The Hadoop ecosystem is emerging rapidly. This list is adapted from two sources: Dumbill, Edd. “Big Data Market Survey: Hadoop Solutions.” O’Reilly Radar (January 19, 2012). http://radar.oreilly.com/2012/01/big-data-ecosystem.html; and “Data rEvolution.” CSC Leading Edge Forum (2011). http://assets1.csc.com/lef/downloads/LEF_2011Data_rEvolution.pdf

Two Approaches to Using Hadoop Software for Big Data Analytics

Enterprises are taking two basic approaches to implementing Hadoop.

Hadoop-only deployments. Hadoop deployments are available as open-source software that can be downloaded free from Apache or as distributions from vendors that prepackage the Hadoop framework with certain components and management software to support system administration. Hadoop-only deployments are ideal for building a big data management platform for unstructured data analytics and insight. Open-source tools also make it possible to query structured data using MapReduce applications, HBase, or Hive*.

Hadoop integrated with traditional databases. Organizations with traditional data warehousing and analytics in place can extend their existing platform to include an integrated Hadoop implementation. Connecting existing data management resources to Hadoop software provides an opportunity to tap both structured and unstructured data for insights. For example, analysis of complex call center transcripts can be married to structured data about buying behavior, such as specific SKUs, retail outlets, geographies, and so on. In this case, proprietary connectors are used to move data back and forth from Hadoop to traditional environments.

Intel® Distribution for Apache Hadoop* Software

Intel® Distribution for Apache Hadoop* software (Intel Distribution) includes Apache Hadoop and other software components optimized by Intel with hardware-enhanced performance and security capabilities. Designed to enable a wide range of data analytics on Apache Hadoop, Intel Distribution is optimized for Apache Hive* queries, provides connectors for R* for statistical processing, and enables graph analytics using Intel Graph Builder for Apache Hadoop software, a library to construct large data sets into graphs to help visualize relationships between data. Included in the Intel Distribution, Intel Manager for Apache Hadoop provides a management console that simplifies the deployment, configuration, and monitoring of a Hadoop* deployment.

Intel Distribution is available worldwide today for evaluation. Technical support is provided currently in the United States, China, and Singapore, with other geographies expected later in the year. Find out more about the Intel Distribution.

Deploying Hadoop in Your Data Center

Big data analytics is a technology-enabled strategy that is much more than the hardware and software that support it. Nevertheless, as an IT manager, the responsibility for implementing big data initiatives in your data center will fall to you. Hadoop deployments can have very large infrastructure requirements, and hardware and software choices made at design time can have significant impact on performance and total cost of ownership. Data centers can get the most from their Hadoop deployments by ensuring that the right infrastructure is in place and that Hadoop software is optimized and tuned for best performance.

Put the Right Infrastructure in Place

The Hadoop framework works on the principle of moving computing closer to where the data resides, and the framework typically runs on large server clusters built using standard hardware. This is where the data is stored and processed. The combination of Hadoop infrastructure with standard server platforms provides the foundation for a cost-efficient and high-performance analytics platform for parallel applications.

Setting Up Hadoop System Architecture

Each cluster has one “master node” with multiple slave nodes. The master node uses the NameNode and JobTracker functions to coordinate slave nodes to get the job done. The slaves use the TaskTracker function to manage the jobs scheduled by JobTracker, HDFS to store data, and map and reduce functions for data computation. The basic software stack includes Hive and Pig* for language and compilers, HBase for NoSQL database management, and Apache Sqoop* and Apache Flume* for data and log collection. Apache ZooKeeper* provides centralized coordination for the stack.

The Cost of Big Data Analytics

A 2012 survey from InformationWeek tackles the question of big data economics, finding that budget constraints and other cost-related issues are top barriers for IT managers. Building your own Apache Hadoop* deployment and investing in storage and development resources or implementing a proprietary vendor solution can incur significant costs. While the cloud offers some potential relief, pricing models for public cloud providers may not offer enough. With storage and computing costs continuing to decline, deploying and managing your own Hadoop* clusters may provide the best economics over both public cloud and vendor systems—even adding in the cost of a skilled person to manage the hardware.

Source: Biddick, Michael. “The Big Data Management Challenge.” InformationWeek (April 2012). http://reports.informationweek.com/abstract/81/8766/business-intelligence-and-information-management/research-the-big-data-management-challenge.html

Apache Hadoop* Deployment on a Cluster of Standard Server Nodes

[Figure: A Hadoop cluster. The master node runs the NameNode and JobTracker; each slave node runs a TaskTracker, a DataNode, and map and reduce tasks.]

Source: Intel® Cloud Builders Guide to Cloud Design and Deployment on Intel® Platforms: Apache Hadoop*. Intel (February 2012). intelcloudbuilders.com/docs/Intel_Cloud_Builders_Hadoop.pdf

Operating a Server Cluster

A client submits a job to the master node, which orchestrates with the slaves in the cluster. JobTracker controls the MapReduce job; TaskTrackers report progress back to it. In the event of a failure, JobTracker reschedules the task on the same or a different slave node, whichever is most efficient. HDFS is location aware or rack aware and manages data within the cluster, replicating the data on various nodes for data reliability. If one of the data replicas on HDFS is corrupted, JobTracker, aware of where other replicas are located, can reschedule the task right where a replica resides, decreasing the need to move data from one node to another. This saves network bandwidth and keeps performance and availability high. Once the job is mapped, the output is sorted and divided into several groups, which are distributed to reducers. Reducers may be located on the same node as the mappers or on another node.
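The locality-aware rescheduling described above can be modeled with a toy scheduler. The block and node names are hypothetical; in a real cluster this logic lives in JobTracker and the NameNode’s block map:

```python
# Toy model of HDFS-style locality-aware scheduling: each block is
# replicated on several nodes, so when one replica (or its node)
# fails, the task can be rescheduled on another node that already
# holds a replica instead of copying data over the network.
replicas = {
    "block-1": ["node-a", "node-b", "node-c"],
    "block-2": ["node-b", "node-d", "node-e"],
}


def schedule(block, failed_nodes):
    """Pick a healthy node that already stores a replica of the block."""
    for node in replicas[block]:
        if node not in failed_nodes:
            return node
    raise RuntimeError(f"all replicas of {block} lost")


print(schedule("block-1", {"node-a"}))  # node-b
```

Scheduling onto a surviving replica holder is what saves the network bandwidth the paragraph above describes.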

Operation of an Apache Hadoop* Cluster

[Figure: Operation of an Apache Hadoop* cluster. The client performs metadata operations against the NameNode to get block information and submits the job to the master node. JobTracker assigns tasks to the TaskTrackers, and data is assigned to DataNodes for reads and writes; data blocks are replicated on multiple slave nodes.]

Jobs are orchestrated by the master node and processed on the slave nodes.

Hadoop Infrastructure: Big Data Storage and Networking

Hadoop clusters are enhanced by dramatic improvements in mainstream compute and storage resources and are complemented by 10 gigabit Ethernet (10 GbE) solutions. The increased bandwidth associated with 10 GbE is critical to importing and replicating large data sets across servers. Intel Ethernet 10 Gigabit Converged Network Adapters provide high-throughput connections, and Intel Solid-State Drives (SSDs) are high-performance, high-throughput alternatives to hard drives for raw storage. To enhance efficiency, storage needs to support advanced capabilities such as compression, encryption, automated tiering of data, data deduplication, erasure coding, and thin provisioning—all of which are supported with the Intel Xeon processor E5 family today.

Get the guide to building balanced, cost-effective Hadoop clusters on 10 GbE.
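Of the storage capabilities listed above, compression is the easiest to demonstrate: repetitive data such as server logs shrinks dramatically. A quick sketch with Python’s generic zlib codec (real deployments would use Hadoop-native or hardware-assisted codecs):

```python
# Show why storage-level compression matters for big data: log-like,
# repetitive text compresses to a small fraction of its raw size.
import zlib

log = b"2013-02-01 INFO request served in 12 ms\n" * 10_000
compressed = zlib.compress(log, 6)
ratio = len(compressed) / len(log)
print(f"raw {len(log)} bytes -> {len(compressed)} bytes ({ratio:.1%})")
```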

Intel® Architecture: High-Performance Clusters

Using Intel® Xeon® processor E5 family-based servers as the baseline server platform for the cluster, a team of Intel big data, network, and storage experts measured Apache Hadoop* performance results for various combinations of networking and storage components. In general, the following define “good, better, and best performance” options for Intel-based infrastructure for your big data environment. (Note that certain variables could impact results for your data center.)

Performance | Server | Networking | Storage

Good | Intel Xeon processor E5 family | Gigabit Ethernet (GbE) or 10 GbE | Hard drives

Better | Intel Xeon processor E5 family | 10 GbE | Hard drives and solid-state drives (SSDs) with tiered storage capabilities

Best | Intel Xeon processor E5 family | 10 GbE | SSDs

Get more detail about the performance of each platform combination.

Optimize and Tune for Best Performance

Intel is a major contributor to open-source initiatives such as Linux*, OpenStack*, KVM, and Xen* software. Intel has also devoted resources to Hadoop analysis, testing, and performance characterizations, both internally and with fellow travelers such as HP, Super Micro, and Cloudera. Through these technical efforts, Intel has observed many practical trade-offs in hardware, software, and system settings that have implications in the data center. Designing the solution stack to maximize productivity, limit energy consumption, and reduce total cost of ownership can help you optimize resource utilization while minimizing operational costs.

The settings for the Hadoop environment are a key factor in getting the full benefit from the rest of the hardware and software solutions. Based on extensive benchmark testing in the lab and at customer sites using Intel processor-based architecture, Intel’s optimization and tuning recommendations for the Hadoop system can help you configure and manage your Hadoop environment for both performance and cost. Getting the settings right requires significant up-front time, because requirements for each enterprise Hadoop system will vary depending on the job or workload. The time spent optimizing for your specific workloads will pay off not only in better performance, but in a lower total cost of ownership for the Hadoop environment.

Benchmarking

Benchmarking is the quantitative foundation for measuring the efficiency of any computer system. Intel developed the HiBench suite as a comprehensive set of benchmark tests for Hadoop environments.8 The individual measures represent important Hadoop workloads with a mix of hardware usage characteristics. HiBench includes microbenchmarks as well as real-world Hadoop applications representative of a wider range of data analytics such as search indexing and machine learning. HiBench 2.1 is now available as open source under Apache License 2.0 at https://github.com/hibench/HiBench-2.1.


HiBench: The Details

Intel's HiBench suite looks at 10 workloads in four categories.

Microbenchmarks: Sort
• This workload sorts its binary input data, which is generated using the Apache Hadoop* RandomTextWriter example.
• Representative of real-world MapReduce jobs that transform data from one format to another.

Microbenchmarks: WordCount
• This workload counts the occurrences of each word in the input data, which is generated using Hadoop* RandomTextWriter.
• Representative of real-world MapReduce jobs that extract a small amount of interesting data from a large data set.
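As an illustration of the pattern WordCount exercises, the map and reduce phases can be sketched in a few lines of single-process Python. This is a sketch of the algorithm only, not the Hadoop MapReduce API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each key. Hadoop does this per key after
    # shuffling all pairs with the same word to the same reducer.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = reduce_phase(map_phase(["to be or not to be"]))
print(counts["to"], counts["be"])  # → 2 2
```

In a real Hadoop job, the pairs emitted by the map phase are shuffled across the cluster so that all counts for a given word arrive at one reducer.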

Microbenchmarks: TeraSort
• A standard benchmark for large-scale data sorting; its input is generated by the TeraGen program.

Microbenchmarks: Enhanced DFSIO
• Tests the HDFS* throughput of a Hadoop cluster.
• Computes the aggregated bandwidth by sampling the number of bytes read or written at fixed time intervals in each map task.
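The bandwidth aggregation that Enhanced DFSIO describes can be illustrated with a small sketch: given each map task's cumulative byte counts sampled at a fixed interval, sum the per-task rates. The function below is a hypothetical reconstruction for illustration, not the benchmark's code:

```python
def aggregated_bandwidth(task_samples, interval_s):
    """Sum per-task throughput. task_samples holds one list of cumulative
    byte counts per map task, sampled every interval_s seconds."""
    total = 0.0
    for samples in task_samples:
        elapsed = (len(samples) - 1) * interval_s
        if elapsed > 0:
            total += (samples[-1] - samples[0]) / elapsed
    return total

# Two tasks, each writing 100 MB over 4 seconds, sampled every 2 seconds:
rate = aggregated_bandwidth([[0, 50e6, 100e6], [0, 50e6, 100e6]], interval_s=2)
print(rate)  # → 50000000.0 (bytes per second)
```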

Web search: Apache Nutch* Indexing
• This workload tests the indexing subsystem of Nutch, a popular Apache open-source search engine. The Nutch crawler subsystem is used to crawl an in-house Wikipedia* mirror and generates 8.4 GB of compressed data (about 2.4 million web pages) as workload input.
• Large-scale indexing is one of the most significant uses of MapReduce (for example, in Google* and Facebook* platforms).

Web search: Page Rank
• This workload is an open-source implementation of the page-rank algorithm, a link-analysis algorithm used widely in Web search engines.
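The algorithm itself is a short power iteration: each page repeatedly shares its rank across its outbound links, with a damping factor modeling random jumps. A minimal single-process sketch (HiBench runs the equivalent computation as MapReduce jobs over much larger link graphs):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> outbound-link targets."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                # Share this node's rank equally across its outbound links.
                share = rank[node] / len(targets)
                for t in targets:
                    new[t] += damping * share
            else:
                # Dangling node: distribute its rank evenly over all nodes.
                for t in nodes:
                    new[t] += damping * rank[node] / n
        rank = new
    return rank

r = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
# The page linked by both others ("c") ends up with the highest rank.
```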

Machine learning: K-Means Clustering
• Typical application area of MapReduce for large-scale data mining and machine learning (for example, in Google and Facebook platforms).
• K-Means is a well-known clustering algorithm.
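The k-means loop alternates two steps: assign each point to its nearest center, then move each center to the mean of its assigned points. A minimal single-process sketch of that loop (HiBench executes the same idea as MapReduce jobs over large data sets):

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means over a list of coordinate tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: nearest center by squared Euclidean distance.
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers

centers = kmeans([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], k=2)
```

In the distributed version, each iteration roughly corresponds to one MapReduce job: the assignment step maps points to their nearest center, and the update step reduces each cluster to a new mean.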

Machine learning: Bayesian Classification
• Typical application area of MapReduce for large-scale data mining and machine learning (for example, in Google and Facebook platforms).
• This workload tests the naive Bayesian trainer (naive Bayes is a well-known classification algorithm for knowledge discovery and data mining) in the Apache Mahout* open-source machine learning library.
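Naive Bayes itself is simple enough to sketch: training counts word frequencies per class, and classification picks the class with the highest log-probability. A minimal multinomial naive Bayes in single-process Python, illustrating the algorithm the workload trains at scale (not the library API):

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Train a multinomial naive Bayes model from (label, words) pairs."""
    class_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab, len(docs)

def classify(model, words):
    class_counts, word_counts, vocab, n_docs = model
    best, best_score = None, float("-inf")
    for label in class_counts:
        # log P(label) + sum of log P(word | label), Laplace-smoothed.
        score = math.log(class_counts[label] / n_docs)
        total = sum(word_counts[label].values())
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

model = train([("spam", ["free", "offer"]), ("ham", ["meeting", "notes"])])
print(classify(model, ["free"]))  # → spam
```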

Analytical query: Apache Hive* Join
• This workload models complex analytic queries of structured (relational) tables by computing both the average and the sum for each group, joining two different tables.

Analytical query: Hive* Aggregation
• This workload models complex analytic queries of structured (relational) tables by computing the sum of each group over a single read-only table.
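Both Hive* workloads reduce to SQL-style GROUP BY aggregation. Stripped of SQL and distribution, the core computation looks like this (an illustrative sketch with hypothetical column names, not HiBench's actual queries):

```python
from collections import defaultdict

def group_aggregate(rows, key, value):
    """GROUP BY `key`, computing the SUM and AVG of `value` per group --
    the query shape the Hive workloads above exercise at scale."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[value]
        counts[row[key]] += 1
    return {k: (sums[k], sums[k] / counts[k]) for k in sums}

rows = [
    {"page": "a", "visits": 10},
    {"page": "a", "visits": 20},
    {"page": "b", "visits": 5},
]
print(group_aggregate(rows, "page", "visits"))  # → {'a': (30.0, 15.0), 'b': (5.0, 5.0)}
```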

Five Steps and a Checklist: Get Started with Your Big Data Analytics Project

If you’ve read this far, you now have a good understanding of the IT landscape for big data, its potential value to organizations, and the emerging technologies that can help you get insights out of these unstructured data resources. Plus, you have a good overview of the basics for getting the right infrastructure in place and running smoothly to support your Hadoop initiatives.

You can get started with your big data analytics project by following these five steps.

Step 1: Work with your business users to articulate the big opportunities.
• Identify and collaborate with business users (analysts, data scientists, marketing professionals, and so on) to find the best business opportunities for big data analytics in your organization. For example, consider an existing business problem—especially one that is difficult, expensive, or impossible to solve with your current data sources and analytics systems. Or consider a problem that has never been addressed before because the data sources are new and unstructured.
• Prioritize your opportunity list and select a project with a discernible return on investment.
• Determine the skills you need to successfully accomplish your initiative.

Step 2: Do your research to get up to speed on the technology.
• Talk with your peers in IT.
• Take advantage of Intel IT Center resources for big data.
• Understand vendor offerings.
• Take tutorials and examine user documentation offered by Apache.

Step 3: Develop use case(s) for your project.
• Identify the use cases required to carry out your project.
• Map out data flows to help define what technology and big data capabilities are required to solve the business problem.
• Decide what data to include and what to leave out. Identify only the strategic data that will lead to meaningful insight.
• Determine how data interrelates and the complexity of the business rules.
• Identify the analytical queries and algorithms required to generate the desired outputs.

Step 4: Identify gaps between current- and future-state capabilities.
• What additional data quality requirements will you have for collecting, cleansing, and aggregating data into usable formats?
• What data governance policies will need to be in place for classifying data; defining its relevance; and storing, analyzing, and accessing it?
• What infrastructure capabilities will need to be in place to ensure scalability, low latency, and performance?
• How will data be presented to users? Findings need to be delivered in an easy-to-understand way to a variety of business users, from senior executives to information professionals.

Step 5: Develop a test environment for a production version.
• Adapt reference architectures to your enterprise. Through the Intel Cloud Builders program, Intel is working with leading partners to develop reference architectures around big data use cases that can help.
• Define the presentation layer, analytics application layer, data warehousing, and, if applicable, private- or public-cloud data management.
• Determine the tools users require to present results in a meaningful way. User adoption of tools will significantly influence the overall success of your project.

Intel Resources for Learning More

About Big Data

Big Data Analytics (Collection Page)
This web page aggregates key Intel resources that can help you implement your own big data initiatives. Visit this page in the Intel IT Center for planning guides, peer research, vendor solution information, and real-world case studies.
intel.com/bigdata

Big Data Mining in the Enterprise for Better Business Intelligence
This white paper from Intel IT describes how Intel is putting in place the systems and skills for analyzing big data to drive operational efficiencies and competitive advantage. Intel IT, in partnership with Intel business groups, is deploying several proofs of concept for a big data platform, including malware detection, chip design validation, market intelligence, and a recommendation system.
intel.com/content/www/us/en/it-management/intel-it-best-practices/mining-big-data-In-the-enterprise-for-better-business-intelligence.html

Inside IT: Big Data
In this podcast, Moty Fania, who leads Intel's strategy team for big data for business intelligence, talks about developing the necessary skills and the right platform to deal with big data.
http://connectedsocialmedia.com/intel/5773/inside-it-big-data/

Peer Research: Big Data Analytics
Read the results of a survey of 200 IT managers that provides insights into how organizations are using big data analytics today, including what organizations need to move forward and what the research means for the IT industry. Highlights are reported in the video "IT Managers Speak Out about Big Data Analytics."
intel.com/content/www/us/en/big-data/data-insights-peer-research-report.html

Big Thinkers on Big Data
A series of interviews with thought leaders about big data, including LiveRamp CEO Auren Hoffman on the big data revolution driving business competition; Forrester Principal Analyst Mike Gualtieri on what's next; and Cognito CEO Joshua Feast on big data, human behavior, and business outcomes.
intel.com/content/www/us/en/big-data/big-thinkers-on-big-data.html

About Hadoop Software

Apache Hadoop Spotlights
Visit this page to hear from Apache Hadoop open-source community experts explaining how software components of the Hadoop stack work and where future development will lead. Podcast interviews include Alan Gates (Hortonworks) on HCatalog and Pig, Konstantin Shvachko (AltoScale) on HDFS, Deveraj Das (Hortonworks) on MapReduce, and Carl Steinbach (Cloudera) on Hive.
intel.com/content/www/us/en/big-data/big-data-apache-hadoop-framework-spotlights-landing.html

Intel® Cloud Builders Guide to Cloud Design and Deployment on Intel® Platforms: Apache* Hadoop*
This reference architecture is for organizations that want to build their own cloud computing infrastructure, including Apache Hadoop clusters to manage big data. It includes steps for setting up the deployment in your data center lab environment and contains details on Hadoop topology, hardware, software, installation and configuration, and testing. Implementing this reference architecture will help you get started building and operating your own Hadoop infrastructure.
intelcloudbuilders.com/docs/Intel_Cloud_Builders_Hadoop.pdf

Optimizing Hadoop* Deployments
This white paper provides guidance to organizations as they plan Hadoop deployments. Based on extensive lab testing with Hadoop software at Intel, it describes best practices for establishing server hardware specifications, discusses the server software environment, and provides advice on configuration and tuning that can improve performance.
intel.com/content/www/us/en/cloud-computing/cloud-computing-optimizing-hadoop-deployments-paper.html

Additional Resources

Big Data: Harnessing a Game-Changing Asset
This report from the Economist Intelligence Unit, sponsored by SAS, looks at big data and its impact on companies. The survey examined the organizational characteristics of companies already adept at extracting value from data and found a strong link between effective data management and financial performance. These companies can provide models for how organizations need to evolve to effectively manage and gain value from big data.
sas.com/resources/asset/SAS_BigData_final.pdf

The Forrester™ Wave: Enterprise Hadoop Solutions, Q1 2012
This report by James Kobielus at Forrester reviews 13 enterprise Hadoop solution providers, applying a 15-criteria evaluation to each. Leaders include Amazon Web Services, IBM, EMC Greenplum, MapR, Cloudera, and Hortonworks.
forrester.com/The+Forrester+Wave+Enterprise+Hadoop+Solutions+Q1+2012/quickscan/-/E-RES60755

Endnotes

1. Gens, Frank. IDC Predictions 2012: Competing for 2020. IDC (December 2011). http://cdn.idc.com/research/Predictions12/Main/downloads/IDCTOP10Predictions2012.pdf

2. "Big Data Infographic and Gartner 2012 Top 10 Strategic Tech Trends." Business Analytics 3.0 (blog) (November 11, 2011). http://practicalanalytics.wordpress.com/2011/11/11/big-data-infographic-and-gartner-2012-top-10-strategic-tech-trends/

3. "Global Internet Traffic Projected to Quadruple by 2015" (press release). The Network (June 1, 2011). http://newsroom.cisco.com/press-release-content?type=webcontent&articleId=324003

4. Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute (May 2011). mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.pdf

5. Peer Research on Big Data Analytics: Intel's IT Manager Survey on How Organizations Are Using Big Data. Intel (August 2012). intel.com/content/www/us/en/big-data/data-insights-peer-research-report.html

6. Nutch* software was initially an independent open-source project originated by Doug Cutting and Mike Cafarella. In 2005, Nutch began being managed by the Apache Software Foundation, first as a subproject of Apache Lucene* search software, and then in 2010 as a top-level project of the Apache Software Foundation. Source: "Nutch Joins Apache Incubator" (press release). Apache Software Foundation (January 2005). nutch.apache.org/#January+2005%3A+Nutch+Joins+Apache+Incubator

7. "Hadoop Hits Primetime with Production Release." Datanami (January 6, 2012). datanami.com/datanami/2012-01-06/hadoop_hits_primetime_with_production_release.html

8. Huang, Shengsheng, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis. IEEE (March 2010).

More from the Intel® IT Center

Planning Guide: Getting Started with Big Data is brought to you by the Intel® IT Center, Intel's program for IT professionals. The Intel IT Center is designed to provide straightforward, fluff-free information to help IT pros implement strategic projects on their agenda, including virtualization, data center design, cloud, and client and infrastructure security. Visit the Intel IT Center for:

• Planning guides, peer research, and solution spotlights to help you implement key projects
• Real-world case studies that show how your peers have tackled the same challenges you face
• Information on how Intel's own IT organization is implementing cloud, virtualization, security, and other strategic initiatives
• Information on events where you can hear from Intel product experts as well as from Intel's own IT professionals

Learn more at intel.com/ITCenter.


This paper is for informational purposes only. THIS DOCUMENT IS PROVIDED “AS IS” WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION, OR SAMPLE. Intel disclaims all liability, including liability for infringement of any property rights, relating to use of this information. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted herein.

Copyright © 2013 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Microsoft is a registered trademark of Microsoft Corporation in the United States and/or other countries.

0213/RF/ME/PDF-USA 328687-001