Hadoop Ecosystem

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Zohar Elkayam
www.realdbamagic.com
Twitter: @realmgic

Who am I?
• Zohar Elkayam, CTO at Brillix
• Programmer, DBA, team leader, database trainer, public speaker, and a senior consultant for over 18 years
• Oracle ACE Associate
• Part of ilOUG – Israel Oracle User Group
• Involved with Big Data projects since 2011
• Blogger – www.realdbamagic.com and www.ilDBA.co.il

About Brillix (http://brillix.co.il)
• We offer complete, integrated end-to-end solutions based on best-of-breed innovations in database, security and big data technologies
• We provide complete end-to-end 24x7 expert remote database services
• We offer professional customized on-site training, delivered by our top-notch, world-recognized instructors

Some of Our Customers

Agenda
• What is the Big Data challenge?
• A Big Data Solution: Apache Hadoop
• HDFS
• MapReduce and YARN
• Hadoop Ecosystem: HBase, Sqoop, Hive, Pig and other tools
• Another Big Data Solution: Apache Spark
• Where does the DBA fit in?

The Challenge

The Big Data Challenge

Volume
• Big data comes in one size: Big.
• Size is measured in Terabytes (10^12 bytes), Petabytes (10^15), Exabytes (10^18), Zettabytes (10^21)
• Storing and handling the data becomes an issue
• Producing value out of the data in a reasonable time is an issue

Variety
• Big Data extends beyond structured data, including semi-structured and unstructured information: logs, text, audio and videos
• A wide variety of rapidly evolving data types requires highly flexible stores and handling

Unstructured            Structured
Objects                 Tables
Flexible                Columns and Rows
Structure Unknown       Predefined Structure
Textual and Binary      Mostly Textual

Velocity
• The speed at which data is being generated and collected
• Streaming data and large-volume data movement
• High velocity of data capture – requires rapid ingestion
• Might cause a backlog problem

Value
Big data is not about the size of the data; it’s about the value within the data

So, We Define a Big Data Problem…
• When the data is too big or moves too fast to handle in a sensible amount of time
• When the data doesn’t fit any conventional database structure
• When we think that we can still produce value from that data and want to handle it
• When the technical solution to the business need becomes part of the problem

How to do Big Data

Big Data in Practice
• Big data is big: technological framework and infrastructure solutions are needed
• Big data is complicated:
  • We need developers to manage handling of the data
  • We need devops to manage the clusters
  • We need data analysts and data scientists to produce value

Possible Solutions: Scale Up
• Older solution: using a giant server with a lot of resources (scale up: more cores, faster processors, more memory) to handle the data
  • Process everything on a single server with hundreds of CPU cores
  • Use lots of memory (1+ TB)
  • Have a huge data store on high-end storage solutions
• Data needs to be copied to the processes in real time, so it’s no good for large amounts of data (Terabytes to Petabytes)

Another Solution: Distributed Systems
• A scale-out solution: use distributed systems, with multiple machines working on a single job/application
• More machines means more resources
  • CPU
  • Memory
  • Storage
• But the solution is still complicated: infrastructure and frameworks are needed

Distributed Infrastructure Challenges
• We need infrastructure that is built for:
  • Large scale
  • Linear scale-out ability
  • Data-intensive jobs that spread the problem across clusters of server nodes
• Storage: efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data
• Network infrastructure that can quickly import large data sets and then replicate them to various nodes for processing
• High-end hardware is too expensive - we need a solution that uses cheaper hardware

Distributed System/Frameworks Challenges
• How do we distribute our workload across the system?
• Programming complexity – keeping the data in sync
• What to do with faults and redundancy?
• How do we handle security demands to protect highly-distributed infrastructure and data?

A Big Data Solution: Apache Hadoop

Apache Hadoop
• Open source project run by the Apache Software Foundation (2006)
• Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure
• It has been the driving force behind the growth of the big data industry
• Get the public release from:
  • http://hadoop.apache.org/core/

Original Hadoop Components
• HDFS (Hadoop Distributed File System) – distributed file system that runs in clustered environments
• MapReduce – programming paradigm for running processes over clustered environments
• Hadoop main idea: let’s distribute the data to many servers, and then bring the program to the data

Hadoop Benefits
• Designed for scale out
• Reliable solution based on unreliable hardware
• Load data first, structure later
• Designed for storing large files
• Designed to maximize throughput of large scans
• Designed to leverage parallelism
• Solution Ecosystem

What Hadoop Is Not
• Hadoop is not a database – it is not a replacement for a DW, or for other relational databases
• Hadoop is not for OLTP/real-time systems
• Very good for large amounts of data, not so much for smaller sets
• Designed for clusters – there is no Hadoop monster server (single server)

Hadoop Limitations
• Hadoop is scalable but it’s not fast
• Some assembly may be required
• Batteries are not included (DIY mindset) – some features need to be developed if they’re not available
• Open source license limitations apply
• Technology is changing very rapidly

Hadoop under the Hood

Original Hadoop 1.0 Components
• HDFS (Hadoop Distributed File System) – distributed file system that runs in a clustered environment
• MapReduce – programming technique for running processes over a clustered environment

Hadoop 2.0
• Hadoop 2.0 changed the Hadoop concept and introduced better resource management:
  • Hadoop Common
  • HDFS
  • YARN
  • Multiple data processing frameworks including MapReduce, Spark and others

HDFS is...
• A distributed file system
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures and still stay resilient
• Intended for larger files
• Designed for batch inserts and appending data (no updates)

Files and Blocks
• Files are split into 128MB blocks by default (a block is the single unit of storage)
  • Managed by the NameNode and stored on DataNodes
  • Transparent to users
• Replicated across machines at load time
  • The same block is stored on multiple machines
  • Good for fault tolerance and access
  • Default replication factor is 3

HDFS is Good for...
• Storing large files
  • Terabytes, Petabytes, etc...
  • Millions rather than billions of files
  • 128MB or more per file
• Streaming data
  • Write-once, read-many-times patterns
  • Optimized for streaming reads rather than random reads

HDFS is Not So Good For...
• Low-latency reads / real-time applications
  • High throughput rather than low latency for small chunks of data
  • HBase addresses this issue
• Large amounts of small files
  • Better for millions of large files instead of billions of small files
• Multiple writers
  • Single writer per file
  • Writes at the end of files, no support for arbitrary offsets

Using HDFS from the Command Line

What Does HDFS Look Like (GUI)

Interfacing with HDFS

MapReduce is...
• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• MapReduce can be written in Java, Scala, C, Python, Ruby and others
• Concept: bring the code to the data, not the data to the code

The MapReduce Paradigm
• Imposes key-value input/output
• We implement two main functions:
  • MAP - Takes a large problem, divides it into sub-problems, and performs the same function on all sub-problems
    Map(k1, v1) -> list(k2, v2)
  • REDUCE - Combines the output from all sub-problems (each key goes to the same reducer)
    Reduce(k2, list(v2)) -> list(v3)
• Framework handles everything else (almost)

Divide and Conquer

YARN
• Takes care of distributed processing and coordination
• Scheduling
  • Jobs are broken down into smaller chunks called tasks
  • These tasks are scheduled to run on data nodes
• Task Localization with Data
  • Framework strives to place tasks on the nodes that host the segment of data to be processed by that specific task
  • Code is moved to where the data is
• Error Handling
  • Failures are an expected behavior, so tasks are automatically retried on other machines
• Data Synchronization
  • Shuffle and Sort barrier rearranges and moves data between machines
  • Input and output are coordinated by the framework

Submitting a Job
• Running the yarn script with a class argument launches a JVM and executes the provided Job
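
The map, shuffle-and-sort, and reduce steps described in the slides above can be illustrated with a small single-process sketch. This is not Hadoop code and uses no Hadoop API: the function names (`map_words`, `shuffle_and_sort`, `reduce_counts`, `run_job`) are hypothetical, the word-count problem is chosen only as an illustration, and everything runs in memory on one machine. The reducer here emits (key, total) pairs rather than bare values, which is an assumption made for readability.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_words(_offset: int, line: str) -> Iterator[tuple[str, int]]:
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs: Iterable[tuple[str, int]]) -> list[tuple[str, list[int]]]:
    """Group all values by key so each key reaches exactly one reduce call."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(word: str, counts: list[int]) -> Iterator[tuple[str, int]]:
    """Reduce(k2, list(v2)): combine the partial counts for one word."""
    yield word, sum(counts)


def run_job(lines: list[str]) -> dict[str, int]:
    """Run map, then shuffle/sort, then reduce, all in one process."""
    mapped = [
        pair
        for offset, line in enumerate(lines)
        for pair in map_words(offset, line)
    ]
    result: dict[str, int] = {}
    for word, counts in shuffle_and_sort(mapped):
        for key, total in reduce_counts(word, counts):
            result[key] = total
    return result


if __name__ == "__main__":
    print(run_job(["big data is big", "data moves fast"]))
```

In a real cluster the map calls would run in parallel on the nodes holding the input blocks, and the grouping step would move data between machines; the sketch only shows the data flow between the three stages.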
