
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Zohar Elkayam
www.realdbamagic.com
Twitter: @realmgic

Who Am I?
• Zohar Elkayam, CTO at Brillix
• Programmer, DBA, team leader, database trainer, public speaker, and senior consultant for over 18 years
• Oracle ACE Associate
• Part of ilOUG – the Israel Oracle User Group
• Involved with Big Data projects since 2011
• Blogger – www.realdbamagic.com and www.ilDBA.co.il

About Brillix
• We offer complete, integrated, end-to-end solutions based on best-of-breed innovations in database, security, and big data technologies
• We provide complete end-to-end 24x7 expert remote database services
• We offer professional, customized on-site training, delivered by our top-notch, world-recognized instructors

Some of Our Customers

Agenda
• What is the Big Data challenge?
• A Big Data solution: Apache Hadoop
• HDFS
• MapReduce and YARN
• The Hadoop ecosystem: HBase, Sqoop, Hive, Pig, and other tools
• Another Big Data solution: Apache Spark
• Where does the DBA fit in?

The Challenge

The Big Data Challenge

Volume
• Big data comes in one size: big
• Size is measured in terabytes (10^12), petabytes (10^15), exabytes (10^18), and zettabytes (10^21)
• Storing and handling the data becomes an issue
• Producing value out of the data in a reasonable time is an issue

Variety
• Big Data extends beyond structured data, including semi-structured and unstructured information: logs, text, audio, and video
• A wide variety of rapidly evolving data types requires highly flexible stores and handling

  Unstructured        | Structured
  --------------------+---------------------
  Objects             | Tables
  Flexible structure  | Columns and rows
  Structure unknown   | Predefined structure
  Textual and binary  | Mostly textual

Velocity
• The speed at which data is being generated and collected
• Streaming data and large-volume data movement
• High velocity of data capture requires rapid ingestion
• Might cause a backlog problem

Value
Big data is not about the size of the data; it's about the value within the data.

So, We Define a Big Data Problem…
• When the data is too big or moves too fast to handle in a sensible amount of time
• When the data doesn't fit any conventional database structure
• When we think we can still produce value from that data and want to handle it
• When the technical solution to the business need becomes part of the problem

How to Do Big Data

Big Data in Practice
• Big data is big: technological frameworks and infrastructure solutions are needed
• Big data is complicated:
  • We need developers to manage the handling of the data
  • We need DevOps to manage the clusters
  • We need data analysts and data scientists to produce value

Possible Solution: Scale Up
• The older solution: use a giant server with a lot of resources (scale up: more cores, faster processors, more memory) to handle the data
• Process everything on a single server with hundreds of CPU cores
• Use lots of memory (1+ TB)
• Keep a huge data store on high-end storage solutions
• Data needs to be copied to the processes in real time, so this is no good for large amounts of data (terabytes to petabytes)

Another Solution: Distributed Systems
• A scale-out solution: use distributed systems – multiple machines for a single job/application
• More machines mean more resources:
  • CPU
  • Memory
  • Storage
• But the solution is still complicated: infrastructure and frameworks are needed

Distributed Infrastructure Challenges
• We need infrastructure that is built for:
  • Large scale
  • Linear scale-out ability
  • Data-intensive jobs that spread the problem across clusters of server nodes
• Storage: efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data
• Network infrastructure that can quickly import large data sets and then replicate them to the various nodes for processing
• High-end hardware is too expensive – we need a solution that uses cheaper hardware

Distributed System/Framework Challenges
• How do we distribute our workload across the system?
• Programming complexity – keeping the data in sync
• What do we do about faults and redundancy?
• How do we handle the security demands of protecting a highly distributed infrastructure and data?

A Big Data Solution: Apache Hadoop

Apache Hadoop
• An open source project run by the Apache Foundation (2006)
• Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure
• It has been the driving force behind the growth of the big data industry
• Get the public release from: http://hadoop.apache.org/core/

Original Hadoop Components
• HDFS (Hadoop Distributed File System) – a distributed file system that runs in clustered environments
• MapReduce – a programming paradigm for running processes over clustered environments
• Hadoop's main idea: distribute the data to many servers, then bring the program to the data

Hadoop Benefits
• Designed for scale-out
• A reliable solution based on unreliable hardware
• Load data first, structure it later
• Designed for storing large files
• Designed to maximize the throughput of large scans
• Designed to leverage parallelism
• A solution ecosystem

What Hadoop Is Not
• Hadoop is not a database – it is not a replacement for a DW or for other relational databases
• Hadoop is not for OLTP/real-time systems
• Very good for large amounts of data, not so much for smaller sets
• Designed for clusters – there is no Hadoop monster server (single server)

Hadoop Limitations
• Hadoop is scalable, but it's not fast
• Some assembly may be required
• Batteries are not included (DIY mindset) – some features need to be developed if they're not available
• Open source license limitations apply
• The technology is changing very rapidly

Hadoop under the Hood

Original Hadoop 1.0 Components
• HDFS (Hadoop Distributed File System) – a distributed file system that runs in a clustered environment
• MapReduce – a programming technique for running processes over a clustered environment

Hadoop 2.0
• Hadoop 2.0 changed the Hadoop conception and introduced a better resource management concept:
  • Hadoop Common
  • HDFS
  • YARN
  • Multiple data processing frameworks, including MapReduce, Spark, and others

HDFS is...
• A distributed file system
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures and still stay resilient
• Intended for larger files
• Designed for batch inserts and appending data (no updates)

Files and Blocks
• Files are split into 128MB blocks (the single unit of storage)
• Blocks are managed by the NameNode and stored on the DataNodes
• Transparent to users
• Replicated across machines at load time:
  • The same block is stored on multiple machines
  • Good for fault tolerance and access
  • The default replication factor is 3

HDFS is Good for...
• Storing large files
  • Terabytes, petabytes, etc.
  • Millions rather than billions of files
  • 128MB or more per file
• Streaming data
  • Write-once, read-many-times access patterns
  • Optimized for streaming reads rather than random reads

HDFS is Not So Good for...
• Low-latency reads / real-time applications
  • High throughput rather than low latency for small chunks of data
  • HBase addresses this issue
• Large amounts of small files
  • Better for millions of large files than for billions of small files
• Multiple writers
  • Single writer per file
  • Writes go to the end of the file; no support for arbitrary offsets

Using HDFS in Command Line
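For illustration, a minimal sketch of working with HDFS from the shell using the standard hdfs dfs sub-commands (the directory and file names here are hypothetical):

# Create a directory in HDFS and load a local file into it
hdfs dfs -mkdir -p /user/demo
hdfs dfs -put access.log /user/demo/

# List the directory and read the file back
hdfs dfs -ls /user/demo
hdfs dfs -cat /user/demo/access.log

# Show how the file was split into blocks and replicated
hdfs fsck /user/demo/access.log -files -blocks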
What HDFS Looks Like (GUI)

Interfacing with HDFS

MapReduce is...
• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• MapReduce code can be written in Java, Scala, C, Python, Ruby, and other languages
• Concept: bring the code to the data, not the data to the code

The MapReduce Paradigm
• Imposes key-value input/output
• We implement two main functions:
  • MAP – takes a large problem, divides it into sub-problems, and performs the same function on every sub-problem:
    Map(k1, v1) -> list(k2, v2)
  • REDUCE – combines the output from all sub-problems (each key goes to the same reducer):
    Reduce(k2, list(v2)) -> list(v3)
• The framework handles everything else (almost)
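To make the paradigm concrete, here is the classic word-count example written against the Hadoop Java MapReduce API. It is a minimal sketch: the mapper emits (word, 1) pairs, the reducer sums the counts for each word, and the class names are ours.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // MAP: for every word in an input line, emit (word, 1) -- Map(k1, v1) -> list(k2, v2)
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // REDUCE: all counts for the same word arrive together; sum them -- Reduce(k2, list(v2)) -> list(v3)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that the framework, not our code, splits the input, routes each (k2, v2) pair to the right reducer, and retries failed tasks.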
Divide and Conquer

YARN
• Takes care of distributed processing and coordination
• Scheduling:
  • Jobs are broken down into smaller chunks called tasks
  • These tasks are scheduled to run on the data nodes
• Task localization with data:
  • The framework strives to place tasks on the nodes that host the segment of data to be processed by that specific task
  • Code is moved to where the data is

YARN
• Error handling:
  • Failures are expected behavior, so tasks are automatically retried on other machines
• Data synchronization:
  • The shuffle-and-sort barrier re-arranges and moves data between machines
  • Input and output are coordinated by the framework

Submitting a Job
• The yarn script, given a JAR and a class argument, launches a JVM and executes the provided job, as in the sketch below
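For example, if the word-count classes above were packaged into a JAR, submitting the job could look like this (the JAR name and HDFS paths are hypothetical):

yarn jar wordcount.jar WordCount /user/demo/input /user/demo/output

YARN then launches the JVM, schedules the map and reduce tasks on the data nodes, and writes the result to the output directory.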