Hadoop 2.0 Introduction – with HDP for Windows
Seele Lin

Who am I
• Speaker:
  – 林彥辰
  – A.K.A. Seele Lin
  – Mail: [email protected]
• Experience:
  – 2010~Present
  – 2013~2014: Trainer for Hortonworks Certified Training courses
    • HCAHD (Hortonworks Certified Apache Hadoop Developer)
    • HCAHA (Hortonworks Certified Apache Hadoop Administrator)

Agenda
• What is Big Data
  – The Need for Hadoop
• Hadoop Introduction
  – What is Hadoop 2.0
• Hadoop Architecture Fundamentals
  – What is HDFS
  – What is MapReduce
  – What is YARN
  – Hadoop eco-systems
• HDP for Windows
  – What is HDP
  – How to install HDP on Windows
  – The advantages of HDP
  – What's Next
• Conclusion
• Q&A

What is Big Data

What is Big Data?
1. In what timeframe do we now create the same amount of information that we created from the dawn of civilization until 2003? (Answer: 2 days)
2. 90% of the world's data was created in the last how many years? (Answer: 2 years)
3. And this is data from a 2010 report!
Sources:
http://www.itbusinessedge.com/cm/blogs/lawson/just-the-stats-big-numbers-about-big-data/?cs=48051
http://techcrunch.com/2010/08/04/schmidt-data/

How large can it be?
• 1 ZB = 1,000 EB = 1,000,000 PB = 1,000,000,000 TB

Every minute…
• http://whathappensontheinternetin60seconds.com/

The definition?
"Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it too." ── Dan Ariely

Big Data Includes All Types of Data
• Structured: pre-defined schema; relational database (a database)
• Semi-structured: inconsistent structure; cannot be stored in rows in a single table; often has nested structure (e.g. logs, tweets – a set of files)
• Unstructured: irregular structure, or parts of it lack structure (e.g. pictures, video – a single file)
• Big data is often time-sensitive and immutable

6 Key Hadoop Data Types
1. Sentiment – how your customers feel
2. Clickstream – website visitors' data
3. Sensor/Machine – data from remote sensors and machines
4. Geographic – location-based data
5. Server Logs
6. Text – millions of web pages, emails, and documents
© Hortonworks Inc. 2013

4 V's of Big Data
http://www.datasciencecentral.com/profiles/blogs/data-veracity

Next Product to Buy (NPTB)
Business Problem
• Telecom product portfolios are complex
• There are many cross-sell opportunities to the installed base
• Sales associates use in-person conversations to guess at NPTB recommendations, with little supporting data
Solution
• Hadoop gives telcos the ability to make confident NPTB recommendations, based on data from all their customers
• Confident NPTB recommendations empower sales associates and improve their interactions with customers
• Use the HDP data lake to reduce sales friction and create an NPTB advantage like Amazon's advantage in eCommerce

Use case – Walmart prediction
• Diapers + Beer + Friday → Revenue?

Localized, Personalized Promotions
Business Problem
• Telcos can geo-locate their mobile subscribers
• They could create localized and personalized promotions
• This requires connections with both deep historical data and real-time streaming data
• Those connections have been expensive and complicated
Solution
• Hadoop brings the data together to inexpensively localize and personalize promotions delivered to mobile devices
• Notify subscribers about local attractions, events and sales that align with their preferences and location
• Telcos can sell these promotional services to retailers

360° View of the Customer
Business Problem
• Retailers interact with customers across multiple channels
• Customer interaction and purchase data is often siloed
• Few retailers can correlate customer purchases with marketing campaigns and online browsing behavior
• Merging data in relational databases is expensive
Solution
• Hadoop gives retailers a 360° view of customer behavior
• Store data longer and track phases of the customer lifecycle
• Gain competitive advantage: increase sales, reduce supply chain expenses, and retain the best customers

Use case – Target case
• Target mined its customer data and sent coupons to shoppers with a high "pregnancy prediction" score.
• One angry father stormed into a Target to yell at them for sending his daughter coupons for baby clothes and cribs.
• Guess what: she was pregnant, and hadn't told her father yet.

Changes in Analyzing Data
• Big data is fundamentally changing the way we analyze information.
  – Ability to analyze vast amounts of data rather than evaluating sample sets.
  – Historically we have had to look at causes. Now we can look at patterns and correlations in data that give us a much better perspective.

Recent cases 1:
• http://www.ibtimes.co.uk/global-smartphone-data-traffic-increase-eightfold-17-exabyte-2020-1475571
• Practice on LINE: http://tech.naver.jp/blog/?p=2412

Recent cases 2: in Taiwan – media analysis of the 2014 Taipei City mayoral election
• IMHO, 黑貘來說: http://gene.speaking.tw/2014/11/blog-post_28.html
• 破解社群與APP行銷 ("Cracking social and app marketing"): http://taiwansmm.wordpress.com/2014/11/26/行銷絕不等於買廣告-2014年台北市長選舉柯文哲與連/

Scale up or Scale out? Guess what…
• Traditionally, computation has been processor-bound
• For decades, the primary push was to increase the computing power of a single machine
  – Faster processors, more RAM
• Distributed systems evolved to allow developers to use multiple machines for a single job
  – At compute time, data is copied to the compute nodes

Scaling with a traditional database
• Scaling with a queue
• Sharding the database
• Fault-tolerance issues
• Corruption issues
• Each step brings new problems

NOSQL
• Not Only SQL
• Provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases
• Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability
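The horizontal scaling that motivates these stores usually comes from sharding: routing each key to one of N nodes by a stable hash. Here is a minimal sketch of that idea in Python; the `ShardedStore` class and its API are invented for illustration and do not correspond to any of the systems listed here.

```python
import hashlib

class ShardedStore:
    """Toy key-value store sharded across N in-memory 'nodes' by key hash.

    Illustrates the horizontal-scaling idea behind NoSQL key-value
    stores; real systems add replication, rebalancing, and persistence.
    """
    def __init__(self, num_shards=4):
        # Each dict stands in for one storage node.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # Stable hash, so every client routes a given key to the same shard.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)

store = ShardedStore(num_shards=4)
store.put("user:42", {"name": "Seele"})
print(store.get("user:42"))  # {'name': 'Seele'}
```

Adding capacity means adding shards, but note that changing `num_shards` remaps most keys under plain modulo hashing, which is why real systems use consistent hashing instead.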
• Column stores: Accumulo, Cassandra, Druid, HBase, Vertica
• Document stores: Clusterpoint, Apache CouchDB, Couchbase, MarkLogic, MongoDB
• Key-value stores: Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike
• Graph databases: Allegro, Neo4j, InfiniteGraph, OrientDB, Virtuoso, Stardog

First principles (1/2)
• "At the most fundamental level, what does a data system do?"
• "A data system answers questions based on information that was acquired in the past."
• "What is this person's name?"
• "How many friends does this person have?"
• A bank account web page answers questions like "What is my current balance?" and "What transactions have occurred on my account recently?"

First principles (2/2)
• "Data" is often used interchangeably with the word "information".
• You answer questions on your data by running functions that take data as input.
• The most general-purpose data system can answer questions by running functions that take the entire dataset as input. In fact, any query can be answered by running a function on the complete dataset.

Desired Properties of a Big Data System
• Robust and fault-tolerant
• Low-latency reads and updates
• Scalable
• General
• Extensible
• Allows ad hoc queries
• Minimal maintenance

The Lambda Architecture
• There is no single tool that provides a complete solution; you have to use a variety of tools and techniques to build a complete Big Data system.
• The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in realtime by decomposing the problem into three layers:
  – the batch layer, the serving layer, and the speed layer

The Lambda Architecture model

Batch Layer (1)
• The batch layer stores the master copy of the dataset and precomputes batch views on that master dataset.
• The master dataset can be thought of as a very large list of records.
• It does two things:
  – store an immutable, constantly growing master dataset,
  – and compute arbitrary functions on that dataset.
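The "query = function(all data)" idea above can be made concrete with a few lines of Python. The records and the `pageviews_per_url` query are invented for this sketch; the point is only that the batch layer answers any question by running a function over the complete, append-only master dataset.

```python
from collections import Counter

# Append-only master dataset: records are immutable once written.
master_dataset = [
    {"event": "pageview", "url": "/home", "user": "alice"},
    {"event": "pageview", "url": "/home", "user": "bob"},
    {"event": "pageview", "url": "/pricing", "user": "alice"},
]

def pageviews_per_url(all_data):
    """query = function(all data): recompute the answer from scratch."""
    return Counter(r["url"] for r in all_data if r["event"] == "pageview")

# Precomputing this result for every URL is exactly what a batch view is.
batch_view = pageviews_per_url(master_dataset)
print(batch_view["/home"])  # 2
```

On a real cluster the same function would be expressed as a MapReduce job over HDFS, but the contract is identical: full recomputation over the whole dataset, traded for high latency.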
• If you're going to precompute views on a dataset, you need to be able to do so for any view and any dataset.
• There's a class of systems called "batch processing systems" that are built to do exactly what the batch layer requires.

Batch Layer (2) – What is a Batch View?
• Everything starts from the "query = function(all data)" equation.
• You could literally run your query functions on the fly on the complete dataset to get the results, but it would take a huge amount of resources and would be unreasonably expensive.
• Instead of computing the query on the fly, you read the results from the precomputed view.

How we get the Batch View

Serving Layer (1)

Serving Layer (2)
• The batch layer emits batch views as the result of its functions.
• The next step is to load the views somewhere so that they can be queried.
• The serving layer indexes the batch view and loads it up so it can be efficiently queried to get particular values out of the view.

Batch and serving layers satisfy almost all properties
• Robust and fault-tolerant
• Scalable
• General
• Extensible
• Allows ad hoc queries
• Minimal maintenance

Speed Layer (1)

Speed Layer (2)
• The speed layer is similar to the batch layer in that it produces views based on the data it receives.
• One big difference is that, in order to achieve the lowest latencies possible, the speed layer doesn't look at all the new data at once.
• It updates the realtime view as it receives new data, instead of recomputing it from scratch like the batch layer does.
• "Incremental updates" vs. "recomputation updates"
• Page view example

Speed Layer (3)
• Complexity isolation: complexity is pushed into a layer whose results are only temporary.
• The last piece of the Lambda Architecture is merging the results from the batch and realtime views.

Summary of the Lambda Architecture
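The page view example above can be sketched end to end in a few lines of Python. This is a toy model, not a real Lambda deployment: the batch view would normally live in a serving-layer store and the realtime view in something like a stream processor, and the function names here are invented for illustration.

```python
from collections import Counter, defaultdict

# Batch view: precomputed over the master dataset up to the last batch run.
batch_view = Counter({"/home": 2, "/pricing": 1})

# Realtime view: covers only events that arrived since that batch run.
realtime_view = defaultdict(int)

def record_pageview(url):
    # Incremental update: touch only the affected entry,
    # never recompute over the whole dataset.
    realtime_view[url] += 1

def query_pageviews(url):
    # A query is answered by merging the batch and realtime results.
    return batch_view[url] + realtime_view[url]

record_pageview("/home")
print(query_pageviews("/home"))  # 3
```

When the next batch run absorbs the new events, the realtime entries it covered can simply be discarded, which is the "complexity isolation" point above: any error or approximation in the speed layer is temporary by construction.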