

*This is meant as a study guide, created from lecture videos, to help you gain an understanding of how Hadoop works.

How to Learn Hadoop for Data Science

Hadoop is open source and has been around for over 10 years. There are current innovations with Hadoop in the cloud sector. Hadoop is being implemented for IoT devices and Genomics data. Hadoop is made up of three components: Storage (HDFS), Compute (MapReduce, Spark) and Management (YARN, Mesos).

Apache Spark performs computation in memory on the worker nodes, which allows for faster processing of Hadoop jobs.

HDFS is used for batch extract, transform and load (ETL) jobs and is included in the Hadoop core distribution.

Data lakes, such as Amazon S3 or Google Cloud Storage, are cloud-based file stores. These options are typically cheaper than running HDFS locally or in the cloud.

Cloudera and Databricks are the current leading commercial distributions of Hadoop.

Public cloud distributions are Google Cloud Dataproc, AWS EMR and Microsoft Azure HDInsight.

As you add Hadoop libraries into your production environment, ensure you are matching the proper versions so your code runs.

Hadoop on GCP

Sign Up for GCP -> Login to GCP Console -> In left side menu, Big Data: Dataproc -> Create cluster to create a managed Hadoop cluster -> name the cluster and fill out additional information -> click Create -> you can then create a Job (types of Dataproc jobs are Hadoop, Spark, PySpark, Hive, Spark SQL and Pig) -> Submit Job
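For a PySpark job, what you submit is just a Python script. The following is a minimal sketch of such a script; the Cloud Storage path is a hypothetical placeholder.

# count_lines.py - a minimal PySpark script that could be submitted as a Dataproc PySpark job
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-lines").getOrCreate()

# read a text file from Cloud Storage (placeholder path) and count its lines
lines = spark.read.text("gs://your-bucket/sample.txt")
print("line count:", lines.count())

spark.stop()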

Preemptible worker nodes are available in Google Cloud Dataproc production instances and are used to speed up batch jobs, as these worker nodes are much cheaper.

Apache Spark was created by the founders of Databricks, which offers a commercial distribution. Databricks with Apache Spark comes with Jupyter-style notebooks for data scientists.

Setup IDE in Databricks Apache Spark

Download your preferred IDE for Python, Scala or Java (the programming languages used to program against the Spark API) -> go into databricks and sign up for Community Edition (Google Chrome works best for databricks)

Adding Hadoop libraries to databricks (the example library here is Avro)- go into the databricks workspace -> Shared -> Create: Library -> Source: Maven Coordinate -> Coordinate: avro -> Click Search Spark Packages -> switch to Maven Central once in Search Packages -> type in avro_tools__1.8.1 -> Create Library


Setting up clusters in databricks

Databricks UI -> Clusters -> Create Cluster -> select Spark version and go with default settings -> Create Cluster

Loading data into tables in databricks

Databricks UI -> Tables -> Create Table -> select Data Source such as a csv or from a Cloud Bucket -> name table -> Create Table

Hadoop Batch Processing

Use MapReduce for Batch Processing.
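As a rough sketch of what a MapReduce batch job looks like, the mapper and reducer below are written in Python for Hadoop Streaming (the file names mapper.py and reducer.py are just illustrative); together they implement a simple word count.

# mapper.py - reads lines from stdin and emits a (word, 1) pair per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py - input arrives sorted by word, so counts can be summed per key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))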

YARN is used for scaling when moving from development to production. YARN is scalable and fault tolerant. It handles memory and job scheduling in a way that will feel familiar from other Hadoop schedulers.

Apache Mesos is a newer alternative to YARN that is not limited to Hadoop workloads. It was designed as a cluster OS framework and handles both job scheduling and container group management.

YARN vs Spark Standalone

Hadoop Use Cases:

Stream Data Processing Use Cases


Streaming ETL is used to take data coming in, process it and respond to it.

Data enrichment is taking data streaming in and combining it with other data sources.

Trigger event detection is used for services such as credit card fraud detection.

Machine Learning Use Cases

Sentiment analysis, which is most easily explained through the example of classifying incoming text as positive or negative.

Fog Computing Use Cases

IoT device messages are used to get information from devices, and the edge layer is used to respond quickly.

Types of Big Data Streaming

Streaming Options

How to Learn Hadoop with Apache Spark

Apache Spark is an open source cluster computing framework that uses in-memory objects for distributed computing. It can be over 100 times faster than MapReduce. The objects that Spark uses are called Resilient Distributed Datasets, or RDDs. Spark has many utility libraries.
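A minimal RDD sketch in PySpark, assuming sc is an existing SparkContext (as it is in a Databricks notebook):

# build an RDD from a local list, transform it in memory, then reduce it
numbers = sc.parallelize([1, 2, 3, 4, 5])
squares = numbers.map(lambda x: x * x)       # transformation, evaluated lazily
total = squares.reduce(lambda a, b: a + b)   # action, triggers the computation
print(total)  # 55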

Apache Spark Libraries

Learn more about Spark by visiting Apache.


Spark Data Interfaces

Resilient Distributed Datasets (RDDs) are a low-level interface to a sequence of data objects in memory.

DataFrames are distributed collections of row types and are similar to the data frames you find in Python.

Datasets are new for Spark 2.0 and are distributed collections that combine DataFrames and RDD functionality.
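A small PySpark sketch of the same data in two of these interfaces, assuming a spark SparkSession is available (as in Databricks):

rows = [("alice", 34), ("bob", 29)]

# low-level RDD interface: a distributed sequence of Python objects
rdd = spark.sparkContext.parallelize(rows)

# DataFrame interface: distributed rows with named columns
df = spark.createDataFrame(rows, ["name", "age"])
df.printSchema()
df.show()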

Spark Programming Languages

Python is easiest, Scala is functional and scalable, Java is for enterprise and R is geared towards machine learning.

SparkSession Objects

In Spark 1.x, the equivalent objects are called sqlContext or sc.

In Spark 2.0, the SparkSession object is called spark.
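Outside of notebooks that pre-create these objects, a Spark 2.x session can be built explicitly. A minimal sketch:

from pyspark.sql import SparkSession

# in Databricks the session already exists as spark; elsewhere build it yourself
spark = SparkSession.builder.appName("example").getOrCreate()
print(spark.version)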

Spark shell

Connect Spark with your Created Cluster in databricks- Databricks UI -> Create new cluster -> Create New Notebook -> in the notebook, type sc OR spark OR sqlContext and click execute


Databricks UI

Visit Databricks to setup your account.

Workspace is the container for stored information.

Tables will provide the tables that have been created.

Clusters will show the clusters created.

Within Clusters: Spark UI, you can see the Jobs, Storage, Environment and Executors.

Executors will show the RDD blocks, which is where your data is stored for Spark.

Working with Spark notebooks in databricks

Databricks UI -> Workspace -> Users -> select notebook

To add markdown language, use %md This is a markdown note example.

You can add notebooks into dashboards. You can see revision history and link to github.

Import and export Spark notebooks in databricks

File Types to Export- DBC Archive is a databricks archive, Source File is the file exported in the language used (.py for Python), iPython Notebook, and HTML, which is a static web page. Exporting as a Source File and importing into a code editor is a good option.


To import a file, in the databricks UI select Workspace: Import and drag the file to import.

Example wordcount in databricks Spark using Scala

Databricks Spark Data Sources

Transformations and Actions

Spark Transformation operations do not show results without a Spark Action operation. Examples of Spark Transformations are select, distinct, groupBy, sum, orderBy, filter and limit.

Examples of Spark Action operations are show, count, collect and save. Take is a Spark Action that selects a certain number of elements from the beginning of a DataFrame.

Pipelining in Spark comes from this separation of Transformations and Actions: Transformations are queued up lazily and are only executed, together, when an Action is called.
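A short PySpark sketch of this behavior; the data is made up inline so the example is self-contained:

df = spark.createDataFrame(
    [("books", 12.0), ("games", 20.0), ("books", 8.0)],
    ["category", "price"],
)

# transformations only - nothing is computed yet
by_category = df.groupBy("category").sum("price").orderBy("category")

# the action triggers the whole pipeline of queued transformations at once
by_category.show()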

DAG

A DAG will show you a visualization of the detailed process of what is occurring, the overhead, how long it is taking, how many executors there are, and so on. View the specific Spark Job to see its DAG.

Spark SQL

Using Spark SQL in databricks- Workspace -> Import Notebook -> read the csv data into Spark using spark.read -> display data with display -> use createOrReplaceTempView to register the imported csv data as a Spark SQL table -> you can then use SQL with %sql
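A hedged sketch of those steps in PySpark; the csv path and table name are hypothetical placeholders:

# read a csv into a DataFrame and look at it (display is a Databricks helper)
df = spark.read.csv("/FileStore/tables/sales.csv", header=True, inferSchema=True)
display(df)

# register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("sales")
spark.sql("SELECT COUNT(*) AS row_count FROM sales").show()
# in a separate notebook cell, the same query can be run with the %sql magic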

SparkR

The easiest way to use SparkR is to make a local R data frame and turn it into a SparkR DataFrame. Use the createDataFrame function to do this.

Spark ML with SparkR in databricks

Workspace -> Import Notebook -> Create a cluster -> load (csv) data with spark.read -> prepare data for model by cleaning it with appropriate Spark commands

Example of Pyspark Spark ML Model (Linear Regression) in databricks
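The example code itself is not reproduced here; as a minimal sketch under assumptions of my own (toy data and made-up column names), a linear regression in PySpark ML might look like this:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# toy data: predict price from size (both column names are made up for illustration)
data = spark.createDataFrame(
    [(1000.0, 200000.0), (1500.0, 290000.0), (2000.0, 400000.0), (2500.0, 480000.0)],
    ["size", "price"],
)

# Spark ML expects the input features packed into a single vector column
assembler = VectorAssembler(inputCols=["size"], outputCol="features")
prepared = assembler.transform(data)

lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(prepared)

print(model.coefficients, model.intercept)
model.transform(prepared).select("size", "price", "prediction").show()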

Spark with GraphX

GraphX is a graph processing library and API that sits on top of Spark. It is used for graph-parallel computation.

Spark with ADAM is used for genomics.

Spark Streaming

Streaming handles data coming in either through batch ingesting or stream ingesting. Amazon Kinesis and Google Cloud Pub/Sub are examples of proprietary solutions.


Processing the incoming data can be done in batch with MapReduce or as a stream with a framework such as Apache Spark.

Using PySpark for Batch Processing in databricks- Databricks UI -> create a notebook -> verify you have spark by typing in spark in the notebook -> load data -> *example code
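The original example code is not included above; a minimal batch-style PySpark sketch (the file paths are placeholders) might look like this:

# batch job: read a csv, aggregate it, and write the result back out
orders = spark.read.csv("/FileStore/tables/orders.csv", header=True, inferSchema=True)

daily_totals = orders.groupBy("order_date").count()

daily_totals.write.mode("overwrite").parquet("/FileStore/tables/daily_totals")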

Using PySpark for Stream Processing in databricks- Databricks UI -> create a notebook -> verify you have spark by typing in spark in the notebook -> load data -> *example code
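Again, the original example code is not included; a minimal Structured Streaming sketch, using the built-in rate source so it runs without an external feed:

# streaming job: the rate source generates rows continuously for testing
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# each micro-batch is written to the console (driver log) as it arrives
query = stream.writeStream.format("console").outputMode("append").start()

query.awaitTermination(30)  # let it run for about 30 seconds
query.stop()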

How to Learn Hadoop with Streaming Services

Apache Kafka, Amazon Kinesis, Google Cloud Pub/Sub and Microsoft Azure Service Bus.

Streaming with Google Cloud Pub/Sub

Go to Google Cloud to setup an account and get started.

Login to GCP Console -> Create New Project -> Launch Cloud Shell -> Create a Topic, which is a named resource to which you send messages, in the gcloud shell -> Create a subscription, which controls how messages are delivered to subscribers and how long they are retained
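The same steps can also be scripted; below is a rough sketch using the google-cloud-pubsub Python client. The project, topic and subscription names are placeholders, and the exact call signatures depend on the client library version.

from google.cloud import pubsub_v1

project_id = "my-project"          # placeholder values
topic_id = "my-topic"
subscription_id = "my-subscription"

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project_id, topic_id)
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# create the topic and a subscription attached to it
publisher.create_topic(request={"name": topic_path})
subscriber.create_subscription(request={"name": subscription_path, "topic": topic_path})

# publish a test message to the topic
publisher.publish(topic_path, data=b"hello pub/sub")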

Apache Kafka

Apache Kafka is used for massive streaming ingest of data. It runs as a cluster of servers, and the cluster stores the streams of records in topics. Each record has a key, a value and a timestamp.

The four APIs used with Kafka are the Producer API: used to publish to topics, the Consumer API: used to subscribe to topics, the Streams API: used to control the input and output of streams, and the Connector API: used to connect to your existing systems such as a database.
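As a rough illustration of the Producer and Consumer APIs, here is a sketch using the third-party kafka-python package; the broker address and topic name are placeholders:

from kafka import KafkaProducer, KafkaConsumer

# Producer API: publish a record (key, value) to a topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", key=b"user-1", value=b'{"page": "/home"}')
producer.flush()

# Consumer API: subscribe to the topic and read records (each has a key, value and timestamp)
consumer = KafkaConsumer("clicks",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for record in consumer:
    print(record.key, record.value, record.timestamp)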

Kafka features include Pub/Sub, in-order messages, streaming and fault tolerance.

Apache Storm

Apache Storm is an alternative streaming processor to Apache Spark; its successor project is Apache Heron. It is a real-time stream processor with a record-at-a-time ingest pipeline.

Apache Storm core concepts are Topologies: real-time application logic, Streams: unbounded sequences of tuples, Spouts: sources of streams in a topology, Bolts: processing performed on stream data, and Stream groupings: how a stream is partitioned across the topology.

Apache Storm does not use micro-batches but is a true streaming system.

Apache Kafka vs. Apache Storm

Apache Kafka is used to bring in the stream of data while Apache Storm is used for processing that stream.

You can use Apache Kafka instead of Google Cloud Pub/Sub if you need messages in order.

You can use instead of Google Cloud Dataflow if you need highly scalable pipelines.
