Hadoop Big Data. | 1
Total Page:16
File Type:pdf, Size:1020Kb
hadoop big data. | 1 how to learn hadoop. *this is meant as a study guide that was created from lecture videos and is used to help you gain an understanding of how does ai work. How to Learn Hadoop for Data Science Hadoop is open source and has been around for over 10 years. There are current innovations with Hadoop in the cloud sector. Hadoop is being implemented for IoT devices and Genomics data. Hadoop is made up of three components: Storage (HDFS), Compute (MapReduce, Spark) and Management (YARN, Mesos). Apache Spark is used for computing processes in memory of the worked nodes, which allows for quicker processing speeds of Hadoop jobs. HDFS is used for batch, extract transformation and load jobs that was included in the Hadoop core distribution. Data lakes are examples such as Amazon S3 or Google Cloud Storage and they are cloud- based file storages. These options are typically cheaper than using HDFS locally or on the cloud itself. Cloudera and Databricks are the current leading commercial distributions of Hadoop. Public cloud distributions are Google Cloud dataproc, AWS EMR and Microsoft Azure HDInsight. As you add Hadoop libraries into your production environment, ensure you are matching the mind movement machine. hadoop big data. | 2 proper versions so your code runs. Hadoop on GCP Sign Up for GCP -> Login to GCP Console -> In left side menu, Big Data: Dataproc -> Create cluster to create a managed Hadoop cluster -> name the cluster and fill out additional information -> click Create -> you can then create a Job, types of Dataproc jobs are Hadoop, Spark, PySpark, Hive, SparkSql and Pig. -> Submit Job Preemptible worker nodes are provided in Google Cloud dataproc production instances and are used to speed up batch jobs as these worker nodes are very cheaper. Apache Spark is made from Databricks and they have a commercial distribution. Databricks with Apache Spark comes with Jupyter notebooks for data scientists. Setup IDE in Databricks Apache Spark Download your preferred IDE for Python, Scala or Java (which are the programming languages used to program against the Spark API) -> go into databricks and signup for Community Edition (Google Chrome works best for databricks) Adding Hadoop Libraries to databricks (example library is Apache Avro) add Hadoop libraries to databricks by going into databricks workspace -> Shared -> Create: Library -> Source: Maven Coordinate ->Coordinate: avro -> Click Search Spark Packages -> switch to Maven Central once in Search Packages -> type in avro_tools__1.8.1 -> Create Library mind movement machine. hadoop big data. | 3 Setting up clusters in databricks Databricks UI -> Clusters -> Create Cluster -> select Spark version and go with default settings -> Create Cluster Loading data into tables in databricks Databricks UI -> Tables -> Create Table -> select Data Source such as a csv or from a Cloud Bucket -> name table -> Create Table Hadoop Batch Processing Use MapReduce for Batch Processing. YARN is used for scaling when moving from development to production. YARN is scalable and fault tolerant. It handles memory scheduling in a familiar type way of Hadoop schedulers. Apache Mesos is a newer alternative to YARN and is designed for Hadoop workloads. Apache Mesos was designed as a cluster OS framework and handles both job scheduling and container group management. YARN vs Spark Standalone Hadoop Use Cases: Stream Data Processing Use Cases mind movement machine. hadoop big data. | 4 Streaming ETL is used to take data coming in, process it and respond to it Data enrichment is taking data streaming in and combining it with other data sources Trigger event detection is used for devices such as credit card services Machine Learning Use Cases Sentiment Analysis which can be explained easiest through the Twitter sentiment example. Fog Computing Use Cases IoT device messages is used to get information from devices and the edge layer to respond quickly Type of Big Data Streaming Streaming Options How to Learn Hadoop with Apache Spark Open source cluster computing framework that uses in-memory objects for distributed computing. Apache Spark is faster than MapReduce over 100 times. The objects that Spark uses are called Resilient distributed datasets or RDD. Spark has many utility libraries. Apache Spark Libraries Learn more about Spark by visiting Apache. mind movement machine. hadoop big data. | 5 Spark Data Interfaces Resilient Distributed Datasets are low level interface to a sequence of data object in memory. DataFrames are a collection of distributed row types and are similar to what you find with Python data frames. Datasets are new for Spark 2.0 and are distributed collections that combine DataFrames and RDD functionality. Spark Programming Languages Python is easiest, Scala is functional and scalable, Java is for enterprise and R is geared towards machine learning. SparkSession Objects In Spark 1.x, SparkSession Objects are called SQLContext or sc. In Spark 2.0, SparkSession Objects are called spark. Spark shell Connect Spark with your Created Cluster in databricks- Databricks UI -> Create new cluster -> Create New Notebook -> in the notebook, type sc OR spark OR sqlContext and click execute mind movement machine. hadoop big data. | 6 Databricks UI Visit Databricks to setup your account. Workspace is the container for stored information. Tables will provide the tables that have been created. Clusters will show the clusters created. Within Clusters: Spark UI, you can see the Jobs, Storage, Environment and Execution. Executors will show the RDD blocks, which is where your data stores for Spark. Working with Sparks notebook in databricks Databricks UI -> Workspace -> Users -> select notebook To add markdown language, use %md This is a markdown note example. You can add notebooks into dashboards. You can see revision history and link to github. Import and export Spark notebooks in databricks File Types to Export- DBC Archive is databricks archive, Source File is the file exported in the language used (.py for Python), iPython notebook and HTML, which is a static web page. Exporting as a Source File and importing into a code editor is a good option. mind movement machine. hadoop big data. | 7 To import a file, in the databricks UI select Workspace: Import and drag the file to import. Example wordcount in databricks Spark using Scala Databricks Spark Data Sources Transformations and Actions Spark Transformation operations do not show results without a Spark Action operation. Example of Spark Transformations are select, distinct, groupby, sum, orderby, filter and limit. Examples of Spark Action operations are show, count, collect and save. Take is a Spark action that selects a certain number of elements from the beginning of a dataframe. Pipelining in Spark is the separation of operations of Transformations and Actions. DAG A DAG will show you a visualization of the detailed process of what is occurring, the overhead, how long it is taking, how many executors there are and etc. View the specific Spark Job to see the DAG. Spark SQL Using Spark SQL in databricks- Workspace -> Import Notebook -> read the csv data into Spark using spark.read mind movement machine. hadoop big data. | 8 -> display data with display -> use createOrReplaceTempView to register the imported csv data into Spark SQL Tables -> you can then you SQL with %SQL -> SparkR Easiest way to use SparkR is to make a local R dataframe and turn it into a SparkR DataFrame. Use the createDataFrame method to do this. Spark ML with SparkR in databricks Workspace -> Import Notebook -> Create a cluster -> load (csv) data with spark.read -> prepare data for model by cleaning it with appropriate Spark commands Example of Pyspark Spark ML Model (Linear Regression) in databricks Spark with GraphX GraphX is a library that helps with graph databases and it is an API that can be on top of Spark. It is used for graph-parallel computation. Spark with ADAM is used for genomics. Spark Streaming Streaming handles data coming in either through batch ingesting or stream ingesting. Amazon Kinesis and Google Cloud Pub/Sub are examples of proprietary solutions. mind movement machine. hadoop big data. | 9 Streaming in real time when the data is being processed can be batch with MapReduce or streaming with Apache Spark or Apache Storm. Using PySpark for Batch Processing in databricks- Databricks UI -> create a notebook -> verify you have spark by typing in spark in the notebook -> load data -> *example code Using PySpark for Stream Processing in databricks- Databricks UI -> create a notebook -> verify you have spark by typing in spark in the notebook -> load data -> *example code How to Learn Hadoop with Streaming Services Apache Kafka, Amazon Kinesis, Google Cloud Pub/Sub and Microsoft Azure Service Bus. Streaming with Google Cloud Pub/Sub Go to Google Cloud to setup an account and get started. Login to GCP Console -> Create New Project -> Launch Cloud Shell -> Create a Topic, which is a named resource which you send messages, in gcloud shell -> Create a subscription, which is how long messages will be retained Apache Kafka Apache Kafka is used for massive streaming ingest of data. It is a set of servers. The server cluster stores the record streams into topics. Each record has a key, value and a timestamp. The four API’s used with Kafka are Producer API: used to publish to topics, Consumer API: used to subscribe to topics, Streams API: used to control the input and output of streams and mind movement machine. hadoop big data. | 10 Connector API: used to connect to your systems such as a database. Kafka features include Pub/Sub, in-order messages, streaming and is fault-tolerant. Apache Storm An alternative streaming processor to Apache Spark. Apache Storm is now being referenced as Apache Heron. It is a real time stream processor that is a record-at-a-time ingest pipeline. Apache Storm core concepts are Topologies: real-time application logic, Streams: unbound sequence of tuples, Sputs: source of a stream in a topology, Bolts: performs processes on stream data and Stream grouping: section/partition of topology.