Big Data Administrator

 Module I: Introduction to Big Data and Ecosystem

o Data Types - RDBMS, NoSQL, Time Series, Graph, Filesystem, Stream, Sensor, Spatial.
o Distributed / Parallel Processing Concepts.
o Big Data Characteristics, Challenges with Traditional Systems.
o Hadoop Timeline & History.
o Fundamentals, Core Components, Rack Awareness, Node & Cluster Concept.
o Solution Types, Distributions & Specialties, Challenges & Complexity & Use Cases.
o Operating Systems, Filesystems & Other Terminology.
. Red Hat, CentOS, Ubuntu - VM, Server, AWS & Other Cloud Options.
. Ext3, Ext4, SAN, NAS, NFS, RAID, S3, ZFS, Alluxio, QuantcastFS, XtreemFS, BeeGFS, MooseFS, OrangeFS, LizardFS, Lucene.
. OpenLDAP, DNS, DHCP, NTP, Kerberos, CA, SSH, PuTTY, HAProxy, SaltStack.
o Role Expectation, Job Description, Responsibilities & Growth Plan

. Data Modelling, Design, ETL (Development / Process) Management, Capacity Planning, Proposals, POCs & Deployment for new or expanded Hardware and Software Environments, working with the Systems Engineering, Infrastructure, Network, Database, Application, Data Delivery and Business Intelligence teams to ensure business applications are highly available and perform within agreed SLAs.

. Installation, Implementation, Administration, Configuration, Connectivity, Scaling, Backup, Recovery, Updates, Upgrades, Security, (OS {Primarily Linux} / Memory / Network / Disk / File / User / Node / Volume) Management, Performance Monitoring, Tuning, Task Automation {Bash Scripting}, Maintenance, Support, CI Integration, Log Review (Data Exhaust), Quality Audit, and (Developing / Documenting) Best Practices & Benchmarking for New, Ongoing & Existing Enterprise Clusters, based on a specific / generic Distro or Cloud Provider.

. Primary Point of Contact for Vendor Selection, Management & Escalation.

 Module II: HDFS, Hadoop Architecture & YARN

o HDFS Components, Fault Tolerance, Horizontal Scaling, Block Size, Replication Factor, Daemons, HA, Federation, Quotas.
o Anatomy of Read / Write & Failure / Recovery on HDFS.
o YARN “The Hadoop OS” In Depth (Architecture, HA, RM, Scheduler, Queues, Node Labels).
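
As a worked illustration of the block size and replication topics in this module, the sketch below estimates how many blocks a single file occupies and how much raw disk it consumes. The 128 MB block size and replication factor of 3 are the usual Hadoop 2.x/3.x defaults and are treated here as assumptions; the function name `hdfs_footprint` is hypothetical.

```python
MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (block_count, raw_bytes) for one file stored in HDFS.

    Assumes plain replication (no erasure coding). The last block only
    occupies the bytes actually written, so raw usage is size * replication.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication < 1:
        raise ValueError("sizes must be non-negative and replication >= 1")
    # Integer ceiling division: a partially filled last block still counts.
    blocks = -(-file_size_bytes // block_size_bytes)
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes


if __name__ == "__main__":
    blocks, raw_bytes = hdfs_footprint(1000 * MB)
    print(blocks, raw_bytes // MB)  # prints: 8 3000
```

A 1000 MB file therefore needs 8 blocks (7 full, 1 partial) and about 3000 MB of raw capacity across the cluster under these assumptions.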

 Module III: Environment

o Stack Insight {On Premise Vs Cloud} (Cloudera, Hortonworks, MapR, AWS).
o Capacity Planning, Hardware / Virtualization Options.
o Multi Node “Cloudera” Cluster “First Look”.
o Architecture Discussion, Network Setup & Nodes Enlisting for “Batch” Multi Node Cluster for Classroom assignments and learning.
o Automated Bash Scripts Creation / Understanding for speedy deployment.
o OS Modifications, Java, MySQL & Other required Installations.
o Ensuring All Lab & Participants' System Prerequisites are fulfilled before proceeding further.
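
The capacity planning topic in this module can be approached with simple arithmetic. The sketch below is one rough way to size DataNode count, not a vendor formula: the replication factor, the 25% free-space headroom and the 48 TB of usable disk per node are all illustrative assumptions, and `datanodes_needed` is a hypothetical name.

```python
import math


def datanodes_needed(data_tb, replication=3, headroom=0.25, usable_tb_per_node=48.0):
    """Estimate how many DataNodes are needed to hold `data_tb` of data.

    headroom is the fraction of total disk deliberately kept free
    (for temporary data, rebalancing and growth).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    if usable_tb_per_node <= 0:
        raise ValueError("usable_tb_per_node must be positive")
    raw_tb = data_tb * replication           # every block is stored `replication` times
    required_tb = raw_tb / (1.0 - headroom)  # inflate so `headroom` stays free
    return math.ceil(required_tb / usable_tb_per_node)


if __name__ == "__main__":
    print(datanodes_needed(100))  # prints: 9
```

With these assumed numbers, 100 TB of data becomes 300 TB raw, 400 TB once headroom is reserved, and 9 nodes after rounding up.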

 Module IV: Cloudera Multi Node “On Premise” Cluster (CentOS 7 + CDH 5.15/6.0.1)

o Set up a local CDH repository. Install Cloudera Manager Server and agents. Install CDH using Cloudera Manager. Add a new node to an existing cluster. Add a service using Cloudera Manager.
o Configure a service using Cloudera Manager. Create an HDFS user's home directory. Configure NameNode HA. Configure ResourceManager HA. Configure proxy for HiveServer2/Impala.
o Rebalance the cluster (bandwidth, balance). Set up alerting for excessive disk fill. Define and install a rack topology script. Install new type of I/O compression library in cluster. Revise YARN resource assignment based on user feedback. Commission/decommission a node.
o Configure HDFS ACLs. Install and configure Sentry. Configure Hue user authorization and authentication. Enable/configure log and query redaction. Create encrypted zones in HDFS. LDAP Authentication on Gateway Machines.
o Execute file system commands via HTTPFS. Efficiently copy data within / between clusters. Create/restore a snapshot of an HDFS directory. Get/set ACLs for a file or directory structure. Benchmark the cluster (I/O, CPU, and network).
o Resolve errors/warnings in Cloudera Manager. Resolve performance problems/errors in cluster operation. Determine reason for application failure. Configure the Fair Scheduler to resolve application delays.
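
To make the rack topology script item above concrete, here is a minimal sketch of such a script. Hadoop invokes the script configured in `net.topology.script.file.name` with one or more IP addresses or hostnames as arguments and expects one rack path per argument on standard output. The host-to-rack table, the addresses and the rack names below are hypothetical placeholders, not values from this course.

```python
#!/usr/bin/env python3
import sys

# Hypothetical host-to-rack mapping; replace with the cluster's real inventory.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
    "worker21.example.com": "/dc1/rack2",
}

# Rack reported for any host that is missing from the table.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # One rack path per line, matching the order of the arguments.
    print("\n".join(resolve(sys.argv[1:])))
```

Unknown hosts fall back to a default rack rather than failing, so a node that is not yet in the table can still join the cluster while the mapping is updated.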

 Module V: Hortonworks Multi Node “On Premise” Cluster (CentOS 7 + HDP 2.6/3.1.0)

o Configure a local HDP repository. Install ambari-server and ambari-agent. Install HDP using the Ambari install wizard. Add a new node to an existing cluster. Decommission a node. Add an HDP service to a cluster using Ambari.
o Define and deploy a rack topology script. Change the configuration of a service using Ambari. Configure the Capacity Scheduler. Create a home directory for a user and configure permissions. Configure the include and exclude DataNode files.
o Restart an HDP service. View an application’s log file. Configure and manage alerts. Troubleshoot a failed job.
o Configure NameNode HA. Configure ResourceManager HA. Copy data between two clusters using distcp. Create a snapshot of an HDFS directory. Recover a snapshot. Configure HiveServer2 HA.
o Configure HDFS ACLs. Kerberos Implementation. LDAP Authentication on Gateway Machines. Benchmark the cluster (I/O, CPU, and network).

 Apart from the Multi Node Local LAN Batch Clusters mentioned above, single / multi node clusters on a local machine through VM virtualization will also be covered.

 Data ETL through Talend Open Studio / Sqoop & Orchestration through SaltStack.

 AWS Instance Types, creation and access will be discussed. Where the free tier permits, the installations described above will be attempted.

 Industry Compliant Practical Curriculum On Latest Stacks.

 Guidance for Resume Preparation & Interview Questions.

 POC Suggestions for further practice and pursuit.

 Mock Interviews for Interested Candidates.

 Discussing Current Market Standards & New / Incubating Tech.

 Other Q & A, Doubt Clearance.

 Discussing Cloudera & Hortonworks Certification Programs.