Big Data Administrator • Module I
Module I: Introduction to Big Data and Ecosystem
o Data Types - RDBMS, NoSQL, Time Series, Graph, Filesystem, Stream, Sensor, Spatial.
o Distributed / Parallel Processing Concepts.
o Big Data Characteristics, Challenges with Traditional Systems.
o Hadoop Timeline & History.
o Fundamentals, Core Components, Rack Awareness, Node & Cluster Concept.
o Solution Types, Distributions & Specialties, Challenges & Complexity & Use Cases.
o Linux, Filesystems & Other Terminology.
  - RedHat, CentOS, Ubuntu - VM, Server, AWS & Other Cloud Options.
  - Ext3, Ext4, SAN, NAS, NFS, RAID, S3, ZFS, Alluxio, QuantcastFS, XtreemFS, BeeGFS, MooseFS, OrangeFS, LizardFS, Lucene.
  - OpenLDAP, DNS, DHCP, NTP, Kerberos, CA, SSH, PuTTY, HAProxy, SaltStack.
o Role Expectation, Job Description, Responsibilities & Growth Plan.
  - Data Modelling, Designing, ETL (Development / Process) Management, Capacity Planning, Proposals, POCs & Deployment for (New / Expanded) Hardware and Software Environments, working with Systems Engineering, Infrastructure, Network, Database, Application, Data Delivery and Business Intelligence teams to ensure business applications are highly available and performing within agreed SLAs.
  - Installation, Implementation, Administration, Configuration, Connectivity, Scaling, Backup, Recovery, Updates, Upgrades, Security, (OS {Primarily Linux} / Memory / Network / Disk / File / User / Node / Volume) Management, Performance Monitoring, Tuning, Task Automation {Bash Scripting}, Maintenance, Support, CI Integration, Log Review (Data Exhaust), Quality Audit, (Develop / Document) Best Practices & Benchmarking for New, Ongoing & Existing Enterprise Clusters, Based on a Specific / Generic Distro or Cloud Provider, and Apache Hadoop.
  - Primary Point of Contact for Vendor Selection, Management & Escalation.

Module II: HDFS, Hadoop Architecture & YARN
o HDFS Components, Fault Tolerance, Horizontal Scaling, Block Size, Replication Factor, Daemons, HA, Federation, Quotas.
o Anatomy of Read / Write & Failure / Recovery on HDFS.
o YARN “The Hadoop OS” In Depth (Architecture, HA, RM, Scheduler, Queues, Node Labels).

Module III: Environment
o Stack Insight {On Premise Vs Cloud} (Cloudera, Hortonworks, MapR, AWS).
o Capacity Planning, Hardware / Virtualization Options.
o Multi Node “Cloudera” Cluster “First Look”.
o Architecture Discussion, Network Setup & Node Enlistment for the “Batch” Multi Node Cluster Used for Classroom Assignments and Learning.
o Creating / Understanding Automated Bash Scripts for Speedy Deployment.
o OS Modifications, Java, MySQL & Other Required Installations.
o Ensuring All Lab & Participant System Prerequisites Are Fulfilled Before Proceeding.

Module IV: Cloudera Multi Node “On Premise” Cluster (CentOS 7 + CDH 5.15/6.0.1)
o Set up a local CDH repository. Install Cloudera Manager Server and agents. Install CDH using Cloudera Manager. Add a new node to an existing cluster. Add a service using Cloudera Manager.
o Configure a service using Cloudera Manager. Create an HDFS user's home directory. Configure NameNode HA. Configure ResourceManager HA. Configure a proxy for HiveServer2/Impala.
o Rebalance the cluster (bandwidth, balance). Set up alerting for excessive disk fill. Define and install a rack topology script. Install a new type of I/O compression library in the cluster. Revise YARN resource assignment based on user feedback. Commission/decommission a node.
o Configure HDFS ACLs. Install and configure Sentry. Configure Hue user authorization and authentication. Enable/configure log and query redaction. Create encrypted zones in HDFS. LDAP Authentication on Gateway Machines.
o Execute file system commands via HttpFS. Efficiently copy data within / between clusters. Create/restore a snapshot of an HDFS directory. Get/set ACLs for a file or directory structure. Benchmark the cluster (I/O, CPU, and network).
o Resolve errors/warnings in Cloudera Manager. Resolve performance problems/errors in cluster operation. Determine the reason for an application failure. Configure the Fair Scheduler to resolve application delays.

Module V: Hortonworks Multi Node “On Premise” Cluster (CentOS 7 + HDP 2.6/3.1.0)
o Configure a local HDP repository. Install ambari-server and ambari-agent. Install HDP using the Ambari install wizard. Add a new node to an existing cluster. Decommission a node. Add an HDP service to a cluster using Ambari.
o Define and deploy a rack topology script. Change the configuration of a service using Ambari. Configure the Capacity Scheduler. Create a home directory for a user and configure permissions. Configure the include and exclude DataNode files.
o Restart an HDP service. View an application’s log file. Configure and manage alerts. Troubleshoot a failed job.
o Configure NameNode HA. Configure ResourceManager HA. Copy data between two clusters using distcp. Create a snapshot of an HDFS directory. Recover a snapshot. Configure HiveServer2 HA.
o Configure HDFS ACLs. Kerberos Implementation. LDAP Authentication on Gateway Machines. Benchmark the cluster (I/O, CPU, and network).

In addition to the Multi Node Local LAN Batch Clusters described above, single / multi node clusters on a local machine through VM virtualization will also be covered. Data ETL through Talend Open Studio / Sqoop & Orchestration through SaltStack. AWS Instance Types, creation and access will be discussed. Where the free tier permits, the installations above will be attempted.

Industry Compliant Practical Curriculum on Latest Stacks.
Guidance for Resume Preparation & Interview Questions.
POC Suggestions for Further Practice and Pursuit.
Mock Interviews for Interested Candidates.
Discussing Current Market Standards & New / Incubating Tech.
Other Q & A, Doubt Clearance.
Discussing Cloudera & Hortonworks Certification Programs.