Job Oriented Engineering Course in Business Intelligence and Data Analytics
Sr. No.  Subject
1        Statistical Analysis Techniques
2        Big Data Analytics
3        Machine Learning
4        Computer Programming
5        Professional Development courses
6        Project
STATISTICAL ANALYSIS TECHNIQUES
UNIT I - DATA PREPROCESSING
Reading and getting data into R – ordered and unordered factors – arrays and matrices – lists and data frames – reading data from files.
Data Preprocessing: handling incomplete or incorrect data, handling missing values, subsetting, sorting, transforming scale, determining percentiles, removing noise, removing inconsistencies, transformations, standardizing, min-max normalization, z-score standardization
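The two scaling topics that close this unit, min-max normalization and z-score standardization, can be sketched as follows. This is a minimal illustration in Python rather than R (the tool the unit names); the function names and the sample `marks` data are hypothetical, and the z-score here uses the population standard deviation, which is only one of the common conventions.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_standardize(values):
    """Centre values on the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample data, used only for illustration.
marks = [40, 50, 60, 70, 80]
scaled = min_max_normalize(marks)
standardized = z_score_standardize(marks)
print(scaled)
print(standardized)
```

Min-max normalization keeps the shape of the distribution but is sensitive to extreme values, while z-score standardization expresses each value as a number of standard deviations from the mean.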
UNIT II - DESCRIPTIVE STATISTICS
Populations and samples, Sampling techniques, Data classification, Tabulation, Frequency and graphic representation.
Measures of central value: Arithmetic mean, Geometric mean, Harmonic mean, Mode, Median, Quartiles, Deciles, Percentiles.
Measures of variation: Range, IQR, Quartile deviation, Mean deviation, Standard deviation.
Measures of association: Coefficient of variation, ANOVA, Correlation, Outliers.
Measures of shape: Skewness, Moments and Kurtosis
UNIT III - INFERENTIAL STATISTICS AND HYPOTHESIS TESTING
Random variable, probability distributions, joint probability function, Sampling distribution of the mean, Central Limit Theorem, Standard error.
Estimation - Point and interval estimates, Confidence intervals, level of confidence, sample size.
Hypothesis Testing - Level of significance, p-value, z-test, t-test, chi-square test, one- and two-tailed tests, uses of the t-distribution, F-distribution, χ² distribution.
Conditional probability, expectation, independence, Bayes' rule
UNIT IV - PREDICTIVE ANALYTICS
Predictive modeling and analysis - Regression analysis, Multicollinearity, Correlation analysis, Rank correlation coefficient, Multiple correlation, Least squares, Curve fitting and goodness of fit, Residual analysis, Logistic regression
UNIT V - EXPLORATORY DATA ANALYSIS AND VISUALIZATION
Boxplot, scatter plot, histogram, model visualization, clustering and classification.
Bringing data to life with visuals using R, Excel and tools like Tableau. Introduction to graphical analysis – plot() function – displaying multivariate data – matrix plots – multiple plots in one window – exporting graphs – using graphics parameters
UNIT VI - TIME SERIES FORECASTING
Forecasting models for time series: Time series data, components of a time series, time series forecasting methods - Simple moving average, Simple exponential smoothing, Double exponential smoothing (Holt's method), Triple exponential smoothing (Holt-Winters method)
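The first two forecasting methods in the time series unit, the simple moving average and simple exponential smoothing, can be sketched as below. This is an illustrative Python sketch, not course material: the function names, the `sales` series, the window of 3 and the smoothing factor of 0.5 are all assumptions, and initialising the smoothed series with the first observation is only one common choice.

```python
def simple_moving_average(series, window):
    """Return the mean of each consecutive block of `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window:i]) / window
        for i in range(window, len(series) + 1)
    ]


def simple_exponential_smoothing(series, alpha):
    """Smooth a series with s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # assumption: start from the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical series with a steady upward trend.
sales = [10, 12, 14, 16, 18]
sma = simple_moving_average(sales, 3)
ses = simple_exponential_smoothing(sales, 0.5)
print(sma)
print(ses)
```

Both methods lag behind a trending series, which is the motivation for the double (Holt) and triple (Holt-Winters) variants listed in the unit.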
BIG DATA ANALYTICS
UNIT I - INTRODUCTION AND HADOOP ARCHITECTURE
Big Data and its importance, Apache Hadoop and the Hadoop ecosystem, Moving data in and out of Hadoop, Hadoop architecture, Hadoop daemons, Schedulers, Hadoop 2.0 new features, YARN, Cluster setup, SSH and Hadoop configuration
UNIT II - HDFS AND MAPREDUCE
Introduction to distributed file systems, Common Hadoop shell commands, Hadoop storage: HDFS, blocks, replication, HDFS commands.
Hadoop MapReduce paradigm, Map and Reduce tasks, inputs and outputs of MapReduce, Data serialization, Map-side and Reduce-side joins, writing MR jobs in Java, Running MR jobs in local / pseudo-distributed / cluster mode, Data locality, Shuffling and sorting
UNIT III - PIG
Pig fundamentals, MapReduce vs. Pig, data types, programming constructs, execution modes, Grunt shell, Scripts, Built-in functions, Relational join operators, Core relational operators, How to write UDFs in Pig
UNIT IV - HIVE AND HIVEQL
Hive architecture and installation, Hive vs. RDBMS, Built-in Hive functions, HiveQL - Querying data, Sorting and aggregating, Joins and subqueries, How to write UDFs in Hive
UNIT V - HBASE
HBase concepts, Schema design, HBase shell, HBase Java API for CRUD operations, HBase vs. RDBMS.
Introduction to ZooKeeper, Oozie, Flume, Sqoop
UNIT VI - SPARK
Spark introduction, Framework, Installation, Spark with MapReduce, Spark SQL with DataFrames, Spark ML
MACHINE LEARNING
UNIT I - DATA PREPROCESSING
Text preprocessing, stop word removal, stemming, Dimensionality reduction, Feature selection algorithms, TF-IDF computation
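As an illustration of the TF-IDF computation named in this unit, the sketch below weights each term by its relative frequency in a document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. This is one common variant among several; the function name, the whitespace tokenizer and the three sample documents are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    # Assumption: lower-casing and whitespace splitting stand in for real
    # text preprocessing (stop word removal, stemming, and so on).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical mini-corpus.
docs = ["big data analytics", "big data tools", "machine learning"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, w)
```

Terms that occur in many documents, such as "big" here, receive a lower weight than terms confined to a single document, such as "analytics".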
UNIT II - CLASSIFICATION
Supervised learning, Bayesian classification, k-Nearest Neighbors (k-NN), Decision trees, Support Vector Machines, Neural networks, Multi-label classification.
Overfitting and underfitting, bagging, boosting and ensemble methods, Classifier performance measures, confusion matrix, Cross-validation
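The classifier performance measures in this unit derive from the four cells of a binary confusion matrix. The sketch below is a minimal Python illustration; the function names and the `actual` and `predicted` label lists are hypothetical, and labels are assumed to be 1 (positive) or 0 (negative).

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def performance_measures(tp, fp, fn, tn):
    """Derive accuracy, precision and recall from the confusion matrix cells."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels for eight test instances.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
cells = confusion_counts(actual, predicted)
measures = performance_measures(*cells)
print(cells)
print(measures)
```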
UNIT III - CLUSTERING AND OUTLIER ANALYSIS
Unsupervised learning, K-means algorithm, other techniques, Interpretation of clusters and validation.
Introduction to outlier mining, Applications, Detection techniques
UNIT IV - ASSOCIATION MINING
Association rule mining, Apriori algorithm, Market basket analysis, Associative classification
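Association rule mining and market basket analysis rest on two measures: the support of an itemset and the confidence of a rule. The sketch below computes both in Python for a hypothetical set of shopping baskets. It does not implement the Apriori candidate-generation procedure itself, and all names and data are assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
sup_bread_milk = support(baskets, {"bread", "milk"})
conf_bread_to_milk = confidence(baskets, {"bread"}, {"milk"})
print(sup_bread_milk)
print(conf_bread_to_milk)
```

Apriori uses the fact that every subset of a frequent itemset must itself be frequent, so itemsets whose support falls below a chosen threshold need not be extended.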
UNIT V - TEXT ANALYTICS
Introduction, text mining operations, Categorization, Clustering, Information extraction, Text mining applications
UNIT VI - BUSINESS INTELLIGENCE
What is a data warehouse, need for a data warehouse, architecture, Data integration, data marts, OLTP vs. OLAP, Multidimensional modeling: Star and snowflake schemas, Data cubes, Enterprise reporting.
OLAP operations, Data cube computation and data generalization, Data lake, Recent trends
[Note: analysis using tools like R / SCILAB / WEKA / MEKA]