Job Oriented Engineering Course in Business Intelligence and Data Analytics

Sr. No.  Subject
1        Statistical Analysis Techniques
2        Big Data Analytics
3        Machine Learning
4        Computer Programming
5        Professional Development courses
6        Project

STATISTICAL ANALYSIS TECHNIQUES

UNIT I - DATA PREPROCESSING Reading and getting data into R – ordered and unordered factors – arrays and matrices – lists and data frames – reading data from files. Data Preprocessing: handling incomplete or incorrect data, handling missing values, subsetting, sorting, transforming scale, determining percentiles, removing noise, removing inconsistencies, transformations, standardizing, min-max normalization, z-score standardization
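
The two rescaling techniques named at the end of this unit can be illustrated with a short sketch. The course itself works in R; Python is used here only for illustration, and the function names and sample values are assumptions, not part of the syllabus. The z-score version uses the population standard deviation, which is one common convention.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_standardize(values):
    """Center values on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample data, chosen only for illustration.
scores = [10, 20, 30, 40, 50]
print(min_max_normalize(scores))
print(z_score_standardize(scores))
```

Min-max normalization keeps the shape of the distribution but is sensitive to extreme values; z-score standardization yields values with mean 0 and unit standard deviation.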

UNIT II - DESCRIPTIVE STATISTICS Populations and samples, Sampling Techniques - Data classification, Tabulation, Frequency and Graphic representation; Measures of central value: Arithmetic mean, Geometric mean, Harmonic mean, Mode, Median, Quartiles, Deciles, Percentiles; Measures of variation: Range, IQR, Quartile deviation, Mean deviation, Standard deviation; Measures of association: coefficient variance, ANOVA, correlation, outliers; Measures of shape: Skewness, Moments and Kurtosis

UNIT III - INFERENTIAL STATISTICS AND HYPOTHESIS TESTING Random variable, probability distributions, joint probability function, Sampling distribution of mean, Central Limit Theorem, Standard Error. Estimation - Point and Interval Estimates, Confidence Intervals, level of confidence, sample size. Hypothesis Testing - Level of significance, p-value, z-test, t-test, chi-square test, one- and two-tailed tests, uses of t-distribution, F-distribution, χ2 distribution. Conditional probability, expectation, independence, Bayes' rule
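
As a minimal sketch of an interval estimate from this unit, the code below computes a z-based confidence interval for a sample mean as mean ± z × (s / √n). It is written in Python purely for illustration (the course tools are R and similar); the function name and the sample values are assumptions. The normal approximation suits large samples; for small samples a t-based interval would normally be used instead.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) bounds of a z-based confidence interval for the mean."""
    n = len(sample)
    center = mean(sample)
    # Standard error of the mean, using the sample standard deviation.
    standard_error = stdev(sample) / sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return center - margin, center + margin


# Hypothetical measurements, chosen only for illustration.
measurements = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_confidence_interval(measurements, confidence=0.95))
```

Raising the level of confidence widens the interval, while a larger sample size narrows it through the standard error.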

UNIT IV - PREDICTIVE ANALYTICS Predictive modeling and Analysis - Regression Analysis, Multicollinearity, Correlation analysis, Rank correlation coefficient, Multiple correlation, Least squares, Curve fitting and goodness of fit, Residual analysis, Logistic regression

UNIT V - EXPLORATORY DATA ANALYSIS AND VISUALIZATION Boxplot, scatter plot, histogram, model visualization, clustering and classification. Make your data come alive with visuals using R, Excel and tools like Tableau. Introduction to graphical analysis – plot() function – displaying multivariate data – matrix plots – multiple plots in one window – exporting graphs – using graphics parameters

UNIT VI - TIME SERIES FORECASTING Forecasting Models for Time Series: Time series data, components of time series, time series forecasting methods - Simple Moving Average, Simple exponential smoothing, Double exponential smoothing (Holt's method), Triple exponential smoothing (Holt-Winters method)
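
The two simplest forecasting methods listed for time series can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function names, the smoothing factor and the sample series are hypothetical, and the exponential smoother is initialized with the first observation, which is one common choice among several.

```python
def simple_moving_average(series, window):
    """Average each run of `window` consecutive observations."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def simple_exponential_smoothing(series, alpha):
    """Smooth a series with s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    smoothed = [series[0]]  # assumption: start from the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical monthly demand figures, chosen only for illustration.
demand = [10, 20, 30, 40, 50]
print(simple_moving_average(demand, window=3))
print(simple_exponential_smoothing(demand, alpha=0.5))
```

A larger alpha makes the smoothed series react faster to recent observations; Holt's and Holt-Winters methods extend the same recursion with trend and seasonal terms.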

BIG DATA ANALYTICS

UNIT I – INTRODUCTION AND HADOOP ARCHITECTURE Big Data and its importance, Apache Hadoop and Hadoop Ecosystem, Moving Data in and out of Hadoop, Hadoop Architecture, Hadoop daemons, Schedulers, Hadoop 2.0 New Features, YARN Cluster Setup, SSH and Hadoop Configuration

UNIT - II HDFS AND MAPREDUCE Introduction to distributed file system, Common Hadoop Shell commands, Hadoop Storage: HDFS, blocks, replication, HDFS commands. Hadoop MapReduce paradigm, Map and Reduce tasks, inputs and outputs of MapReduce - Data Serialization, Map-side / Reduce-side Join, writing MR jobs in Java, Running MR jobs in local / pseudo-distributed / cluster mode, Data Locality, Shuffling and sorting
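
The map, shuffle-and-sort, and reduce stages named in this unit can be illustrated with an in-memory word count. This sketch is plain Python and does not use Hadoop or its Java API (which the unit actually covers); all function names are hypothetical and the whole pipeline runs in a single process, so it shows only the data flow, not distribution or data locality.

```python
from collections import defaultdict


def map_phase(line):
    """Map task: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle and sort: group values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce task: sum the counts collected for one key."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data", "Big Hadoop data"]))
```

In a real cluster each map task would run near the block holding its input, and the framework would perform the shuffle across the network.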

UNIT - III PIG PIG fundamentals, MapReduce vs. PIG, data types, programming constructs, execution modes, Grunt Shell, Script, Built-in Functions, Relational Join Operators, Core Relational Operators, How to write UDFs in Pig

UNIT - IV HIVE AND HIVEQL Hive Architecture and Installation, Hive vs RDBMS, Built-in Hive Functions, HiveQL - Querying Data - Sorting and Aggregating, Joins and Subqueries, How to write UDFs in Hive

UNIT - V HBASE HBase concepts, Schema Design, HBase Shell, HBase Java API for CRUD Operations, HBase vs. RDBMS. Introduction to ZooKeeper, Oozie, Flume, Sqoop

Unit VI - SPARK Spark Introduction, Framework, Installation, Spark with Map-Reduce, Spark-SQL with dataframes, Spark ML

MACHINE LEARNING

Unit I - DATA PREPROCESSING Text preprocessing, stop word removal, stemming, Dimensionality Reduction, Feature Selection algorithms, TF-IDF computation
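
A minimal sketch of the TF-IDF computation mentioned in this unit, in Python for illustration only. It assumes one common variant: term frequency as count divided by document length, and inverse document frequency as the natural log of (number of documents / number of documents containing the term). Other weightings exist; the function name and the toy documents are hypothetical, and documents are assumed to be already tokenized.

```python
from collections import Counter
from math import log


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append(
            {
                term: (count / len(doc)) * log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


# Hypothetical tokenized documents, chosen only for illustration.
docs = [["data", "mining"], ["data", "science"]]
for row in tf_idf(docs):
    print(row)
```

A term that occurs in every document receives weight 0 under this variant, which is why TF-IDF pairs naturally with stop word removal.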

Unit II - CLASSIFICATION Supervised learning, Bayesian Classification, k-Nearest Neighbors (k-NN), Decision tree, Support Vector Machines, Neural Networks, Multi-label classification. Overfitting/Underfitting, bagging/boosting and ensemble methods, Classifier performance measures, confusion matrix, Cross validation
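
The classifier performance measures in this unit derive from the four cells of a binary confusion matrix. The sketch below is illustrative Python; the function names, the label encoding (1 as the positive class) and the sample labels are assumptions.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def precision_recall_accuracy(actual, predicted, positive=1):
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Hypothetical labels, chosen only for illustration.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))
print(precision_recall_accuracy(actual, predicted))
```

Accuracy alone can mislead on imbalanced classes, which is why precision and recall are usually reported alongside it.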

Unit III - CLUSTERING AND OUTLIER ANALYSIS Unsupervised learning, K-means algorithm, other techniques, Interpretation of clusters and validation. Introduction to outlier mining, Applications, Detection Techniques

UNIT IV – ASSOCIATION MINING Association rule mining, Apriori algorithm, Market Basket Analysis, Associative Classification
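
Association rules are scored by support and confidence, the two measures that the Apriori algorithm thresholds on. The sketch below computes only these measures and is not an implementation of Apriori itself; it is illustrative Python, and the function names and the toy baskets are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets, chosen only for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

Apriori uses the fact that every subset of a frequent itemset must itself be frequent, so candidate itemsets can be pruned level by level before their support is counted.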

Unit V - TEXT ANALYTICS Introduction, text mining operations, Categorization, Clustering, Information extraction, Text mining applications

Unit VI – BUSINESS INTELLIGENCE What is a data warehouse, need for a data warehouse, architecture, Data Integration, data marts, OLTP vs OLAP, Multidimensional Modeling: Star and snowflake schema, Data cubes, Enterprise Reporting, OLAP operations, Data Cube Computation and Data Generalization, Data lake, Recent trends

[Note: analysis using tools like R / SCILAB / WEKA / MEKA]