Course Hand Out
BVRIT HYDERABAD College of Engineering for Women
Department of Computer Science and Engineering

Subject Name: Predictive Analytics
Prepared by: G.Shanti, B.NagaVeni, Dr.G.Naga Satish
Year, Sem, Regulation: IV Year - II Sem (R15)

UNIT-I: Introduction to Predictive Analytics & Linear Regression

Introduction:
Predictive Analytics is the art of predicting the future on the basis of past trends. It is a branch of Statistics that draws on modelling techniques, machine learning and data mining, and it is used primarily in decision making.

What and why analytics:
Analytics is a journey: a combination of skills, advanced technologies, applications, and processes used by a firm to gain business insights from data and statistics in support of business planning.

Reporting vs Analytics:
Reporting is the presentation of the results of data analysis; analytics is the process (or system) of analysing data to obtain a desired output.

Introduction to tools and environment:
Analytics is nowadays used in fields ranging from medical science to aerospace to government activities. Data science and analytics are used by manufacturing companies as well as real-estate firms to develop their business and to solve problems with the help of historical databases. Tools are the software packages that can be used for analytics, such as SAS or R; techniques are the procedures followed to reach a solution.

The steps involved in analytics are:
1. Access
2. Manage
3. Analyze
4. Report

The main analytics techniques are:
1. Data Preparation
2. Reporting, Dashboards & Visualization
3. Segmentation
4. Forecasting
5. Descriptive Modelling
6. Predictive Modelling

Application of Modelling in Business:
A statistical model embodies a set of assumptions concerning the generation of the observed data, and of similar data from a larger population; it represents, often in considerably idealized form, the data-generating process. Signal processing is an enabling technology that encompasses the fundamental theory, applications, algorithms, and implementations of processing or transferring information contained in many different physical, symbolic, or abstract formats broadly designated as signals. It uses mathematical, statistical, computational, heuristic, and linguistic representations, formalisms, and techniques for representation, modelling, analysis, synthesis, discovery, recovery, sensing, acquisition, extraction, learning, security, or forensics. In manufacturing, statistical models are used to define warranty policies, to solve conveyor-related issues, for Statistical Process Control, and so on.

Databases & types of data and variables:
A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format". Data can be categorized along several dimensions. Broadly, data is of two types, numeric and character; numeric data can be further divided into discrete and continuous. Categorical data can in turn be divided into nominal and ordinal. Finally, based on usage, data is divided into quantitative and qualitative. The R sketch below illustrates these types.
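A minimal sketch of these variable types in base R (the example values are hypothetical):

    age   <- c(21, 25, 30)                    # numeric, continuous
    kids  <- c(0L, 2L, 1L)                    # numeric, discrete (integer)
    name  <- c("Asha", "Ravi", "Meena")       # character
    city  <- factor(c("Hyd", "Pune", "Hyd"))  # categorical, nominal
    grade <- factor(c("B", "A", "C"),
                    levels = c("C", "B", "A"),
                    ordered = TRUE)           # categorical, ordinal
    str(age); str(city); str(grade)           # inspect how R stores each one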
Data Modelling Techniques Overview:
Regression analysis mainly focuses on finding a relationship between a dependent variable and one or more independent variables. It predicts the value of the dependent variable from the value of at least one independent variable, and it explains the impact of changes in an independent variable on the dependent variable:

    Y = f(X, β)

where Y is the dependent variable, X is the independent variable, and β denotes the unknown coefficients.

Types of Regression Models:
These include, among others, the linear regression covered in this unit and the logistic regression covered in Unit-II.

Missing Imputations:
In R, missing values are represented by the symbol NA (not available), while impossible values (e.g., the result of division by zero) are represented by NaN (not a number). Unlike SAS, R uses the same symbol for character and numeric data. (A short imputation sketch appears before the review questions below.)

LINEAR REGRESSION
Linear regression is a common technique for determining how one variable of interest is affected by another. It is used for three main purposes:
1. describing the linear dependence of one variable on the other;
2. predicting values of one variable from the other, where the latter has more data;
3. correcting for the linear dependence of one variable on the other.
A line is fitted through the plotted data:

    Y = α + βX + ε

where α is the intercept coefficient, β is the slope coefficient, and ε is the residual (error) term.

Cluster Analysis:
Cluster analysis is the process of forming groups of related variables for the purpose of drawing conclusions from the similarities within each group. The greater the similarity within a group, and the greater the difference between groups, the more distinct the clustering.

Time Series:
Time-series data is an ordered sequence of observations of a quantitative variable measured over equally spaced time intervals. Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering, and many other fields. Time-series analysis is the analysis of such data. Short R sketches of the Unit-I techniques follow, before the review questions.
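A minimal sketch of the missing-value handling described above, using mean imputation on a hypothetical vector (base R only; mean imputation is just one of many strategies, chosen here for brevity):

    x <- c(12, NA, 7, NA, 10)             # NA marks the missing entries
    is.na(x)                              # logical vector flagging the gaps
    x[is.na(x)] <- mean(x, na.rm = TRUE)  # replace each NA with the mean
    x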
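A minimal least-squares fit of Y = α + βX + ε on simulated data (the true α = 3 and β = 2 are arbitrary choices for the simulation):

    set.seed(1)
    X <- 1:50
    Y <- 3 + 2 * X + rnorm(50, sd = 4)  # known line plus random noise (ε)
    fit <- lm(Y ~ X)                    # fit the line by least squares
    coef(fit)                           # estimated intercept (α) and slope (β)
    predict(fit, data.frame(X = 60))    # predict Y for a new X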
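A minimal cluster-analysis sketch with k-means (one of several clustering methods) on simulated two-group data:

    set.seed(2)
    pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                 matrix(rnorm(40, mean = 5), ncol = 2))  # two well-separated groups
    km  <- kmeans(pts, centers = 2)  # ask for two clusters
    km$cluster                       # cluster label assigned to each point
    km$centers                       # the two group centres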
QUESTIONS:
1. What is linear regression?
2. What are the steps involved in the analysis of data?
3. What is the meaning of epsilon (ε) in the regression equation?
4. What is cluster analysis of data?
5. Is clustering a data-mining technique?
6. How do you impute missing data in R?

UNIT-II: Logistic Regression

Logistic regression (the logit model) is a regression model in which the dependent variable (DV) is categorical. It was developed by the statistician David Cox in 1958.

OLS and MLE:
OLS stands for Ordinary Least Squares; MLE stands for Maximum Likelihood Estimation.

Ordinary least squares, also called linear least squares, is a method for estimating the unknown parameters of a linear regression model. The OLS estimates are obtained by minimizing the sum of the squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation. For a single regressor on the right-hand side of the model, the resulting slope estimator has a simple closed form:

    β̂ = Σ (xᵢ - x̄)(yᵢ - ȳ) / Σ (xᵢ - x̄)²

Maximum likelihood estimation is a method for estimating the parameters of a statistical model, i.e., for fitting the model to data: it chooses the parameter values under which the observed data are most probable. Suppose, for example, you want to characterize the heights of the basketball players in a particular location but, because of cost and time constraints, cannot measure all of them. Using maximum likelihood estimation on the players you can measure, you can estimate the mean and variance of the heights: the MLE treats the mean and variance as the unknown parameters of the assumed parametric model and picks the values that best explain the measured sample.

Error Matrix:
A confusion matrix, also known as a contingency table or error matrix, is a table layout that allows visualization of the performance of an algorithm, typically a supervised-learning one (in unsupervised learning it is usually called a matching matrix). Each column of the matrix represents the instances of a predicted class, while each row represents the instances of an actual class (or vice versa). The name stems from the fact that the matrix makes it easy to see whether the system is confusing two classes (i.e., commonly mislabelling one as the other). For a two-class problem, the table of confusion has two rows and two columns and reports the numbers of false positives, false negatives, true positives, and true negatives. This allows a more detailed analysis than the mere proportion of correct guesses (accuracy): accuracy by itself is not a reliable measure of real performance, because it yields misleading results when the data set is unbalanced (that is, when the numbers of samples in the different classes vary greatly).

Tree Building:
Decision-tree learning is the construction of a decision tree from class-labelled training tuples. A decision tree is a flow-chart-like structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of that test, and each leaf (terminal) node holds a class label. The topmost node is the root node.

Decision tree:
Decision-tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable from several input variables. Each interior node corresponds to one of the input variables, with an edge to a child node for each of its possible values; each leaf represents a value of the target variable given the values of the input variables along the path from the root to that leaf. Decision trees have several advantages:
1. They are simple to understand and interpret: people can understand decision-tree models after a brief explanation.
2. They require little data preparation: other techniques often require data normalization, creation of dummy variables, and removal of blank values.
3. They can handle both numerical and categorical data, whereas many other techniques are specialized for datasets with only one type of variable.
Two R sketches close this unit: one for tree building, one for logistic regression with an error matrix.
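A minimal tree-building sketch, assuming the rpart package (shipped as a recommended package with standard R distributions) and R's built-in iris data:

    library(rpart)
    # Predict a categorical target (Species) from numeric inputs.
    tree <- rpart(Species ~ ., data = iris, method = "class")
    print(tree)                               # each node shows a test on an attribute
    predict(tree, iris[1, ], type = "class")  # classify by following a root-to-leaf path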
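Finally, a sketch combining the Unit-II ideas: a logistic regression fitted (by maximum likelihood) with R's glm, then evaluated with a confusion matrix. The binary target derived from the iris data is an illustrative choice, not taken from the handout:

    d <- transform(iris, virginica = as.integer(Species == "virginica"))
    model <- glm(virginica ~ Petal.Length + Petal.Width,
                 data = d, family = binomial)    # logit model, fitted by MLE
    prob <- predict(model, type = "response")    # P(virginica) for each row
    pred <- ifelse(prob > 0.5, 1, 0)             # threshold at 0.5
    table(actual = d$virginica, predicted = pred)  # 2 x 2 error matrix

The off-diagonal cells of the resulting table are the false negatives and false positives discussed in the Error Matrix section.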