TKR College of Engineering and Technology Department of Computer Science and Engineering PE4-S1- INTRODUCTION TO ANALYTICS

Model Questions -1 Part-A
1. List the Data types in R
• Logical • Numeric • Integer • Complex • Character • Raw
2. Explain R loops
A loop statement allows us to execute a statement or group of statements multiple times.
• repeat loop - Executes a sequence of statements repeatedly; the exit condition is tested inside the loop body, so the body runs at least once.
• while loop - Repeats a statement or group of statements while a given condition is true. It tests the condition before executing the loop body.
• for loop - Executes the loop body once for each element of a vector or sequence.
Control statements:
• break statement - Terminates the loop and transfers execution to the statement immediately following the loop.
• next statement - Skips the remainder of the current iteration and moves on to the next iteration of the loop.
3. What is a Data Frame?
• Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain a different mode of data: the first column can be numeric while the second column can be character and the third column can be logical. A data frame is a list of vectors of equal length. Data frames are created using the data.frame() function and display data along with header information.
• A data frame is used for storing data tables; it is a list of vectors of equal length.

4. Explain the concept of Reading Datasets
We can import datasets from various sources and file types, for example:
• .csv or .txt format
• Big data tool – Impala
CSV File: The sample data can also be in comma separated values (CSV) format. Each cell inside such a data file is separated by a special character, which usually is a comma, although other characters can be used as well. The first row of the data file should contain the column names instead of the actual data. Here is a sample of the expected format.
Col1,Col2,Col3
100,a1,b1
200,a2,b2
300,a3,b3
After we copy and paste the data above into a file named "mydata.csv" with a text editor, we can read the data with the function read.csv. In R, data can be read in two ways: from local disk or from the web.
From disk: if the data file location on local disk is known, use the read.csv() or read.table() functions; if the path is not specified, use file.choose().
> mydata = read.csv("mydata.csv") # read csv file
> mydata
From web: the URL of the data is passed to the read.csv() or read.table() functions.
5. What is R? Why use R? Justify
R is a flexible and powerful open-source implementation of the language S (for statistics) developed by John Chambers and others at Bell Labs.
Five reasons to learn and use R:
• R is open source and completely free. R community members regularly contribute packages to increase R's functionality.
• R is as good as commercially available statistical packages like SPSS, SAS, and Minitab.
• R has extensive statistical and graphing capabilities. R provides hundreds of built-in statistical functions as well as its own built-in programming language.
• R is used in teaching and performing computational statistics. It is the language of choice for many academics who teach computational statistics.
• Getting help from the R user community is easy. There are readily available online tutorials, data sets, and discussion forums about R.
6. Write short notes on R looping and control statements.
A loop statement allows us to execute a statement or group of statements multiple times.
• repeat loop - Executes a sequence of statements repeatedly; the exit condition is tested inside the loop body.
• while loop - Repeats a statement or group of statements while a given condition is true. It tests the condition before executing the loop body.
• for loop - Executes the loop body once for each element of a vector or sequence.
Control statements:
• break statement - Terminates the loop and transfers execution to the statement immediately following the loop.
• next statement - Skips the remainder of the current iteration and moves on to the next iteration of the loop.
7. What are the challenges in analysis of data?
Data analytics is extremely important for risk managers. It improves decision-making, increases accountability, benefits financial health, and helps employees predict losses and monitor performance. Common challenges include:
• The amount of data being collected
• Collecting meaningful and real-time data
• Visual representation of data
• Data from multiple sources
• Inaccessible data
• Poor quality data
• Pressure from the top
• Lack of support
• Confusion or anxiety
• Budget
• Shortage of skills
• Scaling data analysis

8. Justify R as a data analytics software
• R allows practicing a wide variety of statistical and graphical techniques like linear and nonlinear modeling, time-series analysis, classification, classical statistical tests, clustering, etc. R is a highly extensible and easy-to-learn language.
• R has extensive statistical and graphing capabilities. R provides hundreds of built-in statistical functions as well as its own built-in programming language.
• R is used in teaching and performing computational statistics. It is the language of choice for many academics who teach computational statistics.

9. What is a Data set?
A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question. The data set lists values for each of the variables, such as height and weight of an object, for each member of the data set. Each value is known as a datum. Data sets can also consist of a collection of documents or files.
10. What is an R Script? Give example.
• An R script is simply a text file containing (almost) the same commands that you would enter on the command line of R. "Almost" refers to the fact that if you are using sink() to send the output to a file, you will have to enclose some commands in print() to get the same output as on the command line.
Example: The R script(s) and data view. The R script is where you keep a record of your work.
To create a new R script file: 1) File -> New -> R Script, 2) Click on the icon with the "+" sign and select "R Script", or 3) Use the shortcut Ctrl+Shift+N.
------Part-B

1. What is the difference between an array and a matrix? Explain with examples.
Arrays
An array object (or simply array) contains a collection of elements of the same type, each of which is indexed (i.e., identified) by a number, or in the multi-dimensional case by one number per dimension. To use an array in R we: 1. construct the array object, specifying its dimensions (the number of elements along each dimension); 2. access the elements of the array through their indices in order to assign or obtain their values (as if they were single variables).
Matrix
A matrix is a collection of elements of the same type, organized in the form of a table. Each element is indexed by a pair of numbers that identify the row and the column of the element. In R, a matrix is simply an array with exactly two dimensions and is created with the matrix() function.

• We have two different options for constructing matrices or arrays: either we use the creator functions matrix() and array(), or we simply change the dimensions of an existing vector using the dim() function. For example, you make an array with four columns, three rows, and two "tables" like this:
> my.array <- array(1:24, dim=c(3,4,2))
In the above example, "my.array" is the name we have given the array, and "<-" is the assignment operator. There are 24 units in this array, written "1:24", and they are divided over three dimensions "(3, 4, 2)". Although the rows are given as the first dimension, the tables are filled column-wise. So, for arrays, R fills the columns, then the rows, and then the rest.
Alternatively, you could just add the dimensions using the dim() function. This is a little hack that goes a bit faster than using the array() function; it's especially useful if you have your data already in a vector. (This little trick also works for creating matrices, by the way, because a matrix is nothing more than an array with only two dimensions.) Say you already have a vector with the numbers 1 through 24, like this:
> my.vector <- 1:24
You can easily convert that vector to an array exactly like my.array simply by assigning the dimensions, like this:
> dim(my.vector) <- c(3,4,2)
Arrays can be of any number of dimensions. The array function takes a dim attribute which creates the required number of dimensions.
Matrix
A matrix is a collection of elements of the same type, organized in the form of a table. Each element is indexed by a pair of numbers that identify the row and the column of the element. A matrix is created in R with the matrix() function:
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow=2, ncol=3, byrow = TRUE)
print(M)
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
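As a brief illustrative sketch (not part of the original answer), the structures created above can be compared and their elements accessed by index; the values in the comments are simply what R returns for these objects:
# Build the array two ways and confirm the results are identical
my.array <- array(1:24, dim = c(3, 4, 2))
my.vector <- 1:24
dim(my.vector) <- c(3, 4, 2)
identical(my.array, my.vector)   # TRUE - both are 3x4x2 arrays
my.array[2, 3, 1]                # row 2, column 3, table 1 -> 8
# A matrix is just a two-dimensional array
M2 <- matrix(1:6, nrow = 2, ncol = 3)
M2[2, 3]                         # row 2, column 3 -> 6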

2) Explain the procedure to create an R script to combine two data sets.
Combining data sets in R
merge(): merges two data frames (data sets) horizontally. In most cases, we join two data frames by one or more common key variables (i.e., an inner join).
To merge two data frames by ID: total <- merge(dataframeA, dataframeB, by="ID")
To merge on more than one criterion, e.g. to merge two data frames by ID and Country: total <- merge(dataframeA, dataframeB, by=c("ID","Country"))
To join two data frames (data sets) vertically, use the rbind() function. The two data frames must have the same variables, but they do not have to be in the same order. Example: total <- rbind(dataframeA, dataframeB)
plyr package: tools for splitting, applying and combining data. We use rbind.fill() from the plyr package in R. It binds or combines a list of data frames, filling missing columns with NA. Example: rbind.fill(mtcars[c("mpg", "wt")], mtcars[c("wt", "cyl")]). Here all the missing values will be filled with NA.

Example4: To create a matrix having the data 6, 2, 10 & 1, 3, -2
Step 1: create two vectors xr1, xr2
> xr1 <- c( 6, 2, 10)
> xr2 <- c(1, 3, -2)
> x <- rbind (xr1, xr2) ## binds the vectors into rows of a matrix (2x3)
> x
[,1] [,2] [,3]
xr1 6 2 10
xr2 1 3 -2
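As an illustrative addition, a small hedged sketch of the merge() and rbind() calls described above, using two toy data frames (dfA, dfB and dfC are hypothetical names, not from the original answer):
dfA <- data.frame(ID = c(1, 2, 3), Score = c(85, 90, 78))
dfB <- data.frame(ID = c(2, 3, 4), Grade = c("A", "B", "C"))
# Horizontal combination: inner join on the common key ID
merge(dfA, dfB, by = "ID")       # keeps only the rows with ID 2 and 3
# Vertical combination: both data frames must have the same variables
dfC <- data.frame(ID = c(5, 6), Score = c(70, 88))
rbind(dfA, dfC)                  # five rows, columns ID and Score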



3) What is RStudio?. Explain its features R Studio is an Integrated Development Environment (IDE) for R Language with advanced and more user-friendly GUI. R Studio allows the user to run R in a more user-friendly environment. It is open-source (i.e. free) and available at http://www.rstudio.com/

The R Studio screen has four windows: 1. Console. 2. Workspace and history. 3. Files, plots, packages and help. 4. The R script(s) and data view.
The R script is where you keep a record of your work. To create a new R script file: 1) File -> New -> R Script, 2) Click on the icon with the "+" sign and select "R Script", or 3) Use the shortcut Ctrl+Shift+N.
Console: The console is where you can type commands and see output.
Workspace tab: The workspace tab shows all the active objects. It stores any object, value, function or anything you create during your R session. In the example below, if you click on the dotted squares you can see the data on a screen to the left.
R itself is:
• A programming language for graphics and statistical computations
• Available freely under the GNU public license
• Used in data mining and statistical analysis
• Includes time series analysis, linear and nonlinear modeling among others
• Supported by a very active community and package contributions
• Usable with very little programming language knowledge
• Downloadable from http://www.r-project.org/ (open source)

4) How do you understand learning objectives? Explain

Understanding Learning objectives, Introduction to work & meeting requirements, Time Management, Work management & prioritization, Quality & Standards Adherence.
Understanding Learning objectives: The benefits of this course include:
• Efficient and Effective time management
• Efficient – Meeting timelines
• Effective – Meeting requirement for desired output
• Awareness of the SSC (Sector Skill Council) environment and time zone understanding
• Awareness of the SSC environment and importance of meeting timelines to handoffs
Review the course objectives listed above. To fulfil these objectives today, we'll be conducting a number of hands-on activities. Hopefully we can open up some good conversations and some of you can share your experiences so that we can make this session as interactive as possible. Your participation will be crucial to your learning experience and that of your peers here in the session today.
5) Explain R functions and R loops with examples
A function is a set of statements organized together to perform a specific task. R has a large number of in-built functions and the user can create their own functions. In R, a function is an object, so the R interpreter is able to pass control to the function, along with arguments that may be necessary for the function to accomplish the actions. The function in turn performs its task and returns control to the interpreter, as well as any result, which may be stored in other objects.
Function Definition
An R function is created by using the keyword function. The basic syntax of an R function definition is as follows:
function_name <- function(arg_1, arg_2, ...) {
  Function body
}
Function Components
The different parts of a function are −

 Function Name − This is the actual name of the function. It is stored in R environment as an object with this name.

 Arguments − An argument is a placeholder. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may contain no arguments. Also arguments can have default values.

 Function Body − The function body contains a collection of statements that defines what the function does.

 Return Value − The return value of a function is the last expression in the function body to be evaluated.
R has many in-built functions which can be directly called in the program without defining them first. We can also create and use our own functions, referred to as user-defined functions.
Built-in Functions
Simple examples of in-built functions are seq(), mean(), max(), sum(x) and paste(...) etc. They are directly called by user-written programs.
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))
# Find mean of numbers from 25 to 82.
print(mean(25:82))
# Find sum of numbers from 41 to 68.
print(sum(41:68))
When we execute the above code, it produces the following result −
[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526
User-defined Functions
We can create user-defined functions in R. They are specific to what a user wants and once created they can be used like the built-in functions. Below is an example of how a function is created and used.
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
  for(i in 1:a) {
    b <- i^2
    print(b)
  }
}
# Calling the function, e.g. new.function(6), prints the squares 1, 4, 9, 16, 25, 36.
R Loops
There may be a situation when you need to execute a block of code several times. In general, statements are executed sequentially: the first statement in a function is executed first, followed by the second, and so on. Programming languages provide various control structures that allow for more complicated execution paths. A loop statement allows us to execute a statement or group of statements multiple times; the loop forms available in R are described below.

1 repeat loop
Executes a sequence of statements repeatedly; the exit condition is tested inside the loop body, so the body must contain a break statement.
v <- c("Hello","loop")
cnt <- 2
repeat {
  print(v)
  cnt <- cnt+1
  if(cnt > 5) {
    break
  }
}
When the above code is compiled and executed, it produces the following result −
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"

2 while loop
Repeats a statement or group of statements while a given condition is true. It tests the condition before executing the loop body.
v <- c("Hello","while loop")
cnt <- 2
while (cnt < 7) {
  print(v)
  cnt = cnt + 1
}
When the above code is compiled and executed, it produces the following result −
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"

3 for loop
Executes the loop body once for each element of a vector or sequence, so there is no separate loop-variable bookkeeping to write.
v <- LETTERS[1:4]
for ( i in v) {
  print(i)
}
When the above code is compiled and executed, it produces the following result −
[1] "A"
[1] "B"
[1] "C"
[1] "D"

Loop Control Statements
Loop control statements change execution from its normal sequence. When execution leaves a scope, all automatic objects that were created in that scope are destroyed. R supports the following control statements.

1 break statement
Terminates the loop statement and transfers execution to the statement immediately following the loop.
2 next statement
Skips the remainder of the current iteration and moves control to the next iteration of the loop (similar to continue in other languages).
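A short hedged sketch (the vector v below is only illustrative) showing break and next used inside a for loop:
v <- 1:10
for (i in v) {
  if (i %% 2 == 0) next   # skip even numbers and move to the next iteration
  if (i > 7) break        # stop the loop entirely once i exceeds 7
  print(i)                # prints 1, 3, 5, 7
}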

6) Explain briefly about the types of Datasets with their syntax and example
In contrast to other programming languages like C and Java, in R the variables are not declared as some data type. The variables are assigned with R-objects and the data type of the R-object becomes the data type of the variable. There are many types of R-objects. The frequently used ones are −

• Vectors • Lists • Matrices • Arrays • Factors • Data Frames
The simplest of these objects is the vector object and there are six data types of these atomic vectors, also termed the six classes of vectors. The other R-objects are built upon the atomic vectors.
Data types: • Logical • Numeric • Integer • Complex • Character • Raw

When you want to create a vector with more than one element, you should use the c() function, which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
# Get the class of the vector.
print(class(apple))
When we execute the above code, it produces the following result −
[1] "red" "green" "yellow"
[1] "character"
Lists
A list is an R-object which can contain many different types of elements inside it, like vectors, functions and even another list.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
When we execute the above code, it produces the following result −
[[1]]
[1] 2 5 3

[[2]]
[1] 21.3

[[3]]
function (x) .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
Arrays
While matrices are confined to two dimensions, arrays can be of any number of dimensions. The array function takes a dim attribute which creates the required number of dimensions. In the example below we create an array with two elements which are 3x3 matrices each.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result −
, , 1

[,1] [,2] [,3] [1,] "green" "yellow" "green" [2,] "yellow" "green" "yellow" [3,] "green" "yellow" "green"

, , 2

[,1] [,2] [,3] [1,] "yellow" "green" "yellow" [2,] "green" "yellow" "green" [3,] "yellow" "green" "yellow"

7) Write short notes on Outliers. Discuss in detail with an example.
Normally we use a box plot and a scatter plot to find outliers from a graphical representation.
An outlier is a point or an observation that deviates significantly from the other observations.
Reasons for outliers: experimental errors or "special circumstances".
Outlier detection tests are used to check for outliers. Outlier treatments are of three types:
• Retention
• Exclusion
• Other treatment methods
The outliers package in R can be used to detect and treat outliers in data.
Outlier detection from graphical representation: scatter plot and box plot.
Example
# Inject outliers into data.
cars1 <- cars[1:30, ] # original data
cars_outliers <- data.frame(speed=c(19,19,20,20,20), dist=c(190, 186, 210, 220, 218)) # introduce outliers.
cars2 <- rbind(cars1, cars_outliers) # data with outliers
# Plot of data with outliers.
par(mfrow=c(1, 2))
plot(cars2$speed, cars2$dist, xlim=c(0, 28), ylim=c(0, 230), main="With Outliers", xlab="speed", ylab="dist", pch="*", col="red", cex=2)
abline(lm(dist ~ speed, data=cars2), col="blue", lwd=3, lty=2)
# Plot of original data without outliers. Note the change in slope (angle) of the best fit line.
plot(cars1$speed, cars1$dist, xlim=c(0, 28), ylim=c(0, 230), main="Outliers removed \n A much better fit!", xlab="speed", ylab="dist", pch="*", col="red", cex=2)
abline(lm(dist ~ speed, data=cars1), col="blue", lwd=3, lty=2)
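Beyond the plots, a hedged numeric check can use boxplot.stats(), whose $out component lists the points outside the box-plot whiskers; cars2 is the data frame with injected outliers created above:
# Values flagged by the box-plot rule (beyond 1.5 * IQR from the quartiles)
outlier_values <- boxplot.stats(cars2$dist)$out
print(outlier_values)
# Visual confirmation with a box plot
boxplot(cars2$dist, main = "Distances with outliers", boxwex = 0.3)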

8) Explain about Quality and Standard Adherence
Adherence to Quality Standards. Licensee agrees that the nature and quality of all goods and services provided by Licensee in connection with the use of the Intellectual Property shall conform to the standards set by Licensee for its own goods and services ("Quality Standards").
Quality means different things to different people, but if you go back to first principles, quality can be defined as 'fitness for purpose'. To elaborate, it is really around the surety of a product or a service fulfilling what it is intended to do.

As a global leader in providing quality, safety and testing services for over 130 years globally, at Intertek we have a thorough understanding of the parameters that can help businesses achieve highest level of quality assurance throughout the supply chain

Quality standards are defined as documents that provide requirements, specifications, guidelines, or characteristics that can be used consistently to ensure that materials, products, processes, and services are fit for their purpose.

Standards provide organizations with the shared vision, understanding, procedures, and vocabulary needed to meet the expectations of their stakeholders. Because standards present precise descriptions and terminology, they offer an objective and authoritative basis for organizations and consumers around the world to communicate and conduct business

9) Discuss the following
a) Data types in R. Give examples.
In analytics, data is classified at a very broad level as Quantitative (numeric) and Qualitative (character/factor).
• Numeric data includes the digits 0-9, the decimal point "." and the negative "-" sign.
• Character data is everything except numeric data, for example names, gender etc.
For example, "1, 2, 3…" are quantitative data while "Good", "Bad" etc. are qualitative data. We can convert qualitative data into quantitative data using ordinal values; for example, "Good" can be rated as 9 while "Average" can be rated as 5 and "Bad" can be rated as 0.
Data Type – Example – Verify
Logical – TRUE, FALSE – v <- TRUE; print(class(v)) gives [1] "logical"
Numeric – 12.3, 5, 999 – v <- 23.5; print(class(v)) gives [1] "numeric"
Integer – 2L, 34L, 0L – v <- 2L; print(class(v)) gives [1] "integer"
Complex – 3 + 2i – v <- 2+5i; print(class(v)) gives [1] "complex"
Character – 'a', "good", "TRUE", '23.4' – v <- "TRUE"; print(class(v)) gives [1] "character"
Raw – "Hello" is stored as 48 65 6c 6c 6f – v <- charToRaw("Hello"); print(class(v)) gives [1] "raw"
mode() or class(): these are used to find the type of the data object assigned. Example: assign several different objects and check the class (storage mode) of each object.
# Declare variables of different types:
my_numeric <- 42
my_character <- "forty-two"
my_logical <- FALSE
print(class(my_numeric))   # [1] "numeric"
print(class(my_character)) # [1] "character"
print(class(my_logical))   # [1] "logical"
b) Quality and Standard Adherence [5]
Adherence to Quality Standards. Licensee agrees that the nature and quality of all goods and services provided by Licensee in connection with the use of the Intellectual Property shall conform to the standards set by Licensee for its own goods and services ("Quality Standards").
Quality means different things to different people, but if you go back to first principles, quality can be defined as 'fitness for purpose'. To elaborate, it is really around the surety of a product or a service fulfilling what it is intended to do.

As a global leader in providing quality, safety and testing services for over 130 years globally, at Intertek we have a thorough understanding of the parameters that can help businesses achieve highest level of quality assurance throughout the supply chain

Quality standards are defined as documents that provide requirements, specifications, guidelines, or characteristics that can be used consistently to ensure that materials, products, processes, and services are fit for their purpose. Standards provide organizations with the shared vision, understanding, procedures, and vocabulary needed to meet the expectations of their stakeholders. Because standards present precise descriptions and terminology, they offer an objective and authoritative basis for organizations and consumers around the world to communicate and conduct business.

10) Articulate briefly the concept of Reading Datasets and the types of Datasets with their syntax and examples
We can import datasets from various sources and file types, for example:
• .csv or .txt format
• Big data tool – Impala
CSV File: The sample data can also be in comma separated values (CSV) format. Each cell inside such a data file is separated by a special character, which usually is a comma, although other characters can be used as well. The first row of the data file should contain the column names instead of the actual data. Here is a sample of the expected format.
Col1,Col2,Col3
100,a1,b1
200,a2,b2
300,a3,b3
After we copy and paste the data above into a file named "mydata.csv" with a text editor, we can read the data with the function read.csv. In R, data can be read in two ways: from local disk or from the web.
From disk: if the data file location on local disk is known, use the read.csv() or read.table() functions; if the path is not specified, use file.choose().
> mydata = read.csv("mydata.csv") # read csv file
> mydata
From web: the URL of the data is passed to the read.csv() or read.table() functions.
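A short hedged sketch of the two remaining reading styles mentioned above; the web address shown is only a placeholder, not a real data source:
> mydata <- read.csv(file.choose())   # browse for the file when the path is not known
> webdata <- read.csv("http://example.com/mydata.csv")   # pass the URL directly
> head(webdata)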

Generally, while doing programming in any programming language, you need to use various variables to store various information. Variables are nothing but reserved memory locations to store values. This means that, when you create a variable, you reserve some space in memory. You may like to store information of various data types like character, wide character, integer, floating point, double floating point, Boolean etc. Based on the data type of a variable, the interpreter allocates memory and decides what can be stored in the reserved memory. In contrast to other programming languages like C and Java, in R the variables are not declared as some data type. The variables are assigned with R-objects and the data type of the R-object becomes the data type of the variable. There are many types of R-objects. The frequently used ones are

• Vectors • Lists • Matrices • Arrays • Factors • Data Frames
The simplest of these objects is the vector object and there are six data types of these atomic vectors, also termed the six classes of vectors. The other R-objects are built upon the atomic vectors.
Data Type – Example – Verify

Logical TRUE, FALSE v <- TRUE print(class(v)) it produces the following result − [1] "logical"

Numeric 12.3, 5, 999 v <- 23.5 print(class(v)) it produces the following result − [1] "numeric"

Integer 2L, 34L, 0L v <- 2L print(class(v)) it produces the following result − [1] "integer"

Complex 3 + 2i v <- 2+5i print(class(v)) it produces the following result − [1] "complex"

Character 'a', "good", "TRUE", '23.4' v <- "TRUE" print(class(v)) it produces the following result − [1] "character"

Raw "Hello" is stored as 48 65 6c 6c 6f v <- charToRaw("Hello") print(class(v)) it produces the following result − [1] "raw" • In R programming, the very basic data types are the R-objects called vectors which hold elements of different classes as shown above. Please note in R the number of classes is not confined to only the above six types. For example, we can use many atomic vectors and create an array whose class will become array

11 a) Explain R data frames with example
Data Frames: Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain a different mode of data. The first column can be numeric while the second column can be character and the third column can be logical. It is a list of vectors of equal length. Data frames are created using the data.frame() function. It displays data along with header information.
To retrieve data in a particular cell: enter its row and column coordinates in the single square bracket "[ ]" operator, as dataframe[row, column]. Example: to retrieve the cell value from the first row, second column of mtcars:
> mtcars[1,2]
# Create the data frame.
> BMI <- data.frame(gender = c("Male", "Male","Female"), height = c(152, 171.5, 165), weight = c(81,93, 78), Age = c(42,38,26))
> print(BMI)
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
Data1:
Height GPA
66 3.80
62 3.78
63 3.88
70 3.72
74 3.69
> student.ht <- c( 66, 62, 63, 70, 74)
> student.gpa <- c( 3.80, 3.78, 3.88, 3.72, 3.69)
> student.data1 <- data.frame(student.ht, student.gpa)
> student.data1
student.ht student.gpa
1 66 3.80
2 62 3.78
3 63 3.88
4 70 3.72
5 74 3.69
> plot(student.ht, student.gpa)
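A brief hedged sketch of retrieving rows and columns from the BMI data frame created above:
BMI$height                   # the height column as a vector: 152.0 171.5 165.0
BMI[1, ]                     # the first row (all columns)
BMI[ , c("gender", "Age")]   # only the gender and Age columns
subset(BMI, Age > 30)        # rows where Age is greater than 30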

b) Write an R program for matrix multiplication
R has two multiplication operators for matrices. The first is denoted by *, which is the same as a simple multiplication sign; this operation does a simple element-by-element multiplication of the two matrices. The second operator is denoted by %*% and it performs a true matrix multiplication between the two matrices.
# Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)
# Multiply the matrices element by element.
result <- matrix1 * matrix2
cat("Result of multiplication","\n")
print(result)
# Divide the matrices element by element.
result <- matrix1 / matrix2
cat("Result of division","\n")
print(result)

When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] 3 -1 2
[2,] 9 4 6
[,1] [,2] [,3]
[1,] 5 0 3
[2,] 2 9 4
Result of multiplication
[,1] [,2] [,3]
[1,] 15 0 6
[2,] 18 36 24
Result of division
[,1] [,2] [,3]
[1,] 0.6 -Inf 0.6666667
[2,] 4.5 0.4444444 1.5000000
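The %*% operator mentioned above is not used in the program itself; as an illustrative sketch, since matrix1 and matrix2 are both 2x3, one of them has to be transposed for a true matrix product to be defined:
# True matrix multiplication: (2x3) %*% (3x2) gives a 2x2 result
result <- matrix1 %*% t(matrix2)
print(result)
# [1,]  21  5
# [2,]  63 78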

12 a) What are the looping structures available in R?

• In R, functions and loop structures are created using for and while, along with conditional if-else statements. FOR Loop: a for loop is used to repeat an action for every value in a vector. The "for" loop in R: for(i in values){... do something ...}

This for loop consists of the following parts: the keyword for, followed by parentheses; an identifier between the parentheses (in this example we use i, but that can be any object name you like); and the keyword in, which follows the identifier.

Syntax: for (val in sequence) { statement}

Example: To count the number of even numbers in a vector.
x <- c(2,5,3,9,8,11,6)
count <- 0
for (val in x) {
  if(val %% 2 == 0) count = count+1
}
print(count)
Output
[1] 3

While loop while (test_expression) { statement}

Example:
> i <- 1
> while (i < 6) {
+ print(i)
+ i = i+1
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

IF-ELSE statement
if (test_expression1) {
  statement1
} else if (test_expression2) {
  statement2
} else if (test_expression3) {
  statement3
} else
  statement4

Example1:
x <- -5
y <- if(x > 0) 5 else 6
y
[1] 6
Example3:
x <- 0
if (x < 0) {
  print("Negative number")
} else if (x > 0) {
  print("Positive number")
} else
  print("Zero")
Output: [1] "Zero"

b) How do you find outliers in R? [5]
An outlier is a point or an observation that deviates significantly from the other observations.
Reasons for outliers: experimental errors or "special circumstances".
Outlier detection tests are used to check for outliers. Outlier treatments are of three types:
• Retention
• Exclusion
• Other treatment methods
The outliers package in R can be used to detect and treat outliers in data.
Outlier detection from graphical representation: scatter plot and box plot.

Outliers and Missing Data treatment: Missing Values • In R, missing values are represented by the symbol NA (not available). • Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number). • R uses the same symbol for character and numeric data. • To Test missing values: is.na () function

Example:
y <- c(1,2,3,NA)
is.na(y)
[1] FALSE FALSE FALSE TRUE
mean(y) ## arithmetic functions on missing values return NA
[1] NA
To remove missing values: the na.omit() function.
newdata <- na.omit(y)
Alternative method using na.rm=TRUE:
mean(y, na.rm=TRUE)
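A hedged sketch of one common follow-up treatment, replacing each missing value with the mean of the observed values (using the vector y from the example above):
y <- c(1, 2, 3, NA)
y[is.na(y)] <- mean(y, na.rm = TRUE)   # the NA is replaced by 2
print(y)                               # 1 2 3 2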

13 a) What is an array and what is a matrix? Discuss with examples. [5]

Arrays
An array object (or simply array) contains a collection of elements of the same type, each of which is indexed (i.e., identified) by a number, or in the multi-dimensional case by one number per dimension. To use an array in R we: 1. construct the array object, specifying its dimensions (the number of elements along each dimension); 2. access the elements of the array through their indices in order to assign or obtain their values (as if they were single variables).
Matrix
A matrix is a collection of elements of the same type, organized in the form of a table. Each element is indexed by a pair of numbers that identify the row and the column of the element. In R, a matrix is simply an array with exactly two dimensions and is created with the matrix() function.

• We have two different options for constructing matrices or arrays: either we use the creator functions matrix() and array(), or we simply change the dimensions of an existing vector using the dim() function. For example, you make an array with four columns, three rows, and two "tables" like this:
> my.array <- array(1:24, dim=c(3,4,2))
In the above example, "my.array" is the name we have given the array, and "<-" is the assignment operator. There are 24 units in this array, written "1:24", and they are divided over three dimensions "(3, 4, 2)". Although the rows are given as the first dimension, the tables are filled column-wise. So, for arrays, R fills the columns, then the rows, and then the rest.
Alternatively, you could just add the dimensions using the dim() function. This is a little hack that goes a bit faster than using the array() function; it's especially useful if you have your data already in a vector. (This little trick also works for creating matrices, by the way, because a matrix is nothing more than an array with only two dimensions.) Say you already have a vector with the numbers 1 through 24, like this:
> my.vector <- 1:24
You can easily convert that vector to an array exactly like my.array simply by assigning the dimensions, like this:
> dim(my.vector) <- c(3,4,2)
Arrays can be of any number of dimensions. The array function takes a dim attribute which creates the required number of dimensions.
Matrix
A matrix is a collection of elements of the same type, organized in the form of a table. Each element is indexed by a pair of numbers that identify the row and the column of the element. A matrix is created in R with the matrix() function:
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow=2, ncol=3, byrow = TRUE)
print(M)
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"

b) Explain the procedure to create an R script to combine two data sets. [5]

Combining Data sets in R • To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).

For example, to merge two data frames by ID: total <- merge(dataframeA, dataframeB, by="ID") • To merge on more than one criterion we pass the argument as follows:

To merge two data frames by ID and Country: total <- merge(dataframeA, dataframeB, by=c("ID","Country")) • To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order. For example, total <- rbind(dataframeA, dataframeB). If dataframeA has variables that dataframeB does not, then either: 1. Delete the extra variables in dataframeA, or 2. Create the additional variables in dataframeB and set them to NA (missing) before joining them with rbind(). We use the cbind() function to combine data by column; the syntax is the same as rbind(). Plyr package: tools for splitting, applying and combining data. We use rbind.fill() from the plyr package in R. It binds or combines a list of data frames, filling missing columns with NA.

For example, rbind.fill(mtcars[c("mpg", "wt")], mtcars[c("wt", "cyl")]). Here all the missing values will be filled with NA.

OR

14 a) What is RStudio? Explain its features. [5]
• R Studio is an Integrated Development Environment (IDE) for the R language with an advanced and more user-friendly GUI. R Studio allows the user to run R in a more user-friendly environment. It is open-source (i.e. free) and available at http://www.rstudio.com/
The R Studio screen has four windows: 1. Console. 2. Workspace and history. 3. Files, plots, packages and help. 4. The R script(s) and data view.
The R script is where you keep a record of your work. To create a new R script file: 1) File -> New -> R Script, 2) Click on the icon with the "+" sign and select "R Script", or 3) Use the shortcut Ctrl+Shift+N.
Console: The console is where you can type commands and see output.
Workspace tab: The workspace tab shows all the active objects. It stores any object, value, function or anything you create during your R session. In the example below, if you click on the dotted squares you can see the data on a screen to the left.
History tab: The history tab shows a list of commands used so far and keeps a record of all previous commands. It helps when testing and running processes. Here you can either save the whole list or you can select the commands you want and send them to an R script to keep track of your work. In this example, we select all and click on the "To Source" icon; a window on the left will open with the list of commands. Make sure to save the 'untitled1' file as an *.R script.
R Studio features:
• RStudio runs on most desktops or on a server and is accessed over the web
• RStudio integrates the tools you use with R into a single environment
• RStudio includes powerful coding tools designed to enhance your productivity
• RStudio enables rapid navigation to files and functions
• RStudio makes it easy to start new or find existing projects
• RStudio has integrated support for Git and Subversion
• RStudio supports authoring HTML, PDF, Word documents, and slide shows
• RStudio supports interactive graphics with Shiny and ggvis

b) How do you understand learning objectives? Explain [5]
Understanding Learning objectives: The benefits of this course include:
• Efficient and Effective time management
• Efficient – Meeting timelines
• Effective – Meeting requirement for desired output
• Awareness of the SSC (Sector Skill Council) environment and time zone understanding
• Awareness of the SSC environment and importance of meeting timelines to handoffs
Review the course objectives listed above. To fulfil these objectives today, we'll be conducting a number of hands-on activities. Hopefully we can open up some good conversations and some of you can share your experiences so that we can make this session as interactive as possible. Your participation will be crucial to your learning experience and that of your peers here in the session today.
Question: Please share your thoughts on the following.
A. Time is perishable – Cannot be created or recovered
B. Managing is the only option – Prioritize
Importance of Time Management. The first part of this session discusses the following:
• Plan better, avoid wastage
• Understand the timelines of the deliverables. Receiving the hand-off from upstream teams at the right time is critical to start self contribution and ensure passing the deliverables to the downstream team.
• It is important to value others' time as well to ensure overall organizational timelines are met
• Share the perspective of how important time is, specifically in a global time zone mapping scenario
Suggested Responses:
• Time management has to be looked at an organizational level and not just an individual level
• These aspects teach us how to build the blocks of time management
------END OF UNIT 1 ------

Model Questions- Unit-2

1. Define Expected value? • The expected value of a random variable is the long-run average value of repetitions of the experiment it represents. Expected value is also known as the expectation, mathematical expectation, EV, mean, or first moment. • Expected value of a discrete random variable is the probability-weighted average of all possible values • Continuous random variables is the sum replaced by an integral and the probabilities by probability densities

2. What are the roles of a Team member?

• Communicate
• Don't Blame Others
• Support Group Members' Ideas
• No Bragging – Don't be full of yourself
• Listen Actively
• Get Involved
• Coach, Don't Demonstrate
• Provide Constructive Criticism
• Try To Be Positive
• Value Your Group's Ideas
3. Define Bivariate Random variable
A random variable, aleatory variable or stochastic variable is a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical sense). A random variable can take on a set of possible different values (similarly to other mathematical variables), each with an associated probability, in contrast to other mathematical variables. A random variable is a real-valued function defined on the points of a sample space.
Bivariate Random Variable: Bivariate Random Variables are those variables having only 2 possible outcomes, for example the flip of a coin. For example, if we toss a coin 10 times and we get heads 8 times, then we cannot say whether the 11th time the coin is tossed we will get a head or a tail, but we are sure that we will get either a head or a tail.

4. Write the difference between Team work and Individual work
Team work vs. Individual work
Team Work:
• Agree on goals/milestones
• Establish tasks to be completed
• Communicate / monitor progress
• Solve problems
• Interpret results
• Agree completion of projects
Individual work:
• Work on tasks
• Work on new / revised tasks
Team Development: Team building is any activity that builds and strengthens the team as a team.
5. What is Probability?
• Probability is the chance of occurrence of an event: P(A) = S/P, where S is the number of favourable (positive) outcomes and P is the population size or total number of outcomes.
• A probability distribution describes how the values of a random variable are distributed.

6. Write short notes on Bivariate Random variables
• A random variable, aleatory variable or stochastic variable is a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical sense). A random variable can take on a set of possible different values (similarly to other mathematical variables), each with an associated probability, in contrast to other mathematical variables. A random variable is a real-valued function defined on the points of a sample space.
Bivariate Random Variable: Bivariate Random Variables are those variables having only 2 possible outcomes, for example the flip of a coin. For example, if we toss a coin 10 times and we get heads 8 times, then we cannot say whether the 11th time the coin is tossed we will get a head or a tail, but we are sure that we will get either a head or a tail.
6. List the types of variables available in R
• R variables are of an R object type and are mostly vectors (lists of data), which can be numeric or text. A variable in R can store an atomic vector, a group of atomic vectors or a combination of many R objects.
• Vectors • Lists • Matrices • Arrays • Factors • Data Frames

7. Define effective communication skills
Effective communication is a mutual understanding of the message. Effective communication is essential to workplace effectiveness. The purpose of building communication skills is to achieve greater understanding and meaning between people and to build a climate of trust, openness, and support. A big part of working well with other people is communicating effectively.
8. Define Continuous Uniform distribution.
The continuous uniform distribution is the probability distribution of random number selection from the continuous interval between a and b. Its density function is f(x) = 1/(b − a) for a ≤ x ≤ b, and f(x) = 0 otherwise.

9. List the objectives of team work and individual work.
Team work vs. Individual work
Team Work:
• Agree on goals/milestones
• Establish tasks to be completed
• Communicate / monitor progress
• Solve problems
• Interpret results
• Agree completion of projects
Individual work:
• Work on tasks
• Work on new / revised tasks
Team Development: Team building is any activity that builds and strengthens the team as a team.
------Part-B

1 a) How do you summarize data in a data set? Discuss with suitable examples. [5]

To summarize data in R Studio we mainly use two functions, summary() and aggregate().
Summary statistics - summarizing data with R:
Example1:
> grass
  rich graze
1 12 mow
2 15 mow
3 17 mow
4 11 mow
5 15 mow
6 8 unmow
7 9 unmow
8 7 unmow
9 9 unmow
a) summary(): gives the summary statistics of a data object in terms of min, max, 1st quartile, 3rd quartile and mean/median values.
> x <- c(1,2,3,4,5,6,7,8,9,10,11,12)
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.75 6.50 6.50 9.25 12.00
> summary(grass)
rich graze
Min. : 7.00 mow :5
1st Qu.: 9.00 unmow:4
Median :11.00
Mean :11.44
3rd Qu.:15.00
Max. :17.00
> summary(graze)
Length Class Mode
9 character character
b) str(): gives the structure of a data object in terms of the class of the object, the number of observations, and each variable's class with sample data.
c) tail(): gives the last 6 observations of the given data object. Example3: > tail(iris) > tail(mtcars)
d) head(): displays the top 6 observations of a dataset.
e) names(): returns the column names.
f) nrow(): returns the number of observations in the given dataset.
g) fix(): opens the given dataset for editing, e.g. > fix(mydFrame1)
h) with(): evaluates an expression using the columns of a data frame directly, so the $ operator is not needed.
i) aggregate(): gives the summary statistic of a specific column with respect to the different levels of a class (factor) attribute: aggregate(x ~ y, data, mean), where x is numeric and y is of factor type.
> aggregate(rich ~ graze, grass, mean)
graze rich
1 mow 14.00
2 unmow 8.25
j) subset(): subsets the data based on a condition: subset(data, x > 7, select = c(x, y)), where x is one of the variables in data and select gives the columns to keep in the specified order.
> subset(grass, rich > 7, select = c(graze, rich))
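A hedged sketch of the remaining inspection functions listed above, applied to the built-in iris and mtcars data sets:
str(iris)                 # structure: 150 observations of 5 variables, with their classes
head(iris)                # first 6 rows
names(iris)               # column names
nrow(iris)                # number of observations: 150
with(mtcars, mean(mpg))   # refer to the mpg column without the $ operator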

b) Explain how you can find the mean, mode and median for the iris data set. [5]

In the iris data set, check whether Sepal Length is normally distributed or not. To find out whether Sepal Length is normally distributed we use two commands, qqnorm() and qqline(). qqnorm() shows the actual distribution of the data, while qqline() shows the line on which the data would lie if they were normally distributed. The deviation of the plot from the line shows that the data is not normally distributed.
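A hedged sketch of this normality check for the Sepal.Length column of the built-in iris data set:
x <- iris$Sepal.Length
qqnorm(x, main = "Normal Q-Q plot of Sepal.Length")  # actual distribution of the data
qqline(x)                                            # reference line for a normal distribution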

Mean: the mean is the average of the numbers (the vector x used here is defined under Mode below).
sum(x)/length(x)
## [1] 5.8
#?mean # function in base R
mean(x)
## [1] 5.8

Median: the middle number, given that the numbers are in order (sorted).
sort(x)
## [1] 1 2 2 2 7 8 8 9 9 10
#?median
median(x)
## [1] 7.5

Mode: the number which appears most often in a set of numbers.
# There is no function in base R to find the mode of a set of numbers
x <- c(8,2,7,1,2,9,8,2,10,9)

# Find the mode using a frequency table
x
## [1] 8 2 7 1 2 9 8 2 10 9
#?table
y <- table(x)
y
## x
## 1 2 7 8 9 10
## 1 3 1 2 2 1
names(y)[which(y==max(y))]
## [1] "2"
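To apply the same three measures to the iris data set itself, the helper below is an illustrative sketch (getmode is a hypothetical name; base R has no built-in mode function):
getmode <- function(v) {
  tab <- table(v)                         # frequency of each distinct value
  as.numeric(names(tab)[which.max(tab)])  # value with the highest frequency
}
mean(iris$Sepal.Length)     # 5.843333
median(iris$Sepal.Length)   # 5.8
getmode(iris$Sepal.Length)  # 5, the most frequent sepal length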

OR
2 a) What are random and bivariate random variables? Explain through examples [5]
Random & Bivariate Random Variables
Random Variable:

• A random variable, aleatory variable or stochastic variable is a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical sense). A random variable can take on a set of possible different values (similarly to other mathematical variables), each with an associated probability, in contrast to other mathematical variables.

• A random variable is a real-valued function defined on the points of a sample space.

• Random variables can be discrete, that is, taking any of a specified finite or countable list of values, endowed with a probability mass function, characteristic of a probability distribution; or continuous, taking any numerical value in an interval or collection of intervals, via a probability density function that is characteristic of a probability distribution; or a mixture of both types. The realizations of a random variable, that is, the results of randomly choosing values according to the variable's probability distribution function, are called random variates. • For Example, • If we toss a coin for 10 times and we get heads 8 times then we cannot say that the 11th time if coin is tossed then we get a head or a tail. But we are sure that we will either get a head or a tail. Bivariate Random Variable:

• Bivariate Random Variables are those variables having only 2 possible outcomes. For example flip of coin

b) Explain Frequentist tests and Bayesian tests [5]
Tests of univariate normality include D'Agostino's K-squared test, the Jarque–Bera test, the Anderson–Darling test, the Cramér–von Mises criterion, the Lilliefors test for normality (itself an adaptation of the Kolmogorov–Smirnov test), the Shapiro–Wilk test, Pearson's chi-squared test, and the Shapiro–Francia test. A 2011 paper from The Journal of Statistical Modeling and Analytics concludes that Shapiro–Wilk has the best power for a given significance, followed closely by Anderson–Darling, when comparing the Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors, and Anderson–Darling tests. Some published works recommend the Jarque–Bera test, but it is not without weakness: it has low power for distributions with short tails, especially for bimodal distributions. Other authors have declined to include its data in their studies because of its poor overall performance. Historically, the third and fourth standardized moments (skewness and kurtosis) were some of the earliest tests for normality. The Jarque–Bera test is itself derived from skewness and kurtosis estimates. Mardia's multivariate skewness and kurtosis tests generalize the moment tests to the multivariate case. Other early test statistics include the ratio of the mean absolute deviation to the standard deviation and of the range to the standard deviation. More recent tests of normality include the energy test (Székely and Rizzo) and the tests based on the empirical characteristic function (ecf), e.g. Epps and Pulley, Henze–Zirkler, and the BHEP test. The energy and the ecf tests are powerful tests that apply for testing univariate or multivariate normality and are statistically consistent against general alternatives. The normal distribution has the highest entropy of any distribution for a given standard deviation. There are a number of normality tests based on this property, the first attributable to Vasicek.

Bayesian tests:

Kullback–Leibler divergences between the whole posterior distributions of the slope and variance do not indicate non-normality. However, the ratio of expectations of these posteriors and the expectation of the ratios give similar results to the Shapiro–Wilk statistic except for very small samples, when non-informative priors are used. Spiegelhalter suggests using a Bayes factor to compare normality with a different class of distributional alternatives. This approach has been extended by Farrell and Rogers-Stewart

3 a) What is Probability? [5]
Probability
Probability is the chance of occurrence of an event: P(A) = S/P, where S is the number of favourable (positive) outcomes and P is the population size or total number of outcomes. A probability distribution describes how the values of a random variable are distributed. For example, the collection of all possible outcomes of a sequence of coin tosses is known to follow the binomial distribution, whereas the means of sufficiently large samples of a data population are known to resemble the normal distribution. Since the characteristics of these theoretical distributions are well understood, they can be used to make statistical inferences about the entire data population as a whole.
For example, consider the probability of drawing the ace of diamonds from a pack of 52 cards when 1 card is pulled out at random. "At random" means that there is no biased treatment of any card and the result is totally random. So:
No. of aces of diamonds in a pack = S = 1
Total no. of possible outcomes = total no. of cards in the pack = 52
Probability of the positive outcome = S/P = 1/52
That is, we have a 1.92% chance of a positive outcome.
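A quick hedged check of this arithmetic in R, together with a simple simulation of repeated single-card draws (the numeric deck labelling is only illustrative, with card 1 standing for the ace of diamonds):
1 / 52                                   # 0.01923..., about 1.92%
draws <- sample(1:52, size = 100000, replace = TRUE)
mean(draws == 1)                         # should be close to 0.0192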

b) What is Professionalism? How do you exhibit Professionalism? [5]
Professionalism
• Professionalism is the competence or set of skills that are expected from a professional.

• Professionalism determines how a person is perceived by his employer, co-workers, and casual acquaintances. How long does it take for someone to form an opinion about you?
• Studies have proved that it takes just six seconds for a person to form an opinion about another person.
How does someone form an opinion about you?
Eye Contact – Maintaining eye contact with a person or the audience says that you are confident. It says that you are someone who can be trusted and with whom contact can be maintained.
Handshake – Grasp the other person's hand firmly and shake it a few times. This shows that you are enthusiastic.
Posture – Stand straight but not rigid; this will showcase that you are receptive and not very rigid in your thoughts.
Clothing – Appropriate clothing says that you are a leader with a winning potential.
How to exhibit professionalism:
• Empathy
• Positive Attitude
• Teamwork
• Professional Language
• Knowledge
• Punctual
• Confident
• Emotionally stable

4. Explain about Central Limit Theorem[10] Central Limit Theorem The central limit theorem states that under certain (fairly common) conditions, the sum of many random variables will have an approximately normal distribution. More specifically, where X1, …, Xn are independent and identically distributed random variables with the same arbitrary distribution, zero mean, and variance σ2; and Z is their mean scaled by

Z = √n · (1/n) · Σᵢ₌₁ⁿ Xᵢ

Then, as n increases, the probability distribution of Z will tend to the normal distribution with zero mean and variance (σ2). The central limit theorem also implies that certain distributions can be approximated by the normal distribution, for example: • The binomial distribution B(n, p) is approximately normal with mean np and variance np(1−p) for large n and for p not too close to zero or one. • The Poisson distribution with parameter λ is approximately normal with mean λ and variance λ, for large values of λ. • The chi-squared distribution χ2(k) is approximately normal with mean k and variance 2k, for large k. • The Student's t-distribution t(ν) is approximately normal with mean 0 and variance 1 when ν is large
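As a hedged illustration of the theorem, the following R sketch draws repeated samples from a clearly non-normal population and shows that the sample means look normal; the exponential population, the sample size and the number of repetitions are illustrative assumptions, not part of the original text:

set.seed(42)
n    <- 1000    # observations per sample
reps <- 5000    # number of repeated samples

# Population that is clearly not normal: exponential with mean 1
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The distribution of the sample means is approximately normal
hist(sample_means, breaks = 40, main = "Distribution of sample means")
qqnorm(sample_means)
qqline(sample_means)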

5) Explain Probability Distribution and its types. How do you use Probability distribution in R? [10] Probability distribution: describes how the values of a random variable are distributed. The probability distribution for a random variable X gives the possible values for X and the probabilities associated with each possible value (i.e., the likelihood that the values will occur). The methods used to specify discrete probability distributions are similar to (but slightly different from) those used to specify continuous probability distributions. Binomial distribution: the collection of all possible outcomes of a sequence of coin tosses. Normal distribution: the means of sufficiently large samples of a data population. • Because the characteristics of these theoretical distributions are well understood, they can be used to make statistical inferences on the entire data population as a whole.

Example: Probability of the ace of Diamonds in a pack of 52 cards when 1 card is pulled out at random. "At random" means that there is no biased treatment. No. of Aces of Diamonds in a pack = S = 1; Total no. of possible outcomes = Total no. of cards in the pack = 52; Probability of a positive outcome = S/P = 1/52. That is, we have about a 1.92% chance of a positive outcome.

Probability Distribution Function (PDF): It defines the probability of outcomes based on certain conditions. Based on these conditions, there are several commonly used types of PDFs. Types of Probability Distribution: • Binomial Distribution • Poisson Distribution • Continuous Uniform Distribution • Exponential Distribution • Normal Distribution • Chi-squared Distribution • Student t Distribution • F Distribution Binomial Distribution The binomial distribution is a discrete probability distribution. It describes the outcome of n independent trials in an experiment. Each trial is assumed to have only two outcomes, either success or failure. If the probability of a successful trial is p, then the probability of having x successful outcomes in an experiment of n independent trials is f(x) = nCx · p^x · (1 − p)^(n − x), where x = 0, 1, 2, . . . , n. Problem Ex: Find the probability of getting 3 doublets when a pair of fair dice is thrown 10 times. Poisson Distribution The Poisson distribution is the probability distribution of independent event occurrences in an interval. If λ is the mean occurrence per interval, then the probability of having x occurrences within a given interval is f(x) = (λ^x · e^(−λ)) / x!, where x = 0, 1, 2, . . . Problem: If there are twelve cars crossing a bridge per minute on average, find the probability of having seventeen or more cars crossing the bridge in a particular minute. Solution: The probability of having sixteen or fewer cars crossing the bridge in a particular minute is given by the function ppois. Normal Distribution The normal distribution is defined by the following probability density function, where μ is the population mean and σ2 is the variance: f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)). If a random variable X follows the normal distribution, then we write X ~ N(μ, σ2). In particular, the normal distribution with μ = 0 and σ = 1 is called the standard normal distribution, and is denoted as N(0,1). It can be graphed as follows. Figure 1 shows the normal distribution of sample data. The shape of a normal curve is highly dependent on the standard deviation.

Normal distribution is a continuous distribution that is "bell-shaped". • Data are often assumed to be normal. • Normal distributions can estimate probabilities over a continuous interval of data values. Properties: The normal distribution f(x), with any mean μ and any positive standard deviation σ, has the following properties: It is symmetric around the point x = μ, which is at the same time the mode, the median and the mean of the distribution. It is unimodal: its first derivative is positive for x < μ, negative for x > μ, and zero only at x = μ. Its density has two inflection points (where the second derivative of f is zero and changes sign), located one standard deviation away from the mean, at x = μ − σ and x = μ + σ. Its density is log-concave. • Its density is infinitely differentiable, indeed supersmooth of order 2. Its second derivative f′′(x) is equal to its derivative with respect to its variance σ2.

Normal Distribution in R: Description: Density, distribution function, quantile function and random generation for the normal distribution with mean equal to mean and standard deviation equal to sd. The normal distribution is important because of the Central Limit Theorem, which states that the distribution of the means of all possible samples of size n from a population with mean μ and variance σ2 approaches a normal distribution with mean μ and variance σ2/n as n approaches infinity. OR 6) How do you summarize data with R? [5]  To summarize data in R Studio we mainly use two functions, summary() and aggregate(). Using the Summary command Summary Statistics - Summarizing data with R: Example1: > grass rich graze 1 12 mow 2 15 mow 3 17 mow 4 11 mow 5 15 mow 6 8 unmow 7 9 unmow 8 7 unmow 9 9 unmow a) summary(): It gives the summary statistics of a data object in terms of min, max, 1st Quartile, 3rd Quartile and mean/median values. > x<-c(1,2,3,4,5,6,7,8,9,10,11,12) > summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 3.75 6.50 6.50 9.25 12.00 > summary(grass) rich graze Min. : 7.00 mow :5 1st Qu.: 9.00 unmow:4 Median :11.00 Mean :11.44 3rd Qu.:15.00 Max. :17.00 > summary(graze) Length Class Mode 9 character character b) str(): It gives the structure of a data object in terms of the class of the object, the number of observations, and each variable's class with sample data. c) tail(): It gives the last 6 observations of the given data object. Example3: > tail(iris) > tail(mtcars) d) head(): It displays the top 6 observations from the dataset. e) names(): It returns the column names. f) nrow(): It returns the number of observations in the given dataset. g) fix(): To edit (fix) the data in the given dataset in a spreadsheet-style editor, e.g. fix(iris). > fix(mydFrame1) h) with(): To avoid repeating the $ prefix along with attribute names. i) aggregate(): To get the summary statistic of a specific column with respect to the different levels in the class attribute: aggregate(x ~ y, data, mean), where x is numeric and y is of factor type. > aggregate(rich~graze, grass, mean) graze rich 1 mow 14.00 2 unmow 8.25 j) subset(): To subset the data based on a condition: subset(data, x > 7, select = c(x, y)), where x is one of the variables in data and select gives the chosen columns in the specified order. > subset(grass, rich>7, select=c(graze,rich)) See the sketch below for a runnable version of these commands.
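A runnable version of the commands above, as a minimal sketch. The small 'grass' data frame is reconstructed here from the printed output shown earlier (it is not a built-in R dataset):

# Reconstruct the example data frame used above
grass <- data.frame(rich  = c(12, 15, 17, 11, 15, 8, 9, 7, 9),
                    graze = factor(c(rep("mow", 5), rep("unmow", 4))))

summary(grass)                                    # five-number summary and factor counts
str(grass)                                        # structure: 9 obs. of 2 variables
head(grass); tail(grass)                          # first and last observations
names(grass); nrow(grass)                         # column names and number of rows
aggregate(rich ~ graze, grass, mean)              # mean richness per grazing level
subset(grass, rich > 7, select = c(graze, rich))  # rows meeting a condition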

b)What are the fundamentals to build a Team? [5] Team Development • Team building is any activity that builds and strengthens the team as a team. Team building fundamentals • Clear Expectations – Vision/Mission • Context – Background – Why participation in Teams? • Commitment – dedication – Service as valuable to Organization & Own • Competence – Capability – Knowledge • Charter – agreement – Assigned area of responsibility • Control – Freedom & Limitations • Collaboration – Team work • Communication • Consequences – Accountable for rewards • Coordination • Cultural Change Roles of team member • Communicate • Don't Blame Others • Support Group Member's Ideas • No Bragging(Arrogant) – No Full of yourself • Listen Actively • Get Involved • Coach, Don't Demonstrate • Provide Constructive Criticism • Try To Be Positive • Value Your Group's Ideas

7) What is summarization of data? Explain different types of summarization in R[10] Summary Statistics - Summarizing data with R: Example1: > grass rich graze 1 12 mow 2 15 mow 3 17 mow 4 11 mow 5 15 mow 6 8 unmow 7 9 unmow 8 7 unmow 9 9 unmow a) summary(): It gives the summary statistics of data object in terms of min, max,1st Quartile and 3rd Quartile mean/median values. > x<-c(1,2,3,4,5,6,7,8,9,10,11,12) > summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 3.75 6.50 6.50 9.25 12.00 > summary(grass) Rich graze Min. : 7.00 mow :5 1st Qu.: 9.00 unmow:4 Median :11.00 Mean :11.44 3rd Qu.:15.00 Max. :17.00 > summary(graze) Length Class Mode 9 character character b) str(): It gives the structure of data object in terms of class of object, No. of observations and each variable class and sample data. Example2: > str(mtcars) 'data.frame': 32 obs. of 11 variables: $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... $ disp: num 160 160 108 258 360 ... $ hp : num 110 110 93 110 175 105 245 62 95 123 ... $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... $ wt : num 2.62 2.88 2.32 3.21 3.44 ... $ qsec: num 16.5 17 18.6 19.4 17 ... $ vs : num 0 0 1 1 0 1 0 1 1 1 ... $ am : num 1 1 1 0 0 0 0 0 0 0 ... $ gear: num 4 4 4 3 3 3 3 4 4 4 ... $ carb: num 4 4 1 1 2 1 4 2 2 4 ... > str(grass) 'data.frame': 9 obs. of 2 variables: $ rich : int 12 15 17 11 15 8 9 7 9 $ graze: Factor w/ 2 levels "mow","unmow": 1 1 1 1 1 2 2 c) Tail(): It gives the last 6 observations of the given data object. Example3: > tail(iris) > tail(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2 > tail(HairEyeColor,2) [1] 7 8 > tail(state.x77,2) Population Income Illiteracy Life Exp Murder HS Grad Frost Area Wisconsin 4589 4468 0.7 72.48 3.0 54.5 149 54464 Wyoming 376 4566 0.6 70.29 6.9 62.9 173 97203 > tail(grass) rich graze 4 11 mow 5 15 mow 6 8 unmow 7 9 unmow 8 7 unmow 9 9 unmow d) Head(): It displays the top 6 observations from dataset Example: > head(iris) e) Names(): It returns the coloum names > names(mtcars) [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb" >names(grass) graze rich f) nrow(): It returns the number of observations in the given dataset. > dim(mtcars) [1] 32 11 > nrow(mtcars) [1] 32 > ncol(mtcars) [1] 11 >nrow(iris) 9 g) fix(iris): To fix the data in the given dataset. > fix(mydF ------Methods to Summarise Data in R 1. apply Apply function returns a vector or array or list of values obtained by applying a function to either rows or columns. This is the simplest of all the function which can do this job. However this function is very specific to collapsing either row or column. m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2) apply(m, 1, mean) [1] 6 7 8 9 10 11 12 13 14 15 apply(m, 2, mean) [1] 5.5 15.5 2. lapply ―lapply‖ returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.‖ l <- list(a = 1:10, b = 11:20) lapply(l, mean) $a [1] 5.5 $b [1] 15.5

3. sapply ―sapply‖ does the same thing as apply but returns a vector or matrix. Let‘s consider the last example again. l <- list(a = 1:10, b = 11:20) l.mean <- sapply(l, mean) class(l.mean) [1] "numeric"

4. tapply So far, none of the functions we discussed can do what SQL can achieve. Here is a function which completes the palette for R. Usage is tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE), where X is "an atomic object, typically a vector" and INDEX is "a list of one or more factors, each of the same length as X". Here is an example which will make the usage clear. attach(iris) # mean petal length by species tapply(iris$Petal.Length, Species, mean) setosa versicolor virginica 1.462 4.260 5.552

5. by Now comes a slightly more complicated algorithm. Function 'by' is an object-oriented wrapper for 'tapply' applied to data frames. Hopefully the example will make it more clear. attach(iris) by(iris[, 1:4], Species, colMeans) Species: setosa Sepal.Length Sepal.Width Petal.Length Petal.Width 5.006 3.428 1.462 0.246 ------Species: versicolor Sepal.Length Sepal.Width Petal.Length Petal.Width 5.936 2.770 4.260 1.326 ------Species: virginica Sepal.Length Sepal.Width Petal.Length Petal.Width 6.588 2.974 5.552 2.026 6. sqldf If you found any of the above statements difficult, don't panic. I bring you a life line which you can use anytime. Let's fit the SQL queries into R. Here is a way you can do the same. library(sqldf) attach(iris) summarization <- sqldf('select Species, avg("Petal.Length") as "Petal.Length_mean" from iris where Species is not null group by Species') And it's done. Wasn't it simple enough? One setback of this approach is the amount of time it takes to execute. In case you are interested in getting speed and the same results, read the next section. 7. ddply Fastest of all we discussed. You will need an additional package. Let's do exactly what we did in the tapply section. library(plyr) attach(iris) # mean petal length by species ddply(iris,"Species",summarise, Petal.Length_mean = mean(Petal.Length))

We can also use packages such as dplyr, data.table to summarize data. Here‘s a complete tutorial on useful packages for data manipulation in R – Faster Data Manipulation with these 7 R Packages. In general if you are trying to add this summarisation step in the middle of a process and need a table as output, you need to go for sqldf or ddply. ―ddply‖ in these cases is faster but will not give you options beyond just grouping. ―sqldf‖ has all features you need to summarize the data in SQL statements. In case you are interested in using function similar to pivot tables or transposing the tables, you can consider using ―reshape‖. We have covered a few examples of the same in our article – comprehensive guide for data exploration in R.
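As a hedged sketch of the dplyr approach mentioned above (the pipe style and column names follow the iris example used earlier; this is one possible phrasing, not the only one):

library(dplyr)

# Mean petal length by species, equivalent to the tapply/ddply examples above
iris %>%
  group_by(Species) %>%
  summarise(Petal.Length_mean = mean(Petal.Length))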

8) Give an account on probability distribution and its computation in R [10] Probability is the chance of occurrence of an event: P(A) = S/P, where S is the number of favourable (positive) outcomes and P is the total number of possible outcomes. A probability distribution describes how the values of a random variable are distributed. For example, the collection of all possible outcomes of a sequence of coin tosses is known to follow the binomial distribution, whereas the means of sufficiently large samples of a data population are known to resemble the normal distribution. Since the characteristics of these theoretical distributions are well understood, they can be used to make statistical inferences on the entire data population as a whole. For example, consider the probability of drawing the ace of Diamonds from a pack of 52 cards when 1 card is pulled out at random. "At random" means that there is no biased treatment of any card and the result is totally random. So, No. of Aces of Diamonds in a pack = S = 1; Total no. of possible outcomes = Total no. of cards in the pack = 52; Probability of a positive outcome = S/P = 1/52. That is, we have about a 1.92% chance of a positive outcome. Probability Distribution There are 2 types of distribution functions: 1. Discrete 2. Continuous. A Probability Distribution Function or PDF is the function that defines the probability of outcomes based on certain conditions. Based on these conditions, there are several commonly used types of PDFs. Types of Probability Distribution: • Binomial Distribution • Poisson Distribution • Continuous Uniform Distribution • Exponential Distribution • Normal Distribution • Chi-squared Distribution • Student t Distribution • F Distribution Normal Distribution We come now to the most important continuous probability density function and perhaps the most important probability distribution of any sort, the normal distribution. On several occasions, we have observed its occurrence in graphs from, apparently, widely differing sources: the sums when three or more dice are thrown; the binomial distribution for large values of n; and the hypergeometric distribution. There are many other examples as well and several reasons, which will appear here, to call this distribution "normal." If X has this density, we say that X has a normal probability distribution. A graph of a normal distribution, where we have chosen μ = 0 and σ = 1, appears in the figure below. The shape of a normal curve is highly dependent on the standard deviation. Importance of Normal Distribution: Normal distribution is a continuous distribution that is "bell-shaped". Data are often assumed to be normal. Normal distributions can estimate probabilities over a continuous interval of data values. Binomial Distribution The binomial distribution is a discrete probability distribution. It describes the outcome of n independent trials in an experiment. Each trial is assumed to have only two outcomes, either success or failure. If the probability of a successful trial is p, then the probability of having x successful outcomes in an experiment of n independent trials is f(x) = nCx · p^x · (1 − p)^(n − x), where x = 0, 1, 2, . . . , n. Poisson Distribution The Poisson distribution is the probability distribution of independent event occurrences in an interval. If λ is the mean occurrence per interval, then the probability of having x occurrences within a given interval is f(x) = (λ^x · e^(−λ)) / x!, where x = 0, 1, 2, . . .
Problem If there are twelve cars crossing a bridge per minute on average, find the probability of having seventeen or more cars crossing the bridge in a particular minute. Solution The probability of having sixteen or less cars crossing the bridge in a particular minute is given by the function ppois
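A short R sketch of how these probabilities can actually be computed with the standard base-R distribution functions; the normal-distribution call at the end uses illustrative numbers that are not part of the original text:

# Binomial: probability of exactly 3 doublets in 10 throws of a pair of fair dice
# (a doublet, both dice showing the same face, has probability 1/6 per throw)
dbinom(3, size = 10, prob = 1/6)

# Poisson: probability of 17 or more cars in a minute when the mean is 12
1 - ppois(16, lambda = 12)     # P(X >= 17) = 1 - P(X <= 16)

# Normal: P(X <= 84) for X ~ N(mean = 72, sd = 15.2), an illustrative example
pnorm(84, mean = 72, sd = 15.2)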

9) How do you summarize data in a data set? –Discuss with suitable examples.[5] Summary Statistics - Summarizing data with R: Example1: > grass rich graze 1 12 mow 2 15 mow 3 17 mow 4 11 mow 5 15 mow 6 8 unmow 7 9 unmow 8 7 unmow 9 9 unmow a) summary(): It gives the summary statistics of data object in terms of min, max,1st Quartile and 3rd Quartile mean/median values. > x<-c(1,2,3,4,5,6,7,8,9,10,11,12) > summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 3.75 6.50 6.50 9.25 12.00 > summary(grass) Rich graze Min. : 7.00 mow :5 1st Qu.: 9.00 unmow:4 Median :11.00 Mean :11.44 3rd Qu.:15.00 Max. :17.00 > summary(graze) Length Class Mode 9 character character > summary(grass$graze) mow unmow 5 4 b) str(): It gives the structure of data object in terms of class of object, No. of observations and each variable class and sample data. Example2: > str(mtcars) 'data.frame': 32 obs. of 11 variables: $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... $ disp: num 160 160 108 258 360 ... $ hp : num 110 110 93 110 175 105 245 62 95 123 ... $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... $ wt : num 2.62 2.88 2.32 3.21 3.44 ... $ qsec: num 16.5 17 18.6 19.4 17 ... $ vs : num 0 0 1 1 0 1 0 1 1 1 ... $ am : num 1 1 1 0 0 0 0 0 0 0 ... $ gear: num 4 4 4 3 3 3 3 4 4 4 ... $ carb: num 4 4 1 1 2 1 4 2 2 4 ... > str(grass) 'data.frame': 9 obs. of 2 variables: $ rich : int 12 15 17 11 15 8 9 7 9 $ graze: Factor w/ 2 levels "mow","unmow": 1 1 1 1 1 2 2 c) Tail(): It gives the last 6 observations of the given data object. Example3: > tail(iris) > tail(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2 > tail(HairEyeColor,2) [1] 7 8 > tail(state.x77,2) Population Income Illiteracy Life Exp Murder HS Grad Frost Area Wisconsin 4589 4468 0.7 72.48 3.0 54.5 149 54464 Wyoming 376 4566 0.6 70.29 6.9 62.9 173 97203 b)Explain how can you find out the mean, mode and median for iris data set. [5]

In the Iris data set, check whether Sepal Length is normally distributed or not. Use: To find whether Sepal Length is normally distributed or not we use 2 commands - qqnorm() & qqline(). qqnorm() shows the actual distribution of the data, while qqline() shows the line on which the data would lie if the data were normally distributed. The deviation of the plot from the line shows that the data is not normally distributed, as illustrated in the sketch below.
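A minimal sketch of these commands on the built-in iris data (the Shapiro–Wilk call is an optional extra check, not part of the original question):

# Normal Q-Q plot for Sepal.Length, with the reference line
qqnorm(iris$Sepal.Length)
qqline(iris$Sepal.Length)

# Optional formal test of normality (Shapiro-Wilk, discussed earlier in this unit)
shapiro.test(iris$Sepal.Length)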

Suppose we want to segregate the data of flowers having Sepal length greater than 7 and Sepal width greater than 3 simultaneously. Solution: When we have to use more than 1 condition we combine them with &, as shown below.
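A minimal sketch of combining the two conditions with & on the built-in iris data (column names as in the standard iris dataset):

# Flowers with Sepal.Length > 7 and Sepal.Width > 3 at the same time
subset(iris, Sepal.Length > 7 & Sepal.Width > 3)

# Equivalent indexing form
iris[iris$Sepal.Length > 7 & iris$Sepal.Width > 3, ]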

OR 10) What is random and bivariate random variable? Explain through examples[5] Random & Bivariate Random Variables Random Variable:

• A random variable, aleatory variable or stochastic variable is a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical sense). A random variable can take on a set of possible different values (similarly to other mathematical variables), each with an associated probability, in contrast to other mathematical variables.

• A random variable is a real-valued function defined on the points of a sample space.

Random variables can be discrete, that is, taking any of a specified finite or countable list of values, endowed with a probability mass function, characteristic of a probability distribution; or continuous, taking any numerical value in an interval or collection of intervals, via a probability density function that is characteristic of a probability distribution; or a mixture of both types. The realizations of a random variable, that is, the results of randomly choosing values according to the variable's probability distribution function, are called random variates

For example, if we toss a coin 10 times and get heads 8 times, we still cannot say whether the 11th toss will give a head or a tail. But we are sure that we will get either a head or a tail.
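A small simulation sketch of this idea in R (the seed and the number of tosses are illustrative choices):

# Simulate 10 tosses of a fair coin; the 11th toss cannot be predicted from these
set.seed(1)
sample(c("Head", "Tail"), size = 10, replace = TRUE)

# The number of heads in 10 tosses is itself a random variable, Binomial(10, 0.5)
rbinom(1, size = 10, prob = 0.5)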

Bivariate Random Variable: A bivariate random variable is a pair of random variables (X, Y) defined on the same sample space and considered jointly. For example, when a coin is flipped twice, the pair (outcome of the first flip, outcome of the second flip) is a bivariate random variable. b) Explain Frequentist tests and Bayesian tests [5]

Frequentist tests:

Tests of univariate normality include D'Agostino's K-squared test, the Jarque–Bera test, the Anderson–Darling test, the Cramér–von Mises criterion, the Lilliefors test for normality (itself an adaptation of the Kolmogorov–Smirnov test), the Shapiro–Wilk test, the Pearson's chi-squared test, and the Shapiro–Francia test. A 2011 paper from The Journal of Statistical Modeling and Analytics concludes that Shapiro–Wilk has the best power for a given significance, followed closely by Anderson–Darling, when comparing the Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors, and Anderson–Darling tests. Some published works recommend the Jarque–Bera test, but it is not without weakness: it has low power for distributions with short tails, especially for bimodal distributions. Other authors have declined to include its data in their studies because of its poor overall performance. Historically, the third and fourth standardized moments (skewness and kurtosis) were some of the earliest tests for normality. The Jarque–Bera test is itself derived from skewness and kurtosis estimates. Mardia's multivariate skewness and kurtosis tests generalize the moment tests to the multivariate case. Other early test statistics include the ratio of the mean absolute deviation to the standard deviation and of the range to the standard deviation. More recent tests of normality include the energy test (Székely and Rizzo) and the tests based on the empirical characteristic function (ecf) (e.g. Epps and Pulley, Henze–Zirkler, BHEP test). The energy and the ecf tests are powerful tests that apply for testing univariate or multivariate normality and are statistically consistent against general alternatives. The normal distribution has the highest entropy of any distribution for a given standard deviation. There are a number of normality tests based on this property, the first attributable to Vasicek.

Bayesian tests:

Kullback–Leibler divergences between the whole posterior distributions of the slope and variance do not indicate non-normality. However, the ratio of expectations of these posteriors and the expectation of the ratios give similar results to the Shapiro–Wilk statistic except for very small samples, when non-informative priors are used. Spiegelhalter suggests using a Bayes factor to compare normality with a different class of distributional alternatives. This approach has been extended by Farrell and Rogers-Stewart ------END OF UNIT-2 ------

Model Questions- Unit-3 1. Define NO SQL?  A NoSQL (originally referring to "non SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.  NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages

2. Write the differences of NO SQL and SQL

Differences between SQL and NoSQL databases:

Types
  SQL Database: One type (SQL database) with minor variations.
  NoSQL Database: Many different types, including key-value stores, document databases, wide-column stores, and graph databases.

Data storage model
  SQL Database: Individual records (e.g., "employees") are stored as rows in tables, with each column storing a specific piece of data about that record (e.g., "manager," "date hired," etc.), much like a spreadsheet. Separate data types are stored in separate tables, and then joined together when more complex queries are executed.
  NoSQL Database: Varies based on database type. For example, key-value stores function similarly to SQL databases, but have only two columns ("key" and "value"), with more complex information sometimes stored within the "value" columns. Document databases do away with the table-and-row model altogether, storing all relevant data together in a single "document" in JSON, XML, or another format, which can nest values hierarchically.

Examples
  SQL Database: MySQL, Postgres, Oracle Database.
  NoSQL Database: MongoDB, Cassandra, HBase, Neo4j.

Schemas
  SQL Database: Structure and data types are fixed in advance. To store information about a new data item, the entire database must be altered, during which time the database must be taken offline.
  NoSQL Database: Typically dynamic. Records can add new information on the fly, and unlike SQL table rows, dissimilar data can be stored together as necessary. For some databases (e.g., wide-column stores), it is somewhat more challenging to add new fields dynamically.

(The differences are not limited to these.)

3. Explain how to Read R output in Excel  First create a csv output from an R data.frame, then read this file in Excel. There is one function that you need to know: write.table. You might also want to consider write.csv, which uses "." for the decimal point and a comma for the separator, and write.csv2, which uses a comma for the decimal point and a semicolon for the separator. x <- cbind(rnorm(20),runif(20)) colnames(x) <- c("A","B") write.table(x,"your_path",sep=",",row.names=FALSE)

4. List NO SQL database examples.  A NoSQL (originally referring to "non SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.  NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages Examples of NOSQL are . MongoDB . Cassandra, . HBase, . Neo4j

5. Explain how SQL is used in R. SQL using R:  sqldf is an R package for running SQL statements on data frames. To load the "sqldf" package we use the step below: library(sqldf) # Use the titanic data set data(titanic3, package="PASWR") colnames(titanic3) head(titanic3) 6. Write short notes on R connector. There are different approaches in R to connect with Excel to perform read, write and execute activities.

XLConnect: It might be slow for a large dataset but is very powerful otherwise. require(XLConnect) wb <- loadWorkbook("myfile.xlsx") myDf <- readWorksheet(wb, sheet = "Sheet1", header = TRUE) xlsx: This package requires a Java runtime to install, so it is suitable for Java-supported environments. Prefer read.xlsx2() over read.xlsx(); it is significantly faster for large datasets. require(xlsx) read.xlsx2("myfile.xlsx", sheetName = "Sheet1")

7. Give the uses of NOSQL databases  A NoSQL (originally referring to "non SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.  When compared to relational databases, NoSQL databases are more scalable and provide superior performance, and their data model addresses several issues that the relational model is not designed to address: • Large volumes of structured, semi-structured, and unstructured data • Agile sprints, quick iteration, and frequent code pushes • Object-oriented programming that is easy to use and flexible • Efficient, scale-out architecture instead of expensive, monolithic architecture 8. How NOSQL is faster than SQL? NoSQL databases are specifically designed for unstructured data which can be document-oriented, column-oriented, graph-based, etc. In this case, a particular data entity is stored together and not partitioned.  A NoSQL (originally referring to "non SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. NoSQL databases are increasingly used in big data and real-time web applications 9. What do you mean by data manipulation?-give examples.  Data manipulation is the process of changing data to make it easier to read or be more organized. For example, a log of data could be organized in alphabetical order, making individual entries easier to locate SQL  Specific language using Select, Insert, and Update statements, e.g. SELECT fields FROM table WHERE NOSQL- Through object-oriented APIs 10. State the difference between NOSQL and SQL Differences SQL Database NO SQL Database Types One type (SQL database) Many different types with minor variations including key-value stores, document databases, wide- column stores, and graph databases Data storage model Individual records (e.g., Varies based on database "employees") are stored as type. For example, key- rows in tables, with each value stores function column storing a specific similarly to SQL databases, piece of data about that but have only two columns record (e.g., "manager," ("key" and "value"), with "date hired," etc.), much more complex information like a spreadsheet. Separate sometimes stored within the data types are stored in "value" columns. Document separate tables, and then databases do away with the joined together when more table-and-row model complex queries are altogether, storing all executed relevant data together in single "document" in JSON, XML, or another format, which can nest values hierarchically Examples MySQL, Postgres, Oracle MongoDB, Cassandra, Database HBase, Neo4j

Schemas Structure and data types are Typically dynamic. Records fixed in advance. To store can add new information on information about a new the fly, and unlike SQL data item, the entire table rows, dissimilar data database must be altered, can be stored together as during which time the necessary. For some database must be taken databases (e.g., wide- offline column stores), it is somewhat more challenging to add new fields dynamically ------Part-B 1 a) What did you understand about SQL using R?Explain[5] SQL is a database query language - a language designed specifically for interacting with a database. It offers syntax for extracting data, updating data, replacing data, creating data, etc. For our purposes, it will typically be used when accessing data off a server database. If the database isn‘t too large, you can grab the entire data set and stick it in adata.frame. However, often the data are quite large so you interact with it piecemeal via SQL.

There are various database implementations (SQLite, Microsoft SQL Server, PostgreSQL, etc) which are database management software which use SQL to access the data. The method of connecting with each database may differ, but they support SQL (specifically they support ANSI SQL) and often extend it in subtle ways. This means that in general, SQL written to access a SQLite database may not work to access a PostgreSQL database. sqldf package library(sqldf)

The sqldf package is incredibly simple, from R's point of view. There is a single function we are concerned about: sqldf. Passed to this function is a SQL statement, such as sqldf('SELECT age, circumference FROM Orange WHERE Tree = 1 ORDER BY circumference ASC') ## Warning: Quoted identifiers should have class SQL, use DBI::SQL() if the ## caller performs the quoting. ## age circumference ## 1 118 30 ## 2 484 58 ## 3 664 87 ## 4 1004 115 ## 5 1231 120 ## 6 1372 142 ## 7 1582 145

(Note: The above warning is due to some compatibility issues between sqldf and RSQLite and shouldn‘t affect anything.)

SQL Queries There are a large number of major SQL commands. Queries are accomplished with the SELECT command. First a note about convention: by convention, SQL syntax is written in all UPPER CASE and variable names/database names are written in lower case. Technically, SQL syntax is case insensitive, so it can be written in lower case or otherwise. Note however that R is not case insensitive, so variable names and data frame names must have proper capitalization. Hence sqldf("SELECT * FROM iris") sqldf("select * from iris") are equivalent, but this would fail (assuming you haven't created a new object called "IRIS"): sqldf("SELECT * from IRIS")

The basic syntax for SELECT is SELECT variable1, variable2 FROM data

For example, data(BOD) BOD ## Time demand ## 1 1 8.3 ## 2 2 10.3 ## 3 3 19.0 ## 4 4 16.0 ## 5 5 15.6 ## 6 7 19.8 sqldf('SELECT demand FROM BOD') ## demand ## 1 8.3 ## 2 10.3 ## 3 19.0 ## 4 16.0 ## 5 15.6 ## 6 19.8 sqldf('SELECT Time, demand from BOD') ## Time demand ## 1 1 8.3 ## 2 2 10.3 ## 3 3 19.0 ## 4 4 16.0 ## 5 5 15.6 ## 6 7 19.8

A quick sidenote: SQL does not like variables with . in their name. If you have any, refer to the variable wrapped in quotes, such as iris1 <- sqldf('SELECT Petal.Width FROM iris') ## Error in rsqlite_send_query(conn@ptr, statement): no such column: Petal.Width iris2 <- sqldf('SELECT "Petal.Width" FROM iris')

Wildcard A wild card can be passed to extract everything. bod2 <- sqldf('SELECT * FROM BOD') bod2 ## Time demand ## 1 1 8.3 ## 2 2 10.3 ## 3 3 19.0 ## 4 4 16.0 ## 5 5 15.6 ## 6 7 19.8

LIMIT To control the number of results returned, use LIMIT #. sqldf('SELECT * FROM iris LIMIT 5') ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa

ORDER BY To order variables, use the syntax ORDER BY var1 {ASC/DESC}, var2 {ASC/DESC} where the choice of ASC for ascending or DESC for descending is made per variable. sqldf("SELECT * FROM Orange ORDER BY age ASC, circumference DESC LIMIT 5") ## Tree age circumference ## 1 2 118 33 ## 2 4 118 32 ## 3 1 118 30 ## 4 3 118 30 ## 5 5 118 30

WHERE Conditional statements can be added via WHERE: sqldf('SELECT demand FROM BOD WHERE Time < 3') ## demand ## 1 8.3 ## 2 10.3

Both AND and OR are valid, along with parentheses to affect the order of operations. sqldf('SELECT * FROM rock WHERE (peri > 5000 AND shape < .05) OR perm > 1000') ## area peri shape perm ## 1 5048 941.543 0.328641 1300 ## 2 1016 308.642 0.230081 1300 ## 3 5605 1145.690 0.464125 1300 ## 4 8793 2280.490 0.420477 1300

There are few more complicated ways to use WHERE:

IN WHERE IN is used similar to R‘s %in%. It also supports NOT. sqldf('SELECT * FROM BOD WHERE Time IN (1,7)') ## Time demand ## 1 1 8.3 ## 2 7 19.8 sqldf('SELECT * FROM BOD WHERE Time NOT IN (1,7)') ## Time demand ## 1 2 10.3 ## 2 3 19.0 ## 3 4 16.0 ## 4 5 15.6

LIKE LIKE can be thought of as a weak regular expression command. It only allows the single wildcard % which matches any number of characters. For example, to extract the data where the feed ends with ―bean‖: sqldf('SELECT * FROM chickwts WHERE feed LIKE "%bean" LIMIT 5') ## weight feed ## 1 179 horsebean ## 2 160 horsebean ## 3 136 horsebean ## 4 227 horsebean ## 5 217 horsebean sqldf('SELECT * FROM chickwts WHERE feed NOT LIKE "%bean" LIMIT 5') ## weight feed ## 1 309 linseed ## 2 229 linseed ## 3 181 linseed ## 4 141 linseed ## 5 260 linseed

Aggregated data Select statements can create aggregated data using AVG, MEDIAN, MAX, MIN, and SUM as functions in the list of variables to select. The GROUP BY statement can be added to aggregate by groups. AS can name the resulting column. sqldf("SELECT AVG(circumference) FROM Orange") ## AVG(circumference) ## 1 115.8571 sqldf("SELECT tree, AVG(circumference) AS meancirc FROM Orange GROUP BY tree") ## Tree meancirc ## 1 1 99.57143 ## 2 2 135.28571 ## 3 3 94.00000 ## 4 4 139.28571 ## 5 5 111.14286

Counting data SELECT COUNT() returns the number of observations. Passing * or nothing returns total rows, passing a variable name returns the number of non-NA entries. AS works as well. d <- data.frame(a = c(1,1,1), b = c(1,NA,NA)) d ## a b ## 1 1 1 ## 2 1 NA ## 3 1 NA sqldf("SELECT COUNT() as numrows FROM d") ## numrows ## 1 3 sqldf("SELECT COUNT(b) FROM d") ## COUNT(b) ## 1 1 b)Distinguish between SQL and NoSQL [5]

Differences between SQL and NoSQL databases, row by row:

1. Types
  SQL Database: One type (SQL database) with minor variations.
  NoSQL Database: Many different types, including key-value stores, document databases, wide-column stores, and graph databases.

2. Development History
  SQL Database: Developed in the 1970s to deal with the first wave of data storage applications.
  NoSQL Database: Developed in the 2000s to deal with limitations of SQL databases, particularly concerning scale, replication and unstructured data storage.

3. Examples
  SQL Database: MySQL, Postgres, Oracle Database.
  NoSQL Database: MongoDB, Cassandra, HBase, Neo4j.

4. Data Storage Model
  SQL Database: Individual records (e.g., "employees") are stored as rows in tables, with each column storing a specific piece of data about that record (e.g., "manager," "date hired," etc.), much like a spreadsheet. Separate data types are stored in separate tables, and then joined together when more complex queries are executed. For example, "offices" might be stored in one table, and "employees" in another. When a user wants to find the work address of an employee, the database engine joins the "employee" and "office" tables together to get all the information necessary.
  NoSQL Database: Varies based on database type. For example, key-value stores function similarly to SQL databases, but have only two columns ("key" and "value"), with more complex information sometimes stored within the "value" columns. Document databases do away with the table-and-row model altogether, storing all relevant data together in a single "document" in JSON, XML, or another format, which can nest values hierarchically.

5. Schemas
  SQL Database: Structure and data types are fixed in advance. To store information about a new data item, the entire database must be altered, during which time the database must be taken offline.
  NoSQL Database: Typically dynamic. Records can add new information on the fly, and unlike SQL table rows, dissimilar data can be stored together as necessary. For some databases (e.g., wide-column stores), it is somewhat more challenging to add new fields dynamically.

6. Scaling
  SQL Database: Vertically, meaning a single server must be made increasingly powerful in order to deal with increased demand. It is possible to spread SQL databases over many servers, but significant additional engineering is generally required.
  NoSQL Database: Horizontally, meaning that to add capacity, a database administrator can simply add more commodity servers or cloud instances. The database automatically spreads data across servers as necessary.

7. Development Model
  SQL Database: Mix of open-source (e.g., Postgres, MySQL) and closed source (e.g., Oracle Database).
  NoSQL Database: Open-source.

8. Supports Transactions
  SQL Database: Yes, updates can be configured to complete entirely or not at all.
  NoSQL Database: Yes, updates can be configured to complete entirely or not at all.

9. Data Manipulation
  SQL Database: Specific language using Select, Insert, and Update statements, e.g. SELECT fields FROM table WHERE ...
  NoSQL Database: Through object-oriented APIs.

10. Consistency
  SQL Database: Can be configured for strong consistency.
  NoSQL Database: Depends on product. Some provide strong consistency (e.g., MongoDB) whereas others offer eventual consistency (e.g., Cassandra).

OR 2 Write short notes on the following a) Professionalism [5] What is professionalism: • Professionalism is the competence or set of skills that are expected from a professional

• Professionalism determines how a person is perceived by his employer, co-workers, and casual contacts.

• How long does it take for someone to form an opinion about you? • Studies have proved that it just takes six seconds for a person to form an opinion about another person.

How does someone form an opinion about you…… Eye Contact – Maintaining eye contact with a person or the audience says that you are confident. It says that you are someone who can be trusted and hence can maintain contact with you. Handshake – Grasp the other person‘s hand firmly and shake it a few times. This shows that you are enthusiastic. Posture – Stand straight but not rigid, this will showcase that you are receptive and not very rigid in your thoughts. Clothing – Appropriate clothing says that you are a leader with a winning potential. How to exhibit professionalism… • Empathy • Positive Attitude • Teamwork • Professional Language

b) Effective communication skills [5] Effective Communication We would probably all agree that effective communication is essential to workplace effectiveness. And yet, we probably don‘t spend much time thinking about how we communicate, and how we might improve our communication skills. The purpose of building communication skills is to achieve greater understanding and meaning between people and to build a climate of trust, openness, and support. To a large degree, getting our work done involves working with other people. And a big part of working well with other people is communicating effectively. Sometimes we just don‘t realize how critical effective communication is to getting the job done. So, let‘s have an experience that reminds us of the importance of effective communication. Actually, this experience is a challenge to achieve a group result without any communication at all! Let‘s give it a shot. What is Effective Communication? We cannot not communicate. The question is: Are we communicating what we intend to communicate? Does the message we send match the message the other person receives? Impression = Expression Real communication or understanding happens only when the receiver‘s impression matches what the sender intended through his or her expression.So the goal of effective communication is a mutual understanding of the message. There are three main forms of Communication: 1. Verbal communication 2. Non verbal communication 3. Written communication

Verbal Communication Verbal communication refers to the use of sounds and language to relay a message. It serves as a vehicle for expressing desires, ideas and concepts and is vital to the processes of learning and teaching. In combination with nonverbal forms of communication, verbal communication acts as the primary tool for expression between two or more people Non Verbal Communication How do we communicate without words??? • We communicate a lot to each other outside what we say. • We create confusion when our verbal and nonverbal don‘t match &When verbal and nonverbal messages don‘t match, we tend to ―listen‖ to the nonverbal one.

(Intuitively, we generally view others' "body language" as a more reliable indicator of their attitudes and feelings than their words.) • We can learn to read the meanings of nonverbal behaviors. o The key is discovering an individual's behavior patterns; there is predictability to their meaning. o However, be careful: people can mask their feelings. o Also, trying to read something into every movement others make can get in the way of effective interactions.

3. a)Explain how SQL is used in R[5] One type (SQL database) with minor variations Developed in 1970s to deal with first wave of data storage applications Examples: MySQL, Postgres, Oracle Database Individual records (e.g., "employees") are stored as rows in tables, with each column storing a specific piece of data about that record (e.g., "manager," "date hired," etc.), much like a spreadsheet. Separate data types are stored in separate tables, and then joined together when more complex queries are executed. For example, "offices" might be stored in one table, and "employees" in another. When a user wants to find the work address of an employee, the database engine joins the "employee" and "office" tables together to get all the information necessary.

Schemas: Structure and data types are fixed in advance. To store information about a new data item, the entire database must be altered, during which time the database must be taken offline. Development model: Mix of open-source (e.g., Postgres, MySQL) and closed source (e.g., Oracle Database). Data manipulation: Specific language using Select, Insert, and Update statements, e.g. SELECT fields FROM table WHERE ... Consistency: Can be configured for strong consistency. SQL using R: sqldf is an R package for running SQL statements on data frames. To load the "sqldf" package we use the step below: library(sqldf) # Use the titanic data set data(titanic3, package="PASWR") colnames(titanic3) head(titanic3) Example queries are sketched below.
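A hedged sketch of queries that could follow, assuming the PASWR package is installed and that titanic3 contains the columns pclass, sex and survived (standard in that dataset, but treat the names as assumptions):

library(sqldf)
data(titanic3, package = "PASWR")

# Passenger count and survival rate by class
sqldf("SELECT pclass, COUNT(*) AS n, AVG(survived) AS survival_rate
       FROM titanic3 GROUP BY pclass")

# Passenger count by sex
sqldf("SELECT sex, COUNT(*) AS n FROM titanic3 GROUP BY sex")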

b) Define and explain the term NOSQL [5] A NoSQL (originally referring to "non SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages. There have been various approaches to classify NoSQL databases, each with different categories and subcategories, some of which overlap When compared to relational databases, NoSQL databases are more scalable and provide superior performance, and their data model addresses several issues that the relational model is not designed to address: • Large volumes of structured, semi-structured, and unstructured data • Agile sprints, quick iteration, and frequent code pushes • Object-oriented programming that is easy to use and flexible • Efficient, scale-out architecture instead of expensive, monolithic architecture A basic classification based on data model, with examples: • Column:Accumulo, Cassandra, Druid, HBase, Vertica • Document:Clusterpoint, Apache CouchDB, Couchbase, DocumentDB, HyperDex, Lotus Notes, MarkLogic, MongoDB, OrientDB, Qizx • Key-value:CouchDB, Oracle NoSQL Database, Dynamo, FoundationDB, HyperDex, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike, OrientDB, MUMPS • Graph: Allegro, Neo4J, InfiniteGraph, OrientDB, Virtuoso, Stardog • Multi-model:OrientDB, FoundationDB, ArangoDB, Alchemy Database, CortexDB

OR 4. a)Illustrate the NOSQL database classification based on data model with examples[5]

NoSQL originally refers to "non SQL" or "non-relational" databases; such systems are also called "Not only SQL" to emphasize that they may support SQL-like query languages. A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. NoSQL databases are increasingly used in big data and real-time web applications. Classification of NoSQL databases based on data model, with examples: Document: Clusterpoint, Apache CouchDB, Couchbase, DocumentDB, HyperDex, Lotus Notes, MarkLogic, MongoDB, OrientDB, Qizx

Example : Parent-Child Relationship–Embedded Entity Here is an example of denormalization of the SALES_ITEM schema in a Document database:

{ "_id": "123", "date": "10/10/2017", "ship_status": "backordered", "orderitems": [ { "itemid": "4348", "price": 10.00 }, { "itemid": "5648", "price": 15.00 } ] }

• Key-value: CouchDB, Oracle NoSQL Database, Dynamo, FoundationDB, HyperDex, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike, OrientDB, MUMPS • Graph: Allegro, Neo4J, InfiniteGraph, OrientDB, Virtuoso, Stardog • Multi-model: OrientDB, FoundationDB, ArangoDB, Alchemy Database, CortexDB

Here is a document model for the tree shown above (there are multiple ways to represent trees): { "_id": "USA", “type”:”state”, "children": ["TN",”FL] "parent": null } { "_id": "TN", “type”:”state”, "children": ["Nashville”,”Memphis”] "parent": "USA” } { "_id": "FL", “type”:”state”, "children": ["Miami”,”Jacksonville”] "parent": "USA” } { "_id": "Nashville", “type”:”city”, "children": [] "parent": "TN” }

Each document is a tree node, with the row key equal to the node id. The parent field stores the parent node id. The children field stores an array of children node ids. A secondary index on the parent and children fields allows the parent or children nodes to be found quickly, as in the sketch below.
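A hedged sketch of how such lookups might be issued from R, assuming a local MongoDB instance and the mongolite package; the collection and database names here are illustrative, not from the original text:

library(mongolite)

# Connect to an assumed local MongoDB collection holding the tree documents
nodes <- mongo(collection = "nodes", db = "treedb",
               url = "mongodb://localhost:27017")

# Children of node "TN": query on the parent field (uses the secondary index)
nodes$find('{"parent": "TN"}')

# Parent of "Nashville": look up its document and project only the parent field
nodes$find('{"_id": "Nashville"}', fields = '{"parent": true}')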

b)List the steps for connecting R to NOSQL database. [5] Excel and R integration with R connector 1 – Read Excel spreadsheet in R gdata: it requires you to install additional Perl libraries on Windows platforms but it‘s very powerful. require(gdata) myDf<- read.xls ("myfile.xlsx"), sheet = 1, header = TRUE) RODBC: This is reported for completeness only. It‘s rather dated; there are better ways to interact with Excel nowadays. XLConnect: It might be slow for large dataset but very powerful otherwise. require (XLConnect) wb<- loadWorkbook("myfile.xlsx") myDf<- readWorksheet(wb, sheet = "Sheet1", header = TRUE) xlsx: Prefer the read.xlsx2() over read.xlsx(), it‘s significantly faster for large dataset. require(xlsx) read.xlsx2("myfile.xlsx", sheetName = "Sheet1") xlsReadWrite: Available for Windows only. It‘s rather fast but doesn‘t support .xlsx files which is a serious drawback. It has been removed from CRAN lately. read.table(―clipboard‖): It allows to copy data from Excel and read it directly in R. This is the quick and dirty R/Excel interaction but it‘s very useful in some cases. myDf<- read.table("clipboard") 2 – Read R output in Excel First create a csv output from an R data.frame then read this file in Excel. There is one function that you need to know it‘swrite.table. You might also want to consider: write.csv which uses ―.‖ for the decimal point and a comma for the separator and write.csv2 which uses a comma for the decimal point and a semicolon for the separator. x <- cbind(rnorm(20),runif(20)) colnames(x) <- c("A","B") write.table(x,"your_path",sep=",",row.names=FALSE) 3 – Execute R code in VBA RExcel is from my perspective the best suited tool but there is at least one alternative. You can run a batch file within the VBA code. If R.exe is in your PATH, the general syntax for the batch file (.bat) is: R CMD BATCH [options] myRScript.R Here‘s an example of how to integrate the batch file above within your VBA code. 4 – Execute R code from an Excel spreadsheet Rexcel is the only tool I know for the task. Generally speaking once you installed RExcel you insert the excel code within a cell and execute from RExcel spreadsheet menu. See the RExcel references below for an example. 5 – Execute VBA code in R This is something I came across but I never tested it myself. This is a two steps process. First write a VBscript wrapper that calls the VBA code. Second run the VBscript in R with the system or shell functions. The method is described in full details here. 6 – Fully integrate R and Excel RExcel is a project developped by Thomas Baier and Erich Neuwirth, ―making R accessible from Excel and allowing to use Excel as a frontend to R‖. It allows communication in both directions: Excel to R and R to Excel and covers most of what is described above and more. I‘m not going to put any example of RExcel use here as the topic is largely covered elsewhere but I will show you where to find the relevant information. There is a wiki for installing RExcel and an excellent tutorial available here. I also recommend the following two documents: RExcel – Using R from within Excel and High-Level Interface Between R and Excel. They both give an in-depth view of RExcel capabilities

5) Explain the term NO SQL. Write the differences of NO SQL and SQL [10]

Define NoSQL Database: NoSQL originally referred to "non SQL" or "non-relational"; it is also expanded as "Not only SQL" to emphasize that such systems may support SQL-like query languages. A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. NoSQL databases are increasingly used in big data and real-time web applications.
Benefits of NoSQL Database: NoSQL databases are more scalable and provide superior performance. The NoSQL data model addresses several issues that the relational model is not designed to address:
• Large volumes of structured, semi-structured, and unstructured data
• Agile sprints, quick iteration, and frequent code pushes
• Object-oriented programming that is easy to use and flexible
• Efficient, scale-out architecture instead of expensive, monolithic architecture
Classification of NoSQL databases based on data model, with examples:
• Document: Clusterpoint, Apache CouchDB, Couchbase, DocumentDB, HyperDex, Lotus Notes, MarkLogic, MongoDB, OrientDB, Qizx
• Key-value: CouchDB, Oracle NoSQL Database, Dynamo, FoundationDB, HyperDex, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike, OrientDB, MUMPS
• Graph: Allegro, Neo4J, InfiniteGraph, OrientDB, Virtuoso, Stardog
• Multi-model: OrientDB, FoundationDB, ArangoDB, Alchemy Database, CortexDB
Difference between SQL and NoSQL Databases:
1. Types
SQL Database: One type (SQL database) with minor variations.
NoSQL Database: Many different types, including key-value stores, document databases, wide-column stores, and graph databases.
2. Development History
SQL Database: Developed in the 1970s to deal with the first wave of data storage applications.
NoSQL Database: Developed in the 2000s to deal with the limitations of SQL databases, particularly concerning scale, replication, and unstructured data storage.
3. Examples
SQL Database: MySQL, Postgres, Oracle Database.
NoSQL Database: MongoDB, Cassandra, HBase, Neo4j.
4. Data Storage Model
SQL Database: Individual records (e.g., "employees") are stored as rows in tables, with each column storing a specific piece of data about that record (e.g., "manager," "date hired," etc.), much like a spreadsheet. Separate data types are stored in separate tables and then joined together when more complex queries are executed. For example, "offices" might be stored in one table and "employees" in another; when a user wants to find the work address of an employee, the database engine joins the "employee" and "office" tables together to get all the information necessary.
NoSQL Database: Varies based on database type. For example, key-value stores function similarly to SQL databases but have only two columns ("key" and "value"), with more complex information sometimes stored within the "value" columns. Document databases do away with the table-and-row model altogether, storing all relevant data together in a single "document" in JSON, XML, or another format, which can nest values hierarchically.
5. Schemas
SQL Database: Structure and data types are fixed in advance. To store information about a new data item, the entire database must be altered, during which time the database must be taken offline.
NoSQL Database: Typically dynamic. Records can add new information on the fly, and unlike SQL table rows, dissimilar data can be stored together as necessary. For some databases (e.g., wide-column stores), it is somewhat more challenging to add new fields dynamically.
6. Scaling
SQL Database: Vertically, meaning a single server must be made increasingly powerful in order to deal with increased demand. It is possible to spread SQL databases over many servers, but significant additional engineering is generally required.
NoSQL Database: Horizontally, meaning that to add capacity, a database administrator can simply add more commodity servers or cloud instances. The database automatically spreads data across servers as necessary.
7. Development Model
SQL Database: Mix of open-source (e.g., Postgres, MySQL) and closed source (e.g., Oracle Database).
NoSQL Database: Open-source.
8. Supports Transactions
SQL Database: Yes, updates can be configured to complete entirely or not at all.
NoSQL Database: Yes, updates can be configured to complete entirely or not at all.
9. Data Manipulation
SQL Database: Specific language using SELECT, INSERT, and UPDATE statements, e.g. SELECT fields FROM table WHERE ...
NoSQL Database: Through object-oriented APIs.
10. Consistency
SQL Database: Can be configured for strong consistency.
NoSQL Database: Depends on the product. Some provide strong consistency (e.g., MongoDB) whereas others offer eventual consistency (e.g., Cassandra).
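As a small illustration of the SQL side of this comparison (rows 9 and 10 above), the sketch below runs an ordinary SQL query from R against an in-memory SQLite database. It assumes the DBI and RSQLite packages are installed; the table name and query are made up for the example.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")   # in-memory SQLite database
dbWriteTable(con, "mtcars", mtcars)               # copy the built-in mtcars data frame into a SQL table
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")   # result comes back as a data frame
dbDisconnect(con)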

OR 6 a)Explain how to Read R output in Excel. [5]

Excel and R integration with the R connector: there are different approaches in R to connect with Excel to perform read, write, and execute activities. To read R output in Excel, first create a CSV output from an R data.frame and then read this file in Excel. The one function you need to know is write.table. You might also want to consider write.csv, which uses "." for the decimal point and a comma as the separator, and write.csv2, which uses a comma for the decimal point and a semicolon as the separator.
x <- cbind(rnorm(20), runif(20))
colnames(x) <- c("A","B")
write.table(x, "your_path", sep = ",", row.names = FALSE)
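A minimal sketch of the same idea with write.csv and write.csv2 (the file names are placeholders chosen for the example):
x <- data.frame(A = rnorm(20), B = runif(20))
write.csv(x, "mydata_point.csv", row.names = FALSE)    # "." as decimal point, "," as separator
write.csv2(x, "mydata_comma.csv", row.names = FALSE)   # "," as decimal point, ";" as separator
Either file can then be opened directly in Excel.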

b)Explain how to Read Excel spreadsheet in R [5]

Read Excel spreadsheet in R: multiple packages are available to access an Excel sheet from R.
1. gdata: This package requires you to install additional Perl libraries on Windows platforms, but it is very powerful.
require(gdata)
myDf <- read.xls("myfile.xlsx", sheet = 1, header = TRUE)
2. XLConnect: It might be slow for a large dataset but is very powerful otherwise.
require(XLConnect)
wb <- loadWorkbook("myfile.xlsx")
myDf <- readWorksheet(wb, sheet = "Sheet1", header = TRUE)
3. xlsx: This package requires a Java runtime to install, so it is suitable for Java-supported environments. Prefer read.xlsx2() over read.xlsx(); it is significantly faster for large datasets.
require(xlsx)
read.xlsx2("myfile.xlsx", sheetName = "Sheet1")
Lab activity, example R script:
install.packages("rJava")
install.packages("xlsx")
require(xlsx)
> read.xlsx2("myfile.xlsx", sheetName = "Sheet1")
Sno Sname Marks Attendance Contactno Mailid
1 sri 45 45 988776655 [email protected]
2 vas 78 78 435465768 [email protected]
3 toni 34 46 -117845119 [email protected]
4 mac 90 89 -671156006 [email protected]
5 ros 25 23 -1224466893 [email protected]
xlsReadWrite: Available for Windows only. It is rather fast but does not support .xlsx files, which is a serious drawback. It has been removed from CRAN lately.
read.table("clipboard"): It allows you to copy data from Excel and read it directly in R. This is the quick-and-dirty R/Excel interaction, but it is very useful in some cases.
myDf <- read.table("clipboard")
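Beyond the packages listed above, one commonly used alternative (not part of the original list, so treat this as an optional suggestion) is the readxl package, which requires neither Perl nor Java:
# Assumes the readxl package is installed; "myfile.xlsx" is the same placeholder file used above
library(readxl)
myDf <- read_excel("myfile.xlsx", sheet = 1)
head(myDf)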

7) Write a R program to illustrate database connectivity [10]
Excel and R integration with the R connector: there are different approaches in R to connect with Excel to perform read, write, and execute activities.
Read Excel spreadsheet in R: multiple packages are available to access an Excel sheet from R.
1. gdata: This package requires you to install additional Perl libraries on Windows platforms, but it is very powerful.
require(gdata)
myDf <- read.xls("myfile.xlsx", sheet = 1, header = TRUE)
2. XLConnect: It might be slow for a large dataset but is very powerful otherwise.
require(XLConnect)
wb <- loadWorkbook("myfile.xlsx")
myDf <- readWorksheet(wb, sheet = "Sheet1", header = TRUE)
3. xlsx: This package requires a Java runtime to install, so it is suitable for Java-supported environments. Prefer read.xlsx2() over read.xlsx(); it is significantly faster for large datasets.
require(xlsx)
read.xlsx2("myfile.xlsx", sheetName = "Sheet1")
Lab activity, example R script:
install.packages("rJava")
install.packages("xlsx")
require(xlsx)
> read.xlsx2("myfile.xlsx", sheetName = "Sheet1")
Sno Sname Marks Attendance Contactno Mailid
1 sri 45 45 988776655 [email protected]
2 vas 78 78 435465768 [email protected]
3 toni 34 46 -117845119 [email protected]
4 mac 90 89 -671156006 [email protected]
5 ros 25 23 -1224466893 [email protected]
xlsReadWrite: Available for Windows only. It is rather fast but does not support .xlsx files, which is a serious drawback. It has been removed from CRAN lately.
read.table("clipboard"): It allows you to copy data from Excel and read it directly in R. This is the quick-and-dirty R/Excel interaction, but it is very useful in some cases.
myDf <- read.table("clipboard")
To go the other way, first create a CSV output from an R data.frame and then read this file in Excel. The one function you need to know is write.table. You might also want to consider write.csv, which uses "." for the decimal point and a comma as the separator, and write.csv2, which uses a comma for the decimal point and a semicolon as the separator.
x <- cbind(rnorm(20), runif(20))
colnames(x) <- c("A","B")
write.table(x, "your_path", sep = ",", row.names = FALSE)
Execute R code in VBA: RExcel is, from my perspective, the best suited tool, but there is at least one alternative. You can run a batch file within the VBA code. If R.exe is in your PATH, the general syntax for the batch file (.bat) is:
R CMD BATCH [options] myRScript.R
The batch file above can then be launched from within the VBA code.
OR 8) Write about stored procedures in NOSQL. Explain with examples [10]
Before we understand NoSQL we will see how SQL is used in R.
SQL using R: sqldf is an R package for running SQL statements on data frames. To load the sqldf package:
library(sqldf)
# Use the titanic data set
data(titanic3, package = "PASWR")
colnames(titanic3)
head(titanic3)
NO SQL: A NoSQL (originally referring to "non SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages.
There have been various approaches to classify NoSQL databases, each with different categories and subcategories, some of which overlap The Benefits of NoSQL When compared to relational databases, NoSQL databases are more scalable and provide superior performance, and their data model addresses several issues that the relational model is not designed to address: • Large volumes of structured, semi-structured, and unstructured data • Agile sprints, quick iteration, and frequent code pushes • Object-oriented programming that is easy to use and flexible • Efficient, scale-out architecture instead of expensive, monolithic architecture

A basic classification based on data model, with examples: • Column:Accumulo, Cassandra, Druid, HBase, Vertica • Document:Clusterpoint, Apache CouchDB, Couchbase, DocumentDB, HyperDex, Lotus Notes, MarkLogic, MongoDB, OrientDB, Qizx • Key-value:CouchDB, Oracle NoSQL Database, Dynamo, FoundationDB, HyperDex, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike, OrientDB, MUMPS • Graph: Allegro, Neo4J, InfiniteGraph, OrientDB, Virtuoso, Stardog • Multi-model:OrientDB, FoundationDB, ArangoDB, Alchemy Database, CortexDB Connecting R to NoSQL databases Lab Activity1: No SQL Example: R script to access a XML file: Step1: Install Packages plyr,XML Step2: Take xml file url Step3: create XML Internal Document type object in R using xmlParse() Step4 :Convert xml object to list by using xmlToList() Step5: convert list object to data frame by using ldply(xl, data.frame) install.packages("XML") install.packages("plyr") > fileurl<-"http://www.w3schools.com/xml/simple.xml" > doc<-xmlParse(fileurl,useInternalNodes=TRUE) > class(doc) [1] "XMLInternalDocument" "XMLAbstractDocument" > doc Belgian Waffles $5.95 Two of our famous Belgian Waffles with plenty of real maple syrup 650 > xl<-xmlToList(doc) > class(xl) [1] "list" > xl $food $food$name [1] "Belgian Waffles" $food$price [1] "$5.95" $food$description [1] "Two of our famous Belgian Waffles with plenty of real maple syrup" $food$calories [1] "650" $food > data<-ldply(xl, data.frame) > head(data) .id name price 1 food Belgian Waffles $5.95 2 food Strawberry Belgian Waffles $7.95 3 food Berry-Berry Belgian Waffles $8.95 4 food French Toast $4.50 5 food Homestyle Breakfast $6.95
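The XML lab above works with non-tabular data stored in a file; to reach an actual NoSQL store such as MongoDB from R, the mongolite package can be used. This is a sketch only: mongolite is not part of the original lab activity, and the connection URL, database, and collection names are illustrative assumptions.
# install.packages("mongolite")   # if not already installed
library(mongolite)
# Assumes a MongoDB server is running locally on the default port
m <- mongo(collection = "cars", db = "testdb", url = "mongodb://localhost")
m$insert(mtcars)        # store the data frame as JSON-like documents
m$find('{"cyl" : 6}')   # query documents back into an R data frame
m$drop()                # remove the example collection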

9 a) What did you understand about SQL using R? Explain [5]
A SQL (relational) database has the following characteristics. There is essentially one type (the SQL database) with minor variations, developed in the 1970s to deal with the first wave of data storage applications; examples are MySQL, Postgres, and Oracle Database. Individual records (e.g., "employees") are stored as rows in tables, with each column storing a specific piece of data about that record (e.g., "manager," "date hired," etc.), much like a spreadsheet. Separate data types are stored in separate tables and then joined together when more complex queries are executed; for example, "offices" might be stored in one table and "employees" in another, and when a user wants to find the work address of an employee, the database engine joins the "employee" and "office" tables together to get all the information necessary. Structure and data types are fixed in advance: to store information about a new data item, the entire database must be altered, during which time the database must be taken offline. The development model is a mix of open-source (e.g., Postgres, MySQL) and closed source (e.g., Oracle Database). Data is manipulated through a specific language using SELECT, INSERT, and UPDATE statements, e.g. SELECT fields FROM table WHERE ..., and the database can be configured for strong consistency.
SQL using R: sqldf is an R package for running SQL statements on data frames. To load the sqldf package and query the titanic data set:
library(sqldf)
# Use the titanic data set
data(titanic3, package = "PASWR")
colnames(titanic3)
head(titanic3)
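A self-contained sketch of sqldf in action is shown below; the small data frame is made up for illustration (reusing the sample student names from the Excel lab above), since the titanic3 data set requires the PASWR package to be installed.
library(sqldf)
students <- data.frame(name  = c("sri", "vas", "toni", "mac", "ros"),
                       marks = c(45, 78, 34, 90, 25))
# An ordinary SQL query run directly against the data frame
sqldf("SELECT name, marks FROM students WHERE marks > 40 ORDER BY marks DESC")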

b)Distinguish between SQL and NoSQL [5]

Difference between SQL and NoSQL Databases:
1. Types
SQL Database: One type (SQL database) with minor variations.
NoSQL Database: Many different types, including key-value stores, document databases, wide-column stores, and graph databases.
2. Development History
SQL Database: Developed in the 1970s to deal with the first wave of data storage applications.
NoSQL Database: Developed in the 2000s to deal with the limitations of SQL databases, particularly concerning scale, replication, and unstructured data storage.
3. Examples
SQL Database: MySQL, Postgres, Oracle Database.
NoSQL Database: MongoDB, Cassandra, HBase, Neo4j.
4. Data Storage Model
SQL Database: Individual records (e.g., "employees") are stored as rows in tables, with each column storing a specific piece of data about that record (e.g., "manager," "date hired," etc.), much like a spreadsheet. Separate data types are stored in separate tables and then joined together when more complex queries are executed. For example, "offices" might be stored in one table and "employees" in another; when a user wants to find the work address of an employee, the database engine joins the "employee" and "office" tables together to get all the information necessary.
NoSQL Database: Varies based on database type. For example, key-value stores function similarly to SQL databases but have only two columns ("key" and "value"), with more complex information sometimes stored within the "value" columns. Document databases do away with the table-and-row model altogether, storing all relevant data together in a single "document" in JSON, XML, or another format, which can nest values hierarchically.
5. Schemas
SQL Database: Structure and data types are fixed in advance. To store information about a new data item, the entire database must be altered, during which time the database must be taken offline.
NoSQL Database: Typically dynamic. Records can add new information on the fly, and unlike SQL table rows, dissimilar data can be stored together as necessary. For some databases (e.g., wide-column stores), it is somewhat more challenging to add new fields dynamically.
6. Scaling
SQL Database: Vertically, meaning a single server must be made increasingly powerful in order to deal with increased demand. It is possible to spread SQL databases over many servers, but significant additional engineering is generally required.
NoSQL Database: Horizontally, meaning that to add capacity, a database administrator can simply add more commodity servers or cloud instances. The database automatically spreads data across servers as necessary.
7. Development Model
SQL Database: Mix of open-source (e.g., Postgres, MySQL) and closed source (e.g., Oracle Database).
NoSQL Database: Open-source.
8. Supports Transactions
SQL Database: Yes, updates can be configured to complete entirely or not at all.
NoSQL Database: Yes, updates can be configured to complete entirely or not at all.
9. Data Manipulation
SQL Database: Specific language using SELECT, INSERT, and UPDATE statements, e.g. SELECT fields FROM table WHERE ...
NoSQL Database: Through object-oriented APIs.
10. Consistency
SQL Database: Can be configured for strong consistency.
NoSQL Database: Depends on the product. Some provide strong consistency (e.g., MongoDB) whereas others offer eventual consistency (e.g., Cassandra).

OR 10) Write short notes on the following a) Professionalism [5] What is professionalism.. • Professionalism is the competence or set of skills that are expected from a professional • Professionalism determines how a person is perceived by his employer, co-workers, and casual contacts. • How long does it take for someone to form an opinion about you? • Studies have proved that it just takes six seconds for a person to form an opinion about another person.

How does someone form an opinion about you.. Eye Contact – Maintaining eye contact with a person or the audience says that you are confident. It says that you are someone who can be trusted and hence can maintain contact with you. Handshake – Grasp the other person‘s hand firmly and shake it a few times. This shows that you are enthusiastic. Posture – Stand straight but not rigid, this will showcase that you are receptive and not very rigid in your thoughts. Clothing – Appropriate clothing says that you are a leader with a winning potential. How to exhibit professionalism.. • Empathy • Positive Attitude • Teamwork • Professional Language

b) Effective communication skills [5] Effective Communication We would probably all agree that effective communication is essential to workplace effectiveness. And yet, we probably don‘t spend much time thinking about how we communicate, and how we might improve our communication skills. The purpose of building communication skills is to achieve greater understanding and meaning between people and to build a climate of trust, openness, and support. To a large degree, getting our work done involves working with other people. And a big part of working well with other people is communicating effectively. Sometimes we just don‘t realize how critical effective communication is to getting the job done. So, let‘s have an experience that reminds us of the importance of effective communication. Actually, this experience is a challenge to achieve a group result without any communication at all! Let‘s give it a shot. What is Effective Communication.. We cannot not communicate. The question is: Are we communicating what we intend to communicate? Does the message we send match the message the other person receives? Impression = Expression Real communication or understanding happens only when the receiver‘s impressionmatches what the sender intended through his or her expression.So the goal of effective communication is a mutual understanding of the message. There are three main forms of Communication: 1. Verbal communication 2. Non verbal communication 3. Written communication Verbal Communication Verbal communication refers to the use of sounds and language to relay a message. It serves as a vehicle for expressing desires, ideas and concepts and is vital to the processes of learning and teaching. In combination with nonverbal forms of communication, verbal communication acts as the primary tool for expression between two or more people Non Verbal Communication How do we communicate without words??? • We communicate a lot to each other outside what we say. • We create confusion when our verbal and nonverbal messages don‘t match &When verbal and nonverbal messages don‘t match, we tend to ―listen‖ to the nonverbal one.

(Intuitively, we generally view others' "body language" as a more reliable indicator of their attitudes and feelings than their words.) • We can learn to read the meanings of nonverbal behaviors. The key is discovering an individual's behavior patterns; there is predictability to their meaning. However, be careful: people can mask their feelings. Also, trying to read something into every movement others make can get in the way of effective interactions. ------END OF UNIT-3------

Model Questions- Unit-4

1. Explain Regression residuals
The residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest. Because a linear regression model is not always appropriate for the data, you should assess the appropriateness of the model by defining residuals and examining residual plots.
Residuals: The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual.
Residual = Observed value - Predicted value, i.e. e = y – ŷ
Both the sum and the mean of the residuals are equal to zero. That is, Σe = 0 and ē = 0.
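A quick sketch of these definitions in R, using the built-in cars data set purely for illustration:
fit <- lm(dist ~ speed, data = cars)   # simple linear regression
e <- resid(fit)                        # residuals e = y - ŷ
sum(e)    # essentially zero, up to rounding error
mean(e)   # also essentially zero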

2. Define Multiple Linear Regression?
Multiple linear regression is an extension of simple linear regression used to predict an outcome variable (y) on the basis of multiple distinct predictor variables (x). The "b" values are called the regression weights (or beta coefficients).
# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data = mydata)
summary(fit)  # show results
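A small worked sketch with the built-in mtcars data (the choice of predictors here is arbitrary and only for illustration):
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)   # three predictor variables
summary(fit)   # regression weights (b values), R-squared, p-values
coef(fit)      # just the estimated coefficients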

3. Explain Auto Correlation

Autocorrelation, also known as serial correlation or cross-autocorrelation, is the cross-correlation of a signal with itself at different points in time; informally, it is the similarity between observations as a function of the time lag between them. It is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. In statistics, the autocorrelation of a random process describes the correlation between values of the process at different times, as a function of the two times or of the time lag.
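In R, the sample autocorrelation of a series can be inspected with the base acf() function; a minimal sketch using the built-in AirPassengers time series:
acf(AirPassengers, lag.max = 24)    # correlogram of autocorrelations at lags 1..24
acf(AirPassengers, plot = FALSE)    # the same values printed instead of plotted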

4. Define Regression Modeling

Regression modeling or analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed

5. Define OLS Regression? OLS:- Ordinary least squares (OLS) or linear least squares is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the differences between the observed responses in some arbitrary dataset and the responses predicted by the linear approximation of the data. This is applied in both simple linear and multiple regression where the common assumptions are (1) The model is linear in the coefficients of the predictor with an additive random error term (2) The random error terms are • normally distributed with 0 mean and • a variance that doesn't change as the values of the predictor covariates (i.e. IVs) change 6. Define Multicollinearity? In statistics, multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regressions may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors

7. How correlation is used for data analysis?  Correlation is defined in terms of the variance of x, the variance of y, and the covariance of x and y (the way the two vary together; the way they co-vary) on the assumption that both variables are normally distributed. • Correlation explains how one or more variables are related to each other. These variables can be input data features which have beenused to forecast our target variable. ... It means that when the value of one variable increases then the value of the other variable(s) also increases. In this way correlation can be used for data analytics.
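For example, in R (the built-in mtcars data set is used here purely for illustration):
cor(mtcars$wt, mtcars$mpg)            # strong negative correlation: heavier cars give lower mileage
cor(mtcars[, c("mpg", "wt", "hp")])   # correlation matrix for several variables
cov(mtcars$wt, mtcars$mpg) / (sd(mtcars$wt) * sd(mtcars$mpg))   # same value from r = cov(x, y)/(sx * sy)
cor(mtcars$wt, mtcars$mpg, method = "spearman")                  # non-parametric (Spearman) alternative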

8. Expand ANOVA and list its unique features  Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences among group means and their associated procedures (such as "variation" among and between groups).  In the ANOVA setting, the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups. ANOVAs are useful for comparing (testing) three or more means (groups or variables) for statistical significance. In R, to perform ANOVA test the built in function is anova()
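A minimal sketch with the built-in PlantGrowth data (plant weight under a control and two treatment groups):
fit <- aov(weight ~ group, data = PlantGrowth)   # one-way ANOVA
summary(fit)                                     # F statistic and p-value
anova(lm(weight ~ group, data = PlantGrowth))    # equivalent ANOVA table via lm() and anova()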

9. What do you mean by Heteroscedasticity?
A collection of random variables is heteroscedastic (or 'heteroskedastic', from Ancient Greek hetero "different" and skedasis "dispersion") if there are sub-populations that have different variabilities from others. Here "variability" could be quantified by the variance or any other measure of statistical dispersion. Thus heteroscedasticity is the absence of homoscedasticity.
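A common informal check in R is to plot residuals against fitted values and look for a funnel shape; a formal check such as the Breusch-Pagan test is available in the lmtest package. This is a sketch assuming lmtest is installed; the cars model is only an illustration.
fit <- lm(dist ~ speed, data = cars)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")  # spread should look roughly constant
abline(h = 0, lty = 2)
# install.packages("lmtest")   # if not already installed
lmtest::bptest(fit)            # Breusch-Pagan test for heteroscedasticity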

10. What did you understand about residuals? Describe shortly.
The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual.
Residual = Observed value - Predicted value, i.e. e = y – ŷ
Both the sum and the mean of the residuals are equal to zero. That is, Σe = 0 and ē = 0.
------Part-B

1) Discuss in detail on Basic regression analysis. Give examples.[10]

Basic Regression Analysis
Regression analysis is the statistical method you use when both the response variable and the explanatory variable are continuous variables (i.e. real numbers with decimal places – things like heights, weights, volumes, or temperatures). In simple regression, we try to determine whether there is a relationship between two variables. It is assumed that there is a high degree of correlation between the two variables chosen for use in regression. In R we use the lm() function to do simple regression modeling. For example,
> fit <- lm(data$petal_length ~ data$petal_width)
When we call "fit" as below
> fit
we get the intercept "C" and the slope "m" of the equation Y = mX + C. The fit information displays four charts: Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage. Below are the various graphs representing values of regression.
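The four charts mentioned above are R's standard diagnostic plots for a fitted lm object. A minimal sketch is shown below; it uses the built-in iris data with its standard column names, since the lower-case data object in the example above is not defined here.
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(fit)              # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage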

OLS Regression OLS:- Ordinary least squares (OLS) or linear least squares is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the differences between the observed responses in some arbitrary dataset and the responses predicted by the linear approximation of the data. This is applied in both simple linear and multiple regression where the common assumptions are (1) The model is linear in the coefficients of the predictor with an additive random error term (2) The random error terms are normally distributed with 0 mean and a variance that doesn't change as the values of the predictor covariates (i.e. IVs) change Regression Modeling Regression modeling or analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function which can be described by a probability distribution.

Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However this can lead to illusions or false relationships, so caution is advisable; for example, correlation does not imply causation

Many techniques for carrying out regression analysis have been developed. Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional.

The performance of regression analysis methods in practice depends on the form of the data generating process, and how it relates to the regression approach being used. Since the true form of the data-generating process is generally not known, regression analysis often depends to some extent on making assumptions about this process. These assumptions are sometimes testable if a sufficient quantity of data is available. Regression models for prediction are often useful even when the assumptions are moderately violated, although they may not perform optimally. However, in many applications, especially with small effects or questions of causality based on observational data, regression methods can give misleading results.

In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. The case of a continuous output variable may be more specifically referred to as metric regression to distinguish it from related problems

OR 2) What is Autocorrelation and Multicollinearity ? Explain [10] Autocorrelation Autocorrelation, also known as serial correlation or cross-autocorrelation, is the cross-correlation of a signal with itself at different points in time (that is what the cross stands for). Informally, it is the similarity between observations as a function of the time lag between them. It is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals. In statistics, the autocorrelation of a random process describes the correlation between values of the process at different times, as a function of the two times or of the time lag. Let X be some repeatable process, and i be some point in time after the start of that process. (i may be an integer for a discrete-time process or a real number for a continuous-time process.) Then Xi is the value (or realization) produced by a given run of the process at time i. Suppose that the process is further known to have defined values for mean μi and variance σi2 for all times i. Then the definition of the autocorrelation between times s and t is

where "E" is the expected value operator Test: - The traditional test for the presence of first-order autocorrelation is the Durbin–Watson statistic or, if the explanatory variables include a lagged dependent variable, Durbin's h statistic. The Durbin- Watson can be linearly mapped however to the Pearson correlation between values and their lags.

A more flexible test, covering autocorrelation of higher orders and applicable whether or not the regressors include lags of the dependent variable, is the Breusch–Godfrey test. This involves an auxiliary regression, wherein the residuals obtained from estimating the model of interest are regressed on (a) the original regressors and (b) k lags of the residuals, where k is the order of the test. The simplest version of the test statistic from this auxiliary regression is TR², where T is the sample size and R² is the coefficient of determination. Under the null hypothesis of no autocorrelation, this statistic is asymptotically distributed as χ² with k degrees of freedom.
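Both tests are available in R through the lmtest package; the sketch below assumes lmtest is installed and uses an arbitrary model on the built-in cars data for illustration.
# install.packages("lmtest")   # if not already installed
library(lmtest)
fit <- lm(dist ~ speed, data = cars)
dwtest(fit)              # Durbin-Watson test for first-order autocorrelation
bgtest(fit, order = 2)   # Breusch-Godfrey test up to order k = 2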

Multicollinearity
In statistics, multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others. In case of perfect multicollinearity the predictor matrix is singular and therefore cannot be inverted. Under these circumstances, for a general linear model y = Xβ + ε, the ordinary least-squares estimator (X'X)⁻¹X'y does not exist because X'X cannot be inverted.
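One common indicator, discussed in the list below, is the variance inflation factor VIF_j = 1 / (1 - R_j²), where R_j² is the coefficient of determination from regressing predictor j on all the other predictors. It can be computed with the car package; this is a sketch assuming car is installed, with an arbitrary mtcars model for illustration.
# install.packages("car")   # if not already installed
fit <- lm(mpg ~ wt + disp + hp, data = mtcars)
car::vif(fit)   # values above roughly 5-10 are usually taken to indicate problematic multicollinearity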

Test:- Indicators that multicollinearity may be present in a model: 1) Large changes in the estimated regression coefficients when a predictor variable is added or deleted 2) Insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the joint hypothesis that those coefficients are all zero (using an F-test) 3) If a multivariable regression finds an insignificant coefficient of a particular explanator, yet a simple linear regression of the explained variable on this explanatory variable shows its coefficient to be significantly different from zero, this situation indicates multicollinearity in the multivariable regression. 4) Some authors have suggested a formal detection-tolerance or the variance inflation factor (VIF for multicollinearity Where is the coefficient of determination of a regression of explanator j on all the other explanators. A tolerance of less than 0.20 or 0.10 and/or a VIF of 5 or 10 and above indicates a multicollinearity problem 5) Condition number test: The standard measure of ill-conditioning in a matrix is the condition index. It will indicate that the inversion of the matrix is numerically unstable with finite-precision numbers (standard computer floats and doubles). This indicates the potential sensitivity of the computed inverse to small changes in the original matrix. The Condition Number is computed by finding the square root of (the maximum eigenvalue divided by the minimum eigenvalue). If the Condition Number is above 30, the regression may have significant multicollinearity; multicollinearity exists if, in addition, two or more of the variables related to the high condition number have high proportions of variance explained. One advantage of this method is that it also shows which variables are causing the problem. 6) Farrar–Glauber test: If the variables are found to be orthogonal, there is no multicollinearity; if the variables are not orthogonal, then multicollinearity is present. C. Robert Wichers has argued that Farrar–Glauber partial correlation test is ineffective in that a given partial correlation may be compatible with different multicollinearity patterns. The Farrar–Glauber test has also been criticized by other researchers. 7) Perturbing the data. Multicollinearity can be detected by adding random noise to the data and re- running the regression many times and seeing how much the coefficients change. 8) Construction of a correlation matrix among the explanatory variables will yield indications as to the likelihood that any given couplet of right-hand-side variables is creating multicollinearity problems. Correlation values (off-diagonal elements) of at least .4 are sometimes interpreted as indicating a multicollinearity problem. This procedure is, however, highly problematic and cannot be recommended. Intuitively, correlation describes a bivariate relationship, whereas collinearity is a multivariate phenomenon ------3). Explain about Multiple Linear Regression with a suitable example [10] Introduction to Multiple Regression: The multiple regression is the relationship between several independent or predictor variables and a dependent or criterion variable. For example: A real estate agent might record for each listing the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of appeal of the house. 
Once this information has been compiled for various houses it would be interesting to see whether and how these measures relate to the price for which a house is sold. For example, you might learn that the number of bedrooms is a better predictor of the price for which a house sells in a particular neighborhood than how "pretty" the house is (subjective rating). Lab Activity: Multiple Regression model on Iris Data set Step1: Subset the numeric data from iris dataset Step2: Find the correlation among all variables Step3: Find the formula based on highly correlated variables Step4: call the glm() >iris1<-iris[1:4] > cor(iris1) Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411 Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259 Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654 Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000 Hence (sepal length, petal length),(petal width, sepal length) and (petal length ,petal width) are showing high correlation. > glm(formula=iris$Petal.Length~iris$Petal.Width+iris$Sepal.Length) Call: glm(formula = iris$Petal.Length ~ iris$Petal.Width + iris$Sepal.Length) Coefficients: (Intercept) iris$Petal.Width iris$Sepal.Length -1.5071 1.7481 0.5423 Degrees of Freedom: 149 Total (i.e. Null); 147 Residual Null Deviance: 464.3 Residual Deviance: 23.9 AIC: 158.2 Hence the formula found from model is iris$Petal.Length = 1.7481*iris$Petal.Width + 0.5423*iris$Sepal.Length-1.5071 study2:Multiple linear regression on MS application data

The p-value is less than 0.05, which means we reject the null hypothesis. The degrees of freedom are 142. For other examples use the link: http://www.ats.ucla.edu/stat/r/dae/rreg.htm and also refer to the book Practical Regression and Anova using R. Now, to create a linear model of the effect of Body Weight and Sex on Heart Weight, we use multiple regression modeling.

So we can say that 65% variation in Heart Weight can be explained by the model. The equation becomes y=4.07x-0.08y-0.41 Dummy Variables: In regression analysis, a dummy variable (also known as an indicator variable, design variable, Boolean indicator, categorical variable, binary variable, or qualitative variable) is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. Dummy variables are used as devices to sort data into mutually exclusive categories (such as smoker/non-smoker, etc.). In other words, Dummy variables are "proxy" variables or numeric stand-ins for qualitative facts in a regression model. In regression analysis, the dependent variables may be influenced not only by quantitative variables (income, output, prices, etc.), but also by qualitative variables (gender, religion, geographic region, etc.). A dummy independent variable (also called a dummy explanatory variable) which for some observation has a value of 0 will cause that variable's coefficient to have no role in influencing the dependent variable, while when the dummy takes on a value 1 its coefficient acts to alter the intercept OR 4). Explain the concept of Basic Regression Analysis with a suitable example [10] Regression analysis is the statistical method you use when both the response variable and the explanatory variable are continuous variables (i.e. real numbers with decimal places – things like heights, weights, volumes, or temperatures). Regression modeling or analysis is a statistical process for estimating the relationships among variables. The main focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors').The value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. The simple linear equation Y=mX+C , intercept ―C‖ and the slope ―m‖ . The below plot shows the linear regression

In Regression Model, both the response variable and the explanatory variable are continuous variables (i.e. real numbers with decimal places – things like heights, weights, volumes, or temperatures). In simple regression; we can determine a relationship between two variables. It is assumed that there is a high degree of correlation between the two variables chosen for use in regression. In R , lm () function to do simple regression modeling Linear Regression for finding the relation between petal length and petal width in IRIS dataset: > fit <- lm(iris$Petal.Length ~ iris$Petal.Width) >fit Call: lm(formula = iris$Petal.Length ~ iris$Petal.Width) Coefficients: (Intercept) iris$Petal.Width 1.084 2.230 We get the intercept ―C‖ and the slope ―m‖ of the equation – Y=mX+C. Here m=2.230 and C=1.084 now we found the linear equation between petal length and petal width is iris$Petal.Length=2.230* iris$Petal.Width+1.084 Visualization of fit data: The fit information displays four charts: Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage. Below are the various graphs representing values of regression

5 a)Explain Regression Modelling. [5] Regression Modeling: Regression modeling or analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). Understand influence of changes in dependent variable: More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables, i.e the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function which can be described by a probability distribution. Estimation of continuous response variables: Regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. The case of a continuous output variable may be more specifically referred to as metric regression to distinguish it from related problems Regression analysis uses: It is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However this can lead to illusions or false relationships, so caution is advisable; for example, correlation does not imply causation. Parametric and non-parametric regression: Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional. Performance of regression analysis : The performance of regression analysis methods in practice depends on the form of the data generating process, and how it relates to the regression approach being used. Since the true form of the data-generating process is generally not known, regression analysis often depends to some extent on making assumptions about this process. These assumptions are sometimes testable if a sufficient quantity of data is available. Regression models for prediction are often useful even when the assumptions are moderately violated, although they may not perform optimally

b)Explain in detail about Regression residuals [5] Regression residuals: The residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest. Because a linear regression model is not always appropriate for the data, assess the appropriateness of the model by defining residuals and examining residual plots. Residuals: The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual. Residual = Observed value - Predicted value e = y – ŷ x 60 70 80 85 95 y 70 65 70 95 85 ŷ 65.411 71.849 78.288 81.507 87.945 e 4.589 -6.849 -8.288 13.493 -2.945 Both the sum and the mean of the residuals are equal to zero. That is, Σ e = 0 and e = 0. The above table shows inputs and outputs from a simple linear regression analysis. Residual Plots: A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate How to find Residuals and plot them? ##Finding Residuals examples x=c(21,34,6,47,10,49,23,32,12,16,29,49,28,8,57,9,31,10,21,26,31,52,21,8,18,5,18,26, 27,26,32,2,59,58,19,14,16,9,23,28,34,70,69,54,39,9,21,54,26) y = c(47,76,33,78,62,78,33,64,83,67,61,85,46,53,55,71,59,41,82,56,39,89,31,43, 29,55, 81,82,82,85,59,74,80,88,29,58,71,60,86,91,72,89,80,84,54,71,75,84,79) m1 <- lm(y~x) #Create a linear model resid(m1) #List of residuals > resid(m1) #List of residuals OR 6) Explain the concept of Basic Regression Analysis with a suitable example [10]

Basic Regression Analysis
Regression analysis is the statistical method you use when both the response variable and the explanatory variable are continuous variables (i.e. real numbers with decimal places – things like heights, weights, volumes, or temperatures). In simple regression, we try to determine whether there is a relationship between two variables. It is assumed that there is a high degree of correlation between the two variables chosen for use in regression. In R we use the lm() function to do simple regression modeling. For example,
> fit <- lm(data$petal_length ~ data$petal_width)
When we call "fit" as below
> fit
we get the intercept "C" and the slope "m" of the equation Y = mX + C. The fit information displays four charts: Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage. Below are the various graphs representing values of regression.

OLS Regression OLS:- Ordinary least squares (OLS) or linear least squares is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the differences between the observed responses in some arbitrary dataset and the responses predicted by the linear approximation of the data. This is applied in both simple linear and multiple regression where the common assumptions are (1) The model is linear in the coefficients of the predictor with an additive random error term (2) The random error terms are normally distributed with 0 mean and a variance that doesn't change as the values of the predictor covariates (i.e. IVs) change Regression Modeling  Regression modeling or analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function which can be described by a probability distribution.  Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However this can lead to illusions or false relationships, so caution is advisable; for example, correlation does not imply causation Many techniques for carrying out regression analysis have been developed. Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional.  The performance of regression analysis methods in practice depends on the form of the data generating process, and how it relates to the regression approach being used. Since the true form of the data-generating process is generally not known, regression analysis often depends to some extent on making assumptions about this process. These assumptions are sometimes testable if a sufficient quantity of data is available. Regression models for prediction are often useful even when the assumptions are moderately violated, although they may not perform optimally. 
However, in many applications, especially with small effects or questions of causality based on observational data, regression methods can give misleading results.  In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. The case of a continuous output variable may be more specifically referred to as metric regression to distinguish it from related problems 7) Explain regression analysis. Give example [10] Regression Analysis:  Regression modeling or analysis is a statistical process for estimating the relationships among variables. The main focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors').The value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Regression analysis is a very widely used statistical tool to establish a relationship model between two variables. One of these variable is called predictor variable whose value is gathered through experiments. The other variable is called response variable whose value is derived from the predictor variable.

 In Linear Regression these two variables are related through an equation, where exponent (power) of both these variables is 1. Mathematically a linear relationship represents a straight line when plotted as a graph. A non-linear relationship where the exponent of any variable is not equal to 1 creates a curve. The general mathematical equation for a linear regression is – y = mx + c Following is the description of the parameters used − y is the response variable. x is the predictor variable. m(slope) and c(intercept) are constants which are called the coefficients. In R , lm () function to do simple regression modeling. The simple linear equation Y=mX+C , intercept ―C‖ and the slope ―m‖ . The below plot shows the linear regression Visualization of fit data: The fit information displays four charts: Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage. Assumptions of OLS Regression Ordinary least squares (OLS) Method: Ordinary least squares (OLS) or linear least squares is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the differences between the observed responses and the predicted responses by the linear approximation of the data. Assumptions of regression modeling: For both simple linear and multiple regressions where the common assumptions are a) The model is linear in the coefficients of the predictor with an additive random error term b) The random error terms are • Normally distributed with 0 mean and • A variance that doesn't change as the values of the predictor covariates change Regression Modeling: Regression modeling or analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). Understand influence of changes in dependent variable: More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables, i.e the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function which can be described by a probability distribution. • Estimation of continuous response variables: Regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. The case of a continuous output variable may be more specifically referred to as metric regression to distinguish it from related problems. Regression analysis uses: It is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. 
Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms nof these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However this can lead to illusions or false relationships, so caution is advisable; for example, correlation does not imply causation. • Parametric and non-parametric regression: Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. • Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional. • Performance of regression analysis : The performance of regression analysis methods in practice depends on the form of the data generating process, and how it relates to the regression approach being used. Since nthe true form of the data-generating process is generally not known, regression analysis often depends to some extent on making assumptions about this process. These assumptions are sometimes testable if a sufficient quantity of data is available. Regression models for prediction are often useful even when the assumptions are moderately violated, although they may not perform optimally OR 8 a)Justify the importance of correlation for prediction. [5] The Correlation is a measure of association between two variables. Correlations are Positive and negative which are ranging between +1 and -1. • Positive correlation (0 to +1) example: Earning and expenditure • Negative correlation (-1 to 0) example : Speed and time In R , correlation between x and y is by using cor(x,y) function •Measure of association between variables •Positive and negative correlation, ranging between +1 and -1 •Positive correlation example: •Earning and expenditure •Negative correlation example •Speed and time •Parametric – normal distribution and homogenous variance •Pearson correlation •Non parametric – no assumptions, nominal variables •Spearman correlation Correlation and Covariance: With two continuous variables, x and y, the question naturally arises as to whether their values are correlated with each other (remembering, of course, that correlation does not imply causation). Correlation is defined in terms of the variance of x, the variance of y, and the covariance of x and y (the way the two vary together; the way they co-vary) on the assumption that both variables are normally distributed. We have symbols already for the two variances, s2x and s2y. We denote the covariance of x and y by cov(x, y), after which the correlation coefficient r is defined as b)Explain OLS regression [5] OLS Regression OLS:- Ordinary least squares (OLS) or linear least squares is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the differences between the observed responses in some arbitrary dataset and the responses predicted by the linear approximation of the data. This is applied in both simple linear and multiple regression where the common assumptions are (1) The model is linear in the coefficients of the predictor with an additive random error term (2) The random error terms are • normally distributed with 0 mean and • a variance that doesn't change as the values of the predictor covariates (i.e. 
IVs) change Regression Modeling • Regression modeling or analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function which can be described by a probability distribution.

• Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusory or false relationships, so caution is advisable; for example, correlation does not imply causation.

Many techniques for carrying out regression analysis have been developed. Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional. The performance of regression analysis methods in practice depends on the form of the data-generating process, and how it relates to the regression approach being used. Since the true form of the data-generating process is generally not known, regression analysis often depends to some extent on making assumptions about this process. These assumptions are sometimes testable if a sufficient quantity of data is available. Regression models for prediction are often useful even when the assumptions are moderately violated, although they may not perform optimally. However, in many applications, especially with small effects or questions of causality based on observational data, regression methods can give misleading results.

In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. The case of a continuous output variable may be more specifically referred to as metric regression to distinguish it from related problems.
9) Discuss in detail Basic regression analysis. Give examples. [10]
Basic Regression Analysis
Regression analysis is the statistical method you use when both the response variable and the explanatory variable are continuous variables (i.e. real numbers with decimal places – things like heights, weights, volumes, or temperatures). In simple regression, we try to determine whether there is a relationship between two variables. It is assumed that there is a high degree of correlation between the two variables chosen for use in regression. In R we use the lm() function to do simple regression modeling. For example,
> fit <- lm(data$petal_length ~ data$petal_width)
When we call "fit" as below
> fit
we get the intercept "C" and the slope "m" of the equation Y = mX + C. Plotting the fitted model produces four diagnostic charts: Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage, as illustrated in the sketch below.
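A runnable sketch of the same workflow, using R's built-in iris data set as a stand-in for the data frame called "data" in the answer above (the column names differ, and the prediction value is an arbitrary illustration):

fit <- lm(Petal.Length ~ Petal.Width, data = iris)     # simple OLS regression
summary(fit)            # coefficients, R-squared, residual standard error
par(mfrow = c(2, 2))    # arrange the four diagnostic plots in a 2 x 2 grid
plot(fit)               # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
predict(fit, newdata = data.frame(Petal.Width = 1.5))  # predicted petal length for a new width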

OLS Regression OLS:- Ordinary least squares (OLS) or linear least squares is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the differences between the observed responses in some arbitrary dataset and the responses predicted by the linear approximation of the data. This is applied in both simple linear and multiple regression where the common assumptions are (1) The model is linear in the coefficients of the predictor with an additive random error term (2) The random error terms are • normally distributed with 0 mean and • a variance that doesn't change as the values of the predictor covariates (i.e. IVs) change

OR 10) What are Autocorrelation and Multicollinearity? Explain [10]

Autocorrelation
Autocorrelation, also known as serial correlation or cross-autocorrelation, is the cross-correlation of a signal with itself at different points in time (that is what the "cross" stands for). Informally, it is the similarity between observations as a function of the time lag between them. It is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time-domain signals. In statistics, the autocorrelation of a random process describes the correlation between values of the process at different times, as a function of the two times or of the time lag. Let X be some repeatable process, and i be some point in time after the start of that process. (i may be an integer for a discrete-time process or a real number for a continuous-time process.) Then X_i is the value (or realization) produced by a given run of the process at time i. Suppose that the process is further known to have defined values for mean μ_i and variance σ_i² for all times i. Then the definition of the autocorrelation between times s and t is
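Written out, the standard form of this definition is

\[ R(s,t) \;=\; \frac{E\left[(X_t - \mu_t)(X_s - \mu_s)\right]}{\sigma_t\,\sigma_s} \]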

where "E" is the expected value operator Test: - • The traditional test for the presence of first-order autocorrelation is the Durbin–Watson statistic or, if the explanatory variables include a lagged dependent variable, Durbin's h statistic. The Durbin-Watson can be linearly mapped however to the Pearson correlation between values and their lags.

• A more flexible test, covering autocorrelation of higher orders and applicable whether or not the regressors include lags of the dependent variable, is the Breusch–Godfrey test. This involves an auxiliary regression, wherein the residuals obtained from estimating the model of interest are regressed on (a) the original regressors and (b) k lags of the residuals, where k is the order of the test. The simplest version of the test statistic from this auxiliary regression is TR², where T is the sample size and R² is the coefficient of determination. Under the null hypothesis of no autocorrelation, this statistic is asymptotically distributed as χ² with k degrees of freedom.
Multicollinearity
In statistics, multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others. In the case of perfect multicollinearity the predictor matrix is singular and therefore cannot be inverted. Under these circumstances, for a general linear model y = Xβ + ε, the ordinary least squares estimator does not exist.
Test: Indicators that multicollinearity may be present in a model:
1) Large changes in the estimated regression coefficients when a predictor variable is added or deleted.
2) Insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the joint hypothesis that those coefficients are all zero (using an F-test).
3) If a multivariable regression finds an insignificant coefficient for a particular explanator, yet a simple linear regression of the explained variable on this explanatory variable shows its coefficient to be significantly different from zero, this situation indicates multicollinearity in the multivariable regression.
------END OF UNIT-4------

Model Questions- Unit-5 Part-A 1. What is meant by Manufacturing?  Manufacturing is the production of merchandise for use or sale using labour and machines, tools, chemical and biological processing, or formulation. The term may refer to a range of human activity, from handicraft to high tech, but is most commonly applied to industrial production, in which raw materials are transformed into finished goods on a large scale. Such finished goods may be used for manufacturing other, more complex products, such as aircraft, household appliances or automobiles, or sold to wholesalers, who in turn sell them to retailers, who then sell them to end users – the "consumers"

2. Explain Smart Utilities  S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often written as SMART) is a monitoring system included in computer hard disk drives (HDDs) and solid-state drives (SSDs) that detects and reports on various indicators of drive reliability, with the intent of enabling the anticipation of hardware failures 3. What are the problems related to a project? The following are examples of project issues.  Poor Quality. Deliverables are low quality causing delays or rejection of deliverables.  Scope Creep. ...  Change Management. ...  Benefit Shortfall. ...  Design Issues. ...  Integration. ...  Technical Issues. ...  Resistance to Change

4. Explain about Engineering Design  The engineering design process is a methodical series of steps that engineers use in creating functional products and processes. The process is highly iterative - parts of the process often need to be repeated many times before production phase can be entered - though the part(s) that get iterated and the number of such cycles in any given project can be highly variable 5. What are problems related to a Business?

• Uncertainty about the future. ... • Financial management. ... • Monitoring performance. ... • Regulation and compliance. ... • Competencies and recruiting the right talent. ... • Technology. ... • Exploding data. ... • Customer service
6. What type of data analysis is feasible for automotive department activities? The automotive industry comprises a wide range of companies and organizations involved in the design, development, manufacturing, marketing, and selling of motor vehicles; some of them are called automakers. It is one of the world's most important economic sectors by revenue. The automotive industry does not include industries dedicated to the maintenance of automobiles following delivery to the end-user, such as automobile repair shops and motor fuel filling stations

7. Name different business problems related to various businesses.

• Uncertainty about the future, Financial management. ... • Monitoring performance, Regulation and compliance. ... • Competencies and recruiting the right talent, Technology. ... • Exploding data, Customer service
9. How do you understand the business problem related to engineering?  The BA process can solve problems and identify opportunities to improve business performance. In the process, organizations may also determine strategies to guide operations and help achieve competitive advantages. Typically, solving problems and identifying strategic opportunities to follow are organization decision-making tasks. The latter, identifying opportunities, can be viewed as a problem of strategy choice requiring a solution.
4. Write short notes on Smart Utilities: S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often written as SMART) is a monitoring system included in computer hard disk drives (HDDs) and solid-state drives (SSDs) that detects and reports on various indicators of drive reliability, with the intent of enabling the anticipation of hardware failures

10. What is Tech System?

The application of scientific knowledge for practical purposes, especially in industry. Technology can be the knowledge of techniques, processes, and the like, or it can be embedded in machines which can be operated without detailed knowledge of their workings. The human species' use of technology began with the conversion of natural resources into simple tools. Technology has many effects. It has helped develop more advanced economies (including today's global economy) and has allowed the rise of a leisure class. ------

Part-B 1) Explain SMART utilities in detail [10] S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often written as SMART) is a monitoring system included in computer hard disk drives (HDDs) and solid-state drives (SSDs) that detects and reports on various indicators of drive reliability, with the intent of enabling the anticipation of hardware failures. When S.M.A.R.T. data indicates a possible imminent drive failure, software running on the host system may notify the user so stored data can be copied to another storage device, preventing data loss, and the failing drive can be replaced Understand the business problem related to engineering, Identify the critical issues. Set business objectives. The BA process can solve problems and identify opportunities to improve business performance. In the process, organizations may also determine strategies to guide operations and help achieve competitive advantages. Typically, solving problems and identifying strategic opportunities to follow are organization decision-making tasks. The latter, identifying opportunities can be viewed as a problem of strategy choice requiring a solution

OR 2) Why should we gather requirements before starting the implementation of the project? Explain [10]

Requirements gathering: Gather all the data related to the business objective. There are many different approaches that can be used to gather information about a business. They include the following:
• Review business plans, existing models and other documentation
• Interview subject area experts
• Conduct fact-finding meetings
• Analyze application systems, forms, artifacts, reports, etc.

The business analyst should use one-on-one interviews early in the business analysis project to gauge the strengths and weaknesses of potential project participants and to obtain basic information about the business. Large meetings are not a good use of time for data gathering. Facilitated work sessions are a good mechanism for validating and refining "draft" requirements. They are also useful to prioritize final business requirements. Group dynamics can often generate even better ideas. Primary or local data is collected by the business owner and can be collected by survey, focus group or observation. Third-party static data is purchased in bulk without a specific intent in mind. While easy to get (if you have the cash), this data is not specific to your business and can be tough to sort through, as you often get quite a bit more data than you need to meet your objective. Dynamic data is collected through a third-party process in near real-time from an event for a specific purpose (read into that: very expensive). There are three key questions you need to ask before making a decision about the best method for your firm:
• What is the timeline required to accomplish your business objective?
• What is your required return on investment?
• Is the data collection for a stand-alone event or part of a broader data collection effort?

How to interpret Data to make it useful for Business Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes. BI technologies are capable of handling large amounts of unstructured data to help identify, develop and otherwise create new strategic business opportunities. The goal of BI is to allow for the easy interpretation of these large volumes of data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability. BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics. BI can be used to support a wide range of business decisions ranging from operational to strategic. Basic operating decisions include product positioning or pricing. Strategic business decisions include priorities, goals and directions at the broadest level. In all cases, BI is most effective when it combines data derived from the market in which a company operates (external data) with data from company sources internal to the business such as financial and operations data (internal data). When combined, external and internal data can provide a more complete picture which, in effect, creates an "intelligence" that cannot be derived by any singular set of data Business intelligence is made up of an increasing number of components including

• Multidimensional aggregation and allocation • Denormalization, tagging and standardization • Realtime reporting with analytical alert • A method of interfacing with unstructured data sources • Group consolidation, budgeting and rolling forecasts • Statistical inference and probabilistic simulation • Key performance indicators optimization • Version control and process management • Open item management

3). Explain the concept of Understanding systems and Engineering Design[10] Engineering Design: The engineering design process is a methodical series of steps that engineers use in creating functional products and processes. The process is highly iterative - parts of the process often need to be repeated many times before production phase can be entered - though the part(s) that get iterated and the number of such cycles in any given project can be highly variable. One framing of the engineering design process delineates the following stages: research, conceptualization, feasibility assessment, establishing design requirements, preliminary design, detailed design, production planning and tool design, and production

Manufacturing: Manufacturing is the production of merchandise for use or sale using labour and machines, tools, chemical and biological processing, or formulation. The term may refer to a range of human activity, from handicraft to high tech, but is most commonly applied to industrial production, in which raw materials are transformed into finished goods on a large scale. Such finished goods may be used for manufacturing other, more complex products, such as aircraft, household appliances or automobiles, or sold to wholesalers, who in turn sell them to retailers, who then sell them to end users – the "consumers". The manufacturing sector is closely connected with engineering and industrial design. Examples of manufacturers: North America: General Motors Corporation, General Electric, Procter & Gamble, General Dynamics, Boeing, Pfizer, and Precision Castparts. Europe: Volkswagen Group, Siemens, and Michelin. Asia: Sony, Huawei, Lenovo, Toyota, Samsung, and Bridgestone.
OR 4) Comparison of business analytics and organization decision-making processes with a flow chart in detail [10]
The BA process can solve problems and identify opportunities to improve business performance. In this process, organizations may also determine strategies to guide operations and help to achieve competitive advantages. Typically, solving problems and identifying strategic opportunities to follow are organization decision-making tasks. The latter, identifying opportunities, can be viewed as a problem of strategy choice requiring a solution.

Business intelligence (BI): Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes. BI technologies are capable of handling large amounts of unstructured data to help identify, develop and otherwise create new strategic business opportunities. The goal of BI is to allow for the easy interpretation of these large volumes of data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability. BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics. BI can be used to support a wide range of business decisions ranging from operational to strategic. Basic operating decisions include product positioning or pricing. Strategic business decisions include priorities, goals and directions at the broadest level. In all cases, BI is most effective when it combines data derived from the market in which a company operates (external data) with data from company sources internal to the business such as financial and operations data (internal data). When combined, external and internal data can provide a more complete picture which, in effect, creates an "intelligence" that cannot be derived by any singular set of data.
Purpose of Business Intelligence: Business intelligence can be applied to the following business purposes, in order to drive business value.
• Measurement: program that creates a hierarchy of performance metrics (see also Metrics Reference Model) and benchmarking that informs business leaders about progress towards business goals (business process management).
• Analytics: program that builds quantitative processes for a business to arrive at optimal decisions and to perform business knowledge discovery. Frequently involves data mining, process mining, statistical analysis, predictive analytics, predictive modeling, business process modeling, data lineage, complex event processing and prescriptive analytics.
• Reporting/enterprise reporting: program that builds infrastructure for strategic reporting to serve the strategic management of a business, not operational reporting. Frequently involves data visualization, executive information systems and OLAP.
• Collaboration/collaboration platform: program that gets different areas (both inside and outside the business) to work together through data sharing and electronic data interchange.
• Knowledge management: program to make the company data-driven through strategies and practices to identify, create, represent, distribute, and enable adoption of insights and experiences that are true business knowledge. Knowledge management leads to learning management and regulatory compliance.
5) Discuss the comparison of business analytics and organization decision-making processes with a flow chart [10]
The business analyst should use one-on-one interviews early in the business analysis project to gauge the strengths and weaknesses of potential project participants and to obtain basic information about the business. Large meetings are not a good use of time for data gathering.
Facilitated work sessions are a good mechanism for validating and refining "draft" requirements. They are also useful to prioritize final business requirements. Group dynamics can often generate even better ideas. Primary or local data is collected by the business owner and can be collected by survey, focus group or observation. Third-party static data is purchased in bulk without a specific intent in mind. While easy to get (if you have the cash), this data is not specific to your business and can be tough to sort through, as you often get quite a bit more data than you need to meet your objective. Dynamic data is collected through a third-party process in near real-time from an event for a specific purpose (read into that: very expensive). There are three key questions you need to ask before making a decision about the best method for your firm:
• What is the timeline required to accomplish your business objective?
• What is your required return on investment?
• Is the data collection for a stand-alone event or part of a broader data collection effort?
In the process, organizations may also determine strategies to guide operations and help achieve competitive advantages. Typically, solving problems and identifying strategic opportunities to follow are organization decision-making tasks. The latter, identifying opportunities, can be viewed as a problem of strategy choice requiring a solution.

Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes. BI technologies are capable of handling large amounts of unstructured data to help identify, develop and otherwise create new strategic business opportunities. The goal of BI is to allow for the easy interpretation of these large volumes of data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability. BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics. BI can be used to support a wide range of business decisions ranging from operational to strategic.

Basic operating decisions include product positioning or pricing. Strategic business decisions include priorities, goals and directions at the broadest level. In all cases, BI is most effective when it combines data derived from the market in which a company operates (external data) with data from company sources internal to the business such as financial and operations data (internal data). When combined, external and internal data can provide a more complete picture which, in effect, creates an "intelligence" that cannot be derived by any singular set of data.
OR 6) What is meant by Manufacturing? Explain Smart Utilities [10]
Manufacturing: Manufacturing is the production of goods for use or sale using labour and machines, tools, chemical and biological processing, or formulation. The term may refer to a range of human activity, from handicraft to high tech, but is most commonly applied to industrial production, in which raw materials are transformed into finished goods on a large scale. Such finished goods may be used for manufacturing other, more complex products, such as aircraft, household appliances or automobiles, or sold to wholesalers, who in turn sell them to retailers, who then sell them to end users – the "consumers". Manufacturing takes place under all types of economic systems. In a free market economy, manufacturing is usually directed toward the mass production of products for sale to consumers at a profit. In a collectivist economy, manufacturing is more frequently directed by the state to supply a centrally planned economy. In mixed market economies, manufacturing occurs under some degree of government regulation. Modern manufacturing includes all intermediate processes required for the production and integration of a product's components. Some industries, such as semiconductor and steel manufacturers, use the term fabrication instead. The manufacturing sector is closely connected with engineering and industrial design. Examples of major manufacturers in North America include General Motors Corporation, General Electric, Procter & Gamble, General Dynamics, Boeing, Pfizer, and Precision Castparts. Examples in Europe include Volkswagen Group, Siemens, and Michelin. Examples in Asia include Sony, Huawei, Lenovo, Toyota, Samsung, and Bridgestone.
Smart Utilities: S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often written as SMART) is a monitoring system included in computer hard disk drives (HDDs) and solid-state drives (SSDs) that detects and reports on various indicators of drive reliability, with the intent of enabling the anticipation of hardware failures. When S.M.A.R.T. data indicates a possible imminent drive failure, software running on the host system may notify the user so stored data can be copied to another storage device, preventing data loss, and the failing drive can be replaced. Smart Utility Systems is a leading provider of Software-as-a-Service (SaaS) solutions for Customer Engagement, Mobile Workforce, and Big Data Analytics to the Energy and Utility sector. It helps utilities improve their operational efficiency and maximize revenue realization through mobile and cloud technologies.

7) Write a short note on understanding engineering design and manufacturing [10]
The engineering design process is a series of steps that engineers follow to come up with a solution to a problem. Many times the solution involves designing a product (like a machine or computer code) that meets certain criteria and/or accomplishes a certain task. The engineering design process is a methodical series of steps that engineers use in creating functional products and processes. The process is highly iterative - parts of the process often need to be repeated many times before the production phase can be entered - though the part(s) that get iterated and the number of such cycles in any given project can be highly variable. The engineering design process describes the following stages: 1) Research 2) Conceptualization 3) Feasibility assessment 4) Establishing design requirements 5) Preliminary design 6) Detailed design 7) Production planning and tool design, and 8) Production.
1. Research: Research is a careful and detailed study into a specific problem, concern, or issue using the scientific method. Research can be about anything, and we hear about all different types of research in the news. Cancer research has produced headlines such as 'Breakthrough Cancer-Killing Treatment Has No Side Effects in Mice' and 'Baby Born with HIV Cured.' Each of these began with an issue or a problem (such as cancer or HIV), and they had a question, like, 'Does medication X reduce cancerous tissue or HIV infections?' But so far this only says what research has done (sort of like saying baking leads to apple pie; it doesn't really tell you anything other than that the two are connected). To begin researching something, you have to have a problem, concern, or issue that has turned into a question. These can come from observing the world, prior research, professional literature, or from peers. Research really begins with the right question, because your question must be answerable. Questions like, 'How can I cure cancer?' aren't really answerable with a study; they are too vague and not testable.
2. Conceptualization: Conceptualization is the mental process of organizing one's observations and experiences into meaningful and coherent wholes. In research, conceptualization produces an agreed-upon meaning for a concept for the purposes of research. Different researchers may conceptualize a concept slightly differently. Conceptualization describes the indicators we'll use to measure the concept and the different aspects of the concept.
3. Feasibility assessment: Feasibility studies are almost always conducted where large sums are at stake. Also called feasibility analysis. For example, to ensure that a manufacturing facility can make a new item, engineers launch a feasibility study to determine the actual steps required to build the product.
4. Design requirements: The product/component to be analysed is characterised in terms of: functional requirements, the objective of the materials selection process, constraints imposed by the requirements of the application, plus the free variable, which is usually one of the geometric dimensions of the product/component, such as thickness, which enables the constraints to be satisfied and the objective function to be maximised or minimised, depending on the application.
Hence, the design requirement of the part/component is defined in terms of function, objective and constraints.
5. Preliminary design: The preliminary design, or high-level design, often bridges a gap between design conception and detailed design, particularly in cases where the level of conceptualization achieved during ideation is not sufficient for full evaluation. In this task, the overall system configuration is defined, and schematics, diagrams, and layouts of the project may provide early project configuration. This notably varies a lot by field, industry, and product. During detailed design and optimization, the parameters of the part being created will change, but the preliminary design focuses on creating the general framework to build the project on.
6. Detailed design: The detailed design phase, which may also include procurement of materials, further elaborates each aspect of the project/product by complete description through solid modelling, drawings, as well as specifications.
7. Production planning and tool design: Production planning and tool design consists of planning how to mass-produce the product and which tools should be used in the manufacturing process. Tasks to complete in this step include selecting materials, selection of the production processes, determination of the sequence of operations, and selection of tools such as jigs, fixtures, and metal-cutting and metal- or plastics-forming tools. This task also involves additional prototype testing iterations to ensure the mass-produced version meets qualification testing standards.
8. Production: Production is a process of workers combining various material inputs and immaterial inputs (plans, know-how) in order to make something for consumption (the output). It is the act of creating output, a good or service which has value and contributes to the utility of individuals.
Manufacturing: Manufacturing is the production of goods for use or sale using labour and machines, tools, chemical and biological processing, or formulation. The term may refer to a range of human activity, from handicraft to high tech, but is most commonly applied to industrial production, in which raw materials are transformed into finished goods on a large scale. Such finished goods may be used for manufacturing other, more complex products, such as aircraft, household appliances or automobiles, or sold to wholesalers, who in turn sell them to retailers, who then sell them to end users – the "consumers". Manufacturing takes place under all types of economic systems. In a free market economy, manufacturing is usually directed toward the mass production of products for sale to consumers at a profit. In a collectivist economy, manufacturing is more frequently directed by the state to supply a centrally planned economy. In mixed market economies, manufacturing occurs under some degree of government regulation. Modern manufacturing includes all intermediate processes required for the production and integration of a product's components. Some industries, such as semiconductor and steel manufacturers, use the term fabrication instead. The manufacturing sector is closely connected with engineering and industrial design. Examples of major manufacturers in North America include General Motors Corporation, General Electric, Procter & Gamble, General Dynamics, Boeing, Pfizer, and Precision Castparts. Examples in Europe include Volkswagen Group, Siemens, and Michelin.
Examples in Asia include Sony, Huawei, Lenovo, Toyota, Samsung, and Bridgestone.
OR 8) Compare and contrast the manufacturing and production activities data analysis with respect to technology. [10]
Production: Production is a process of workers combining various material inputs and immaterial inputs (plans, know-how) in order to make something for consumption (the output). It is the act of creating output, a good or service which has value and contributes to the utility of individuals.
Manufacturing: Manufacturing is the production of goods for use or sale using labour and machines, tools, chemical and biological processing, or formulation. The term may refer to a range of human activity, from handicraft to high tech, but is most commonly applied to industrial production, in which raw materials are transformed into finished goods on a large scale. Such finished goods may be used for manufacturing other, more complex products, such as aircraft, household appliances or automobiles, or sold to wholesalers, who in turn sell them to retailers, who then sell them to end users – the "consumers". Manufacturing takes place under all types of economic systems. In a free market economy, manufacturing is usually directed toward the mass production of products for sale to consumers at a profit. In a collectivist economy, manufacturing is more frequently directed by the state to supply a centrally planned economy. In mixed market economies, manufacturing occurs under some degree of government regulation. Modern manufacturing includes all intermediate processes required for the production and integration of a product's components. Some industries, such as semiconductor and steel manufacturers, use the term fabrication instead. The manufacturing sector is closely connected with engineering and industrial design. Examples of major manufacturers in North America include General Motors Corporation, General Electric, Procter & Gamble, General Dynamics, Boeing, Pfizer, and Precision Castparts. Examples in Europe include Volkswagen Group, Siemens, and Michelin. Examples in Asia include Sony, Huawei, Lenovo, Toyota, Samsung, and Bridgestone.
Production lines: A production line is an arrangement in a factory in which a thing being manufactured is passed through a set linear sequence of mechanical or manual operations. A production line is a set of sequential operations established in a factory whereby materials are put through a refining process to produce an end-product that is suitable for onward consumption, or components are assembled to make a finished article.
Technology: The application of scientific knowledge for practical purposes, especially in industry. Technology can be the knowledge of techniques, processes, and the like, or it can be embedded in machines which can be operated without detailed knowledge of their workings. The human species' use of technology began with the conversion of natural resources into simple tools. The prehistoric discovery of how to control fire and the later Neolithic Revolution increased the available sources of food, and the invention of the wheel helped humans to travel in and control their environment. Developments in historic times, including the printing press, the telephone, and the Internet, have lessened physical barriers to communication and allowed humans to interact freely on a global scale. The steady progress of military technology has brought weapons of ever-increasing destructive power, from clubs to nuclear weapons. Technology has many effects.
It has helped develop more advanced economies (including today's global economy) and has allowed the rise of a leisure class. Many technological processes produce unwanted byproducts known as pollution and deplete natural resources to the detriment of Earth's environment. Various implementations of technology influence the values of a society and new technology often raises new ethical questions. Examples include the rise of the notion of efficiency in terms of human productivity, and the challenges of bioethics. Philosophical debates have arisen over the use of technology, with disagreements over whether technology improves the human condition or worsens it. Neo-Luddism, anarcho- primitivism, and similar reactionary movements criticise the pervasiveness of technology in the modern world, arguing that it harms the environment and alienates people; proponents of ideologies such as transhumanism and techno-progressivism view continued technological progress as beneficial to society and the human condition. Until recently, it was believed that the development of technology was restricted only to human beings, but 21st century scientific studies indicate that other primates and certain dolphin communities have developed simple tools and passed their knowledge to other generations

9) Explain SMART utilities in detail [10] SMART Utilities: S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often written as SMART) is a monitoring system included in computer hard disk drives (HDDs) and solid-state drives (SSDs) that detects and reports on various indicators of drive reliability, with the intent of enabling the anticipation of hardware failures. When S.M.A.R.T. data indicates a possible imminent drive failure, software running on the host system may notify the user so stored data can be copied to another storage device, preventing data loss, and the failing drive can be replaced

The BA process can solve problems and identify opportunities to improve business performance. In the process, organizations may also determine strategies to guide operations and help achieve competitive advantages. Typically, solving problems and identifying strategic opportunities to follow are organization decision-making tasks. The latter, identifying opportunities can be viewed as a problem of strategy choice requiring a solution

OR 10) Why should we gather requirements before starting the implementation of the project? Explain [10] Requirements gathering: Gather all the data related to the business objective

There are many different approaches that can be used to gather information about a business. They include the following:
• Review business plans, existing models and other documentation
• Interview subject area experts
• Conduct fact-finding meetings
• Analyze application systems, forms, artifacts, reports, etc.

The business analyst should use one-on-one interviews early in the business analysis project to gauge the strengths and weaknesses of potential project participants and to obtain basic information about the business. Large meetings are not a good use of time for data gathering. Facilitated work sessions are a good mechanism for validating and refining "draft" requirements. They are also useful to prioritize final business requirements. Group dynamics can often generate even better ideas. Primary or local data is collected by the business owner and can be collected by survey, focus group or observation. Third-party static data is purchased in bulk without a specific intent in mind. While easy to get (if you have the cash), this data is not specific to your business and can be tough to sort through, as you often get quite a bit more data than you need to meet your objective. Dynamic data is collected through a third-party process in near real-time from an event for a specific purpose (read into that: very expensive). There are three key questions you need to ask before making a decision about the best method for your firm:
• What is the timeline required to accomplish your business objective?
• What is your required return on investment?
• Is the data collection for a stand-alone event or part of a broader data collection effort?
How to interpret data to make it useful for business: Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes. BI technologies are capable of handling large amounts of unstructured data to help identify, develop and otherwise create new strategic business opportunities. The goal of BI is to allow for the easy interpretation of these large volumes of data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability. BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics. BI can be used to support a wide range of business decisions ranging from operational to strategic. Basic operating decisions include product positioning or pricing. Strategic business decisions include priorities, goals and directions at the broadest level. In all cases, BI is most effective when it combines data derived from the market in which a company operates (external data) with data from company sources internal to the business such as financial and operations data (internal data). When combined, external and internal data can provide a more complete picture which, in effect, creates an "intelligence" that cannot be derived by any singular set of data.

Business intelligence can be applied to the following business purposes, in order to drive business value.  Measurement – program that creates a hierarchy of performance metrics (see also Metrics Reference Model) and benchmarking that informs business leaders about progress towards business goals (business process management).  Analytics – program that builds quantitative processes for a business to arrive at optimal decisions and to perform business knowledge discovery. Frequently involves: data mining, process mining, statistical analysis, predictive analytics, predictive modeling, business process modeling, data lineage, complex event processing and prescriptive analytics.  Reporting/enterprise reporting – program that builds infrastructure for strategic reporting to serve the strategic management of a business, not operational reporting. Frequently involves data visualization, executive information system and OLAP.  Collaboration/collaboration platform – program that gets different areas (both inside and outside the business) to work together through data sharing and electronic data interchange.  Knowledge management – program to make the company data-driven through strategies and practices to identify, create, represent, distribute, and enable adoption of insights and experiences that are true business knowledge. Knowledge management leads to learning management and regulatory compliance

------END OF IA------