N an Open Source Statistical, Graphics and a Numerical Package, an Extension of S Developed

INTRO TO R (Part-1)

➢ R is an open-source statistical, graphics, and numerical package. It implements the S language, which was developed at Bell Labs.

➢ An excellent tool for

● Linear modeling

● Nonlinear modeling

● Classification

● Time-series analysis

● Clustering

● Integration, Optimization

● Classical Statistical analysis

● Graphical Techniques

➢ All free

➢ C/C++ and FORTRAN code can be linked in

➢ R is extended via packages

➢ R is aimed at data scientists working with large databases

➢ To download R for free, use, for example (takes about 33 MB):

cran.r-project.org/bin/windows/base

➢ Do not download and install all packages. They are too large.

➢ To start R, click on the desktop icon. To quit R, just use q(). R then asks whether to save the workspace image; answer at that prompt.

➢ To see which libraries are installed on your hard drive, use library()

➢ To see which base packages are currently loaded, try sessionInfo()

➢ To see which functions a package provides, use the command help(package="package_name")

➢ Use the library function to load a specific package from the hard drive into memory:

library("package_name")

➢ To install a package and then use it, try the commands

install.packages("package_name")
library("package_name")

Basic R

➢ R deals with objects like scalars, vectors, matrices (arrays), tables, and lists

Example of scalar operations.

> x=2
> x
[1] 2
> x+3
[1] 5
> x==2
[1] TRUE
> y=5
> (y==4)+(x==2)   # sum of logicals: nonzero result acts like Boolean OR
[1] 1
> (y==4)*(x==2)   # product of logicals: nonzero result acts like Boolean AND
[1] 0
> (y==4)||(x==2)  # Logical OR
[1] TRUE
> (y==4)&&(x==2)  # Logical AND
[1] FALSE
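The difference between the arithmetic and the logical forms can be checked directly. The sketch below is only an illustration (the names a, b, u and v are introduced here, not taken from the slides); it also shows the element-wise operators & and |, which work on whole vectors, whereas && and || are meant for single values.

```r
x <- 2
y <- 5
a <- (y == 4)   # FALSE
b <- (x == 2)   # TRUE

# Arithmetic on logicals treats TRUE as 1 and FALSE as 0
sum_ab  <- a + b    # nonzero when at least one is TRUE
prod_ab <- a * b    # nonzero only when both are TRUE

# Element-wise logical operators work on vectors of any length
u <- c(TRUE, FALSE, TRUE)
v <- c(TRUE, TRUE, FALSE)
and_uv <- u & v
or_uv  <- u | v

print(c(sum_ab, prod_ab))
print(and_uv)
print(or_uv)
```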

> y=2;
> x=3;
> z=y^2+x^3+sin(x)
> z
[1] 31.14112

> x=8.9;
> sqrt(x)
[1] 2.983287
> log(x)
[1] 2.186051
> log10(x)
[1] 0.94939
> log2(x)
[1] 3.153805
> exp(log2(x))
[1] 23.42504
> 2^x
[1] 477.7129
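Note that exp(log2(x)) does not return x, because exp inverts the natural logarithm rather than the base-2 one. A small sketch of the matching inverse pairs follows; the variable names and the all.equal comparison are added here for illustration.

```r
x <- 8.9

# Each logarithm is undone by raising its own base to the result
back_natural <- exp(log(x))     # base e
back_base2   <- 2^log2(x)       # base 2
back_base10  <- 10^log10(x)     # base 10

# log() also accepts an arbitrary base
same_as_log2 <- log(x, base = 2)

print(all.equal(back_base2, x))
print(all.equal(same_as_log2, log2(x)))
```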

Operators in R

> x=9
> y=2
> x+y
[1] 11
> x-y
[1] 7
> x*y
[1] 18
> x/y
[1] 4.5
> x%/%y #integer division
[1] 4
> x**y #exponentiation
[1] 81
> x%%y #Modulus operation
[1] 1

Examples of vectors in R. All elements in a vector must be of the same type.

> x=1:5;
> x
[1] 1 2 3 4 5
> x=9:-3 # sequencing in the opposite direction
> x
[1] 9 8 7 6 5 4 3 2 1 0 -1 -2 -3
> x=seq(1,5,by=0.5)
> x
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> y=seq(1,5,length=15)
> y
[1] 1.000000 1.285714 1.571429 1.857143 2.142857 2.428571 2.714286 3.000000 3.285714 3.571429 3.857143 4.142857 4.428571 4.714286 5.000000

Another way (via concatenation)

> x=c(1.2,3,-3.4,8.5,0.16)
> x
[1] 1.20 3.00 -3.40 8.50 0.16

> x=c("John","Jacob","Jeff");
> x
[1] "John" "Jacob" "Jeff"
> y=c("Emily","Amy");
> z=c(x,y)
> z
[1] "John" "Jacob" "Jeff" "Emily" "Amy"

> x=c(T,F,F,T,F,T,T) #Vector of logical values
> x
[1] TRUE FALSE FALSE TRUE FALSE TRUE TRUE
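Because all elements of a vector share one type, R silently coerces mixed input to the most general type present. A short sketch of this behaviour, with values made up for illustration:

```r
mixed_num_chr <- c(1.5, "John", TRUE)   # everything becomes character
mixed_num_lgl <- c(1.5, TRUE, FALSE)    # logicals become 1 and 0

print(class(mixed_num_chr))
print(mixed_num_chr)
print(class(mixed_num_lgl))
print(mixed_num_lgl)
```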

To repeat a subsequence within a sequence

> x=c(3,6,rep(c(4,-6),3),8,10) # Repeat 3 times
> x
[1] 3 6 4 -6 4 -6 4 -6 8 10
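The rep function can repeat either the whole pattern or each element in turn. A brief sketch using its times and each arguments (the variable names are illustrative):

```r
by_times <- rep(c(4, -6), times = 3)   # repeat the whole pattern three times
by_each  <- rep(c(4, -6), each = 3)    # repeat each element three times

print(by_times)
print(by_each)
```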

Great! Now, the next question! Given a vector, what can we glean from it? Basically, what it looks like in terms of central tendency and spread, a few fundamental plots, etc.

What are the usual metrics for a set of data that interest us?

➢ Min, Max: Extremes
➢ Mean: Central tendency
➢ Median: Central tendency
➢ Standard Deviation, Variance: Spread
➢ Quartiles: Spread
➢ Range: Spread

We start with a vector.

> x=c(34,12,-5,41,62,19,81,68,49,88, 86, 13, 24);

> mean(x)
[1] 44
> median(x)
[1] 41
> var(x)
[1] 959.5
> sd(x)
[1] 30.9758
> min(x)
[1] -5
> max(x)
[1] 88
> range(x)
[1] -5 88
> quantile(x)
  0%  25%  50%  75% 100%
  -5   19   41   68   88
> IQR(x) #Interquartile range Q3-Q1
[1] 49
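These numbers can be reproduced from their definitions. The sketch below recomputes the sample variance, standard deviation and interquartile range by hand for the same vector; the manual_* names are introduced here for illustration, and summary() is shown as a one-call overview.

```r
x <- c(34, 12, -5, 41, 62, 19, 81, 68, 49, 88, 86, 13, 24)
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)
manual_sd  <- sqrt(manual_var)

# Interquartile range from the quartiles
q <- quantile(x)
manual_iqr <- unname(q["75%"] - q["25%"])

print(manual_var)
print(manual_sd)
print(manual_iqr)
print(summary(x))   # min, quartiles, median, mean, max in one call
```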

This gives us an idea of the vector we are dealing with. We get more information by visualizing it.

Some simple plots.

> hist(x,col="blue") #Get a histogram

[Figure: histogram titled "Histogram of x"; x-axis: x, y-axis: Frequency]

Get a histogram with different colors and labels

> hist(x,col=c("blue","yellow","magenta","red"));

[Figure: histogram titled "Histogram of x"; x-axis: x, y-axis: Frequency]

> hist(x,col=c("blue","yellow","magenta","red"),
+ main="PERFORMANCE FREQUENCY DISTRIBUTION",xlab="REVENUE",
+ ylab="PERCENTAGE EARNING");

[Figure: histogram titled "PERFORMANCE FREQUENCY DISTRIBUTION"; x-axis: REVENUE, y-axis: PERCENTAGE EARNING]
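The bar heights behind these plots can also be inspected without drawing anything. A minimal sketch, assuming the default break points that hist chooses: with plot=FALSE the function returns the histogram object instead of plotting it. The proportions vector is an illustrative addition.

```r
x <- c(34, 12, -5, 41, 62, 19, 81, 68, 49, 88, 86, 13, 24)

h <- hist(x, plot = FALSE)   # compute the bins, draw nothing

print(h$breaks)   # bin boundaries
print(h$counts)   # number of observations in each bin

# Share of observations per bin, should a proportion scale be wanted
proportions <- h$counts / sum(h$counts)
print(round(proportions, 3))
```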

Matrices and arrays in R.

A matrix is a vector organized into a 2D structure with rows and columns. An array is a generalization of this structure to more dimensions.

Matrix construction.

> m1 = matrix(c(5,4,8,6,3,10), nrow=2, ncol=3)
> m1
     [,1] [,2] [,3]
[1,]    5    8    3
[2,]    4    6   10

Note how the matrix is constructed by columns by default.

> t(m1) # transpose of the matrix m1
     [,1] [,2]
[1,]    5    4
[2,]    8    6
[3,]    3   10

Combining matrices. Suppose we have two matrices with the same number of rows. Then they can be combined to form one composite matrix using the cbind operation, as shown below.

> m2=matrix(c(1,2,-1,-2),nrow=2, ncol=2)
> m2
     [,1] [,2]
[1,]    1   -1
[2,]    2   -2

> m3=cbind(m1,m2)
> m3
     [,1] [,2] [,3] [,4] [,5]
[1,]    5    8    3    1   -1
[2,]    4    6   10    2   -2

We can also combine matrices using the rbind operation, as shown. In this case they must have the same number of columns.

> m2=matrix(c(1,2,3,-4,-5,-6,7,-8,9), nrow=3, ncol=3)
> m2
     [,1] [,2] [,3]
[1,]    1   -4    7
[2,]    2   -5   -8
[3,]    3   -6    9

> m3=rbind(m1,m2)
> m3
     [,1] [,2] [,3]
[1,]    5    8    3
[2,]    4    6   10
[3,]    1   -4    7
[4,]    2   -5   -8
[5,]    3   -6    9
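Before binding, the dimension requirement can be checked explicitly. The following sketch uses m1 and the 2 x 2 version of m2 from the cbind example; the can_cbind and can_rbind flags are illustrative names, not part of R.

```r
m1 <- matrix(c(5, 4, 8, 6, 3, 10), nrow = 2, ncol = 3)
m2 <- matrix(c(1, 2, -1, -2), nrow = 2, ncol = 2)

# cbind needs equal row counts; rbind needs equal column counts
can_cbind <- nrow(m1) == nrow(m2)
can_rbind <- ncol(m1) == ncol(m2)

if (can_cbind) print(dim(cbind(m1, m2)))
if (!can_rbind) print("rbind(m1, m2) would fail: column counts differ")
```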

More examples of using rbind (or cbind):

> m=rbind(c(1,2,3),c(4,5,6))
> m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Obviously we don't need both the nrow and ncol specifications. Either one will do, as long as nrow × ncol = number of matrix elements.

> m1 = matrix(c(5,4,8,6,3,10), ncol=3)
> m1
     [,1] [,2] [,3]
[1,]    5    8    3
[2,]    4    6   10

But what if we want the entire vector to be distributed over 4 columns? See how R handles it.

> m1=matrix(c(5,4,8,6,3,10), ncol=4)
Warning message:
In matrix(c(5, 4, 8, 6, 3, 10), ncol = 4) :
  data length [6] is not a sub-multiple or multiple of the number of columns [4]

> m1
     [,1] [,2] [,3] [,4]
[1,]    5    8    3    5
[2,]    4    6   10    4

R issues a warning and fills the remaining cells by recycling the vector from its start.
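Recycling is silent when the data length divides the number of cells evenly. The sketch below captures the warning text for the 4-column case with tryCatch and contrasts it with a case that recycles without complaint; the names msg and silent are illustrative.

```r
v <- c(5, 4, 8, 6, 3, 10)

# Capture the warning text instead of letting it print
msg <- tryCatch(
  { matrix(v, ncol = 4); "no warning" },
  warning = function(w) conditionMessage(w)
)
print(msg)

# Length 2 divides the 6 cells evenly, so recycling is silent
silent <- matrix(c(1, 2), nrow = 2, ncol = 3)
print(silent)
```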

Here is a shorter way to construct a matrix m out of a vector v.

> v=c(3,1,2,-5,-6,-7);
> m= matrix(v,3,2) # 3 rows and 2 cols
> m
     [,1] [,2]
[1,]    3   -5
[2,]    1   -6
[3,]    2   -7

What if I want to construct a matrix by row instead of by column (the default option)?

> m=matrix(v,ncol=2,byrow=TRUE)
> m
     [,1] [,2]
[1,]    3    1
[2,]    2   -5
[3,]   -6   -7

To organize a set of data from a file into a matrix, we first read it into a vector with the function scan and then shape the vector. Suppose the file data1.txt contains the data

2 3 4 8 18 -9 5 6 7 8 9 67

> v=scan("D:\\Rdata\\data1.txt")
Read 12 items
> v
[1] 2 3 4 8 18 -9 5 6 7 8 9 67

> m=matrix(v,ncol=3)
> m
     [,1] [,2] [,3]
[1,]    2   18    7
[2,]    3   -9    8
[3,]    4    5    9
[4,]    8    6   67

Standard input from a file and output to a file can be invoked via source("myfile") and sink("myfile", append=FALSE, split=FALSE)

If the append option is FALSE, output overwrites the file; otherwise it is appended. If split is TRUE, output is directed both to the file and to the console; otherwise to the file only.
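A self-contained sketch of scan, sink and source working together is given below. It assumes temporary files created with tempfile() in place of data1.txt and "myfile", so the file names and the variable total are illustrative rather than taken from the slides.

```r
# Write the sample data to a temporary file (stands in for data1.txt)
data_file <- tempfile(fileext = ".txt")
writeLines("2 3 4 8 18 -9 5 6 7 8 9 67", data_file)

v <- scan(data_file, quiet = TRUE)   # read the numbers into a vector
m <- matrix(v, ncol = 3)

# Redirect console output to a file, then restore it
out_file <- tempfile(fileext = ".txt")
sink(out_file, append = FALSE, split = FALSE)
print(m)
sink()   # sink() with no arguments ends the redirection

# Put an R command in a script file and run it with source()
script_file <- tempfile(fileext = ".R")
writeLines("total <- sum(c(2, 3, 4))", script_file)
source(script_file)

print(readLines(out_file))
print(total)
```

Calling sink() again without arguments matters: until then, everything printed keeps going to the file.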

We can also create a matrix object in the following way: start with a vector, then impose dimensions on it.

> v=1:18
> dim(v)=c(3,6) # 3 rows and 6 cols
> v
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7   10   13   16
[2,]    2    5    8   11   14   17
[3,]    3    6    9   12   15   18

We can also do this with the array function.

> m=array(1:18, dim=c(3,6))
> m
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7   10   13   16
[2,]    2    5    8   11   14   17
[3,]    3    6    9   12   15   18
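Since an array generalizes a matrix to more dimensions, the same call accepts a longer dim vector. A brief sketch with a hypothetical 2 x 3 x 4 array:

```r
a <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers

print(dim(a))
print(a[, , 1])     # the first layer is an ordinary 2 x 3 matrix
print(a[2, 3, 4])   # a single element: row 2, column 3, layer 4
```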

Obtaining parts from a matrix m.

The 4th row of m goes to p, and the 3rd column of m goes to q. Here m is the 4 x 3 matrix built from the scanned data, m=matrix(v,ncol=3).

> p=m[4,] #4th row of m assigned to p
> p
[1] 8 6 67

> q=m[,3] #3rd column of m assigned to q
> q
[1] 7 8 9 67

The next two selections use the 3 x 6 matrix m=array(1:18, dim=c(3,6)).

> p=m[1:2,5:6] #select rows 1 to 2 and cols 5 to 6
> p
     [,1] [,2]
[1,]   13   16
[2,]   14   17

> p=m[,-6] #delete col 6 from m and assign rest to p
> p
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15
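Selecting a single row or column returns a plain vector, as the outputs for p and q show. If the matrix shape should be kept, the drop=FALSE argument can be used. A short sketch on the 3 x 6 matrix, with illustrative variable names:

```r
m <- array(1:18, dim = c(3, 6))

row_as_vector <- m[2, ]                 # dimensions are dropped
row_as_matrix <- m[2, , drop = FALSE]   # stays a 1 x 6 matrix

print(is.matrix(row_as_vector))
print(dim(row_as_matrix))

# Negative indices can remove several columns at once
trimmed <- m[, -c(1, 6)]
print(dim(trimmed))
```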

A Data frame. The most commonly used object in R. It is an extension of a matrix in which columns may have different modes: numeric, character, factor. All the columns of a data frame are related. In this sense, a data frame is a table where each row represents a case or an example, while each column represents measurements on one variable.

> marks=c(45,62,89,76,92);
> names=c("John","Ed","Scott","Paul","Bob");
> passed=c(F,F,T,T,T);

> mydf=data.frame(names,marks,passed)
> mydf
  names marks passed
1  John    45  FALSE
2    Ed    62  FALSE
3 Scott    89   TRUE
4  Paul    76   TRUE
5   Bob    92   TRUE

How do we add new rows to a data frame? Create a new data frame with the same structure and then use rbind to bind it to the table.

> newrow=data.frame(names="Chris",marks=42,passed=F);
> mydf=rbind(mydf,newrow)
> mydf
  names marks passed
1  John    45  FALSE
2    Ed    62  FALSE
3 Scott    89   TRUE
4  Paul    76   TRUE
5   Bob    92   TRUE
6 Chris    42  FALSE

How about appending multiple rows at a time to a data frame?

> mydf=rbind(mydf,
+ data.frame(names="Meghan",marks=72,passed=T),
+ data.frame(names="Laurie",marks=93,passed=T),
+ data.frame(names="Charlie",marks=22,passed=F))

> mydf
    names marks passed
1    John    45  FALSE
2      Ed    62  FALSE
3   Scott    89   TRUE
4    Paul    76   TRUE
5     Bob    92   TRUE
6   Chris    42  FALSE
7  Meghan    72   TRUE
8  Laurie    93   TRUE
9 Charlie    22  FALSE
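When appending rows, each value should have the same type as the column it joins. The sketch below uses a shortened version of the table to show what happens to the marks column when a number is supplied in quotes; the names quoted and unquoted are illustrative.

```r
mydf <- data.frame(names  = c("John", "Ed"),
                   marks  = c(45, 62),
                   passed = c(FALSE, FALSE))

quoted   <- rbind(mydf, data.frame(names = "Chris", marks = "42", passed = FALSE))
unquoted <- rbind(mydf, data.frame(names = "Chris", marks = 42,   passed = FALSE))

print(class(quoted$marks))     # no longer numeric
print(class(unquoted$marks))   # still numeric
print(mean(unquoted$marks))    # arithmetic works only on the numeric column
```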

One can preallocate a data frame, particularly if it is large. Using the numeric and the character functions, we indicate how many entries each column needs, so that the memory is reserved when the data frame is built.

Example.

> n=100000
> dfrm=data.frame(names=character(n), marks=numeric(n), passed=character(n))

We now have a table with 100000 rows. We have to be careful, though, in handling factors. Suppose income is a factor variable that takes exactly one of three levels: "low", "medium", "high". We would then initialize the data frame as

> n=100000
> dfrm=data.frame(names=character(n), marks=numeric(n),
+ passed=character(n), income=factor(rep(NA,n), levels=c("low",
+ "medium", "high")))   # n missing values with three allowed levels
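Once the table is preallocated, rows are filled in place rather than appended. A minimal sketch with a small n follows. Two choices here are assumptions made for the illustration and differ from the text: passed is allocated with logical(n), and stringsAsFactors=FALSE is passed so that the names column stays character in every R version.

```r
n <- 5   # small n for illustration; the text uses 100000

dfrm <- data.frame(names  = character(n),
                   marks  = numeric(n),
                   passed = logical(n),
                   income = factor(rep(NA, n), levels = c("low", "medium", "high")),
                   stringsAsFactors = FALSE)

# Fill a row in place instead of growing the table with rbind
dfrm$names[1]  <- "John"
dfrm$marks[1]  <- 45
dfrm$passed[1] <- FALSE
dfrm$income[1] <- "high"   # must be one of the declared levels

str(dfrm)
print(dfrm[1, ])
```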

Factors are categorical or nominal variables, like the "gender" of a person, given by the two levels "male" and "female". The levels of a factor may be relabeled if we so need.

> x=c(1,2,2,3,1,1,3,2,3,1,1,2)
> factor(x)
[1] 1 2 2 3 1 1 3 2 3 1 1 2
Levels: 1 2 3
> y=factor(x,labels=c("Mini","Minie","Moe"))
> y
[1] Mini Minie Minie Moe Mini Mini Moe Minie Moe Mini Mini Minie
Levels: Mini Minie Moe

Since factors are enumerated data, one cannot take their mean, standard deviation, etc. We have to convert such data into numeric form first. Notice what the numeric equivalent of a factor is: as.numeric applied directly to a factor returns the integer codes of its levels, not the original values.

> x=c(14, 8, 3, 6, 21, 4)
> f=factor(x)
> f
[1] 14 8 3 6 21 4
Levels: 3 4 6 8 14 21
> as.numeric(f) #This shows the rank-ordering of the nominal values
[1] 5 4 1 3 6 2
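To recover the original numbers rather than the level codes, the factor is usually converted to character first. A short sketch, with illustrative variable names:

```r
x <- c(14, 8, 3, 6, 21, 4)
f <- factor(x)

codes      <- as.numeric(f)                 # integer codes of the levels
values     <- as.numeric(as.character(f))   # the original numbers
values_alt <- as.numeric(levels(f))[f]      # same result via the level labels

print(codes)
print(values)
print(mean(values))   # now a meaningful mean
```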

Lists as R objects.

➢ A list is an extension of a vector object whose components may be of different types. The components of a list may themselves be lists.

Example. Start with 3 vectors a, b and c. Then form a list using them.

> a=c(1,2,5,4,8);
> b=c("John","Amos","Rafa","Nadia");
> c=c(T,F,T,F);
> x=list(14,a,b,c)

> #List slicing with single brackets [] gives us a sub-list.
> x[1] #1st element of list x
[[1]]
[1] 14

> x[3]
[[1]]
[1] "John" "Amos" "Rafa" "Nadia"

> x[8]
[[1]]
NULL

> #Member Reference. Use Double square brackets [[]]
> #to refer to a member

> x[[2]]
[1] 1 2 5 4 8
> x[[4]]
[1] TRUE FALSE TRUE FALSE
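The practical difference between the two bracket forms is the type of the result. A small sketch, using a shortened version of the list and illustrative variable names:

```r
a <- c(1, 2, 5, 4, 8)
b <- c("John", "Amos", "Rafa", "Nadia")
x <- list(14, a, b)

sub_list <- x[2]     # a list of length 1 that contains a
member   <- x[[2]]   # the vector a itself

print(class(sub_list))
print(class(member))
print(sum(member))    # arithmetic works on the member
print(x[[3]][2])      # second element of the third member
```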

List reference by named members. The members of a list may be named, and can then be referenced by name.

> somelist=list(bride=c(281,12,46),groom=c("John","Bob","Cindy"))
> somelist
$bride
[1] 281 12 46

$groom
[1] "John" "Bob" "Cindy"

> somelist[["bride"]]
[1] 281 12 46
> somelist[["groom"]]
[1] "John" "Bob" "Cindy"
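Named members can also be reached with the $ operator, and new members can be added by name. A brief sketch; the venue member is a hypothetical addition for illustration:

```r
somelist <- list(bride = c(281, 12, 46), groom = c("John", "Bob", "Cindy"))

print(names(somelist))
print(somelist$bride)            # same as somelist[["bride"]]

somelist$venue <- "City Hall"    # hypothetical new member added by name
print(length(somelist))
```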

A Data Frame is a list. Let us create a data frame and see what we can do with it.

> ID=c(6249,3456,1238,4163,6709,5609,4512,7890,2345,3456, + 2363,2164,5431,5896,8856,8712,8012,7809,3234,3246)

> Weight=c(129,78,89,134,182,96,112,132,202,101, + 133,78,89,134,96,114,145,133,203,102)

> Sleeptime=c(10,7,7,8,10,6,8,8,12,7, + 8,7,7,9,6,8,9,7,12,7)

> Gender=c("M","F","F","M","M","F","M","M","M","F", + "F","F","M","M","M","F","M","M","M","F")

> Performance=c(76,82,86,88,45,79,92,89,48,86,88,89,77,75,77,67,68,45,
+ 82,87)

> dfrm=data.frame(ID,Weight,Sleeptime,Gender,Performance)
> dfrm

To see the structure of our data frame, run the structure function str().

> str(dfrm)
'data.frame': 20 obs. of 5 variables:
 $ ID         : num 6249 3456 1238 4163 6709 ...
 $ Weight     : num 129 78 89 134 182 96 112 132 202 101 ...
 $ Sleeptime  : num 10 7 7 8 10 6 8 8 12 7 ...
 $ Gender     : Factor w/ 2 levels "F","M": 2 1 1 2 2 1 2 2 2 1 ...
 $ Performance: num 76 82 86 88 45 79 92 89 48 86 ...

This gives you a precise idea of what to expect within this data frame. Use this function first to figure out the structure of an unknown data frame.
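Whether Gender shows up as a factor depends on the R version: from R 4.0.0 onward, data.frame keeps character columns as character unless told otherwise. A small sketch with a shortened Gender column, requesting each behaviour explicitly through the stringsAsFactors argument (the data frame names are illustrative):

```r
Gender <- c("M", "F", "F", "M")

as_factor <- data.frame(Gender, stringsAsFactors = TRUE)
as_text   <- data.frame(Gender, stringsAsFactors = FALSE)

str(as_factor)
str(as_text)

# A single column can also be converted after the fact
as_text$Gender <- factor(as_text$Gender)
print(levels(as_text$Gender))
```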

Adding a new Column. I want to add a column called var1, where

var1=(Weight)*(Sleeptime)/(Performance^2)

This is how I'd do it.

> dfrm$var1=dfrm$Weight*dfrm$Sleeptime/(dfrm$Performance^2)

Now the new dfrm looks like

Well, we have too many digits after the decimal point in var1; we should probably reduce them using the round function, as follows.

> dfrm$var1=round(dfrm$var1,3) # Use 3 digits after decimal only
> dfrm

➢ Fair enough! If we can add a new column, we should be able to delete a column as well. How do we delete the 6th column, for instance?

> dfrm=dfrm[,-6]
> dfrm

How else can we enter another column?

□ Use the transform( ) function, as below

> dfrm = transform(dfrm, var1=Weight*Sleeptime/(Performance^2))
> dfrm

□ Use the apply( ) function. It is a little tricky, though. The generic form of the apply operation is as follows:

> dfrm$var2=apply(dfrm, 1, function(x) { })

Meaning: the first argument on the right-hand side tells apply to work on the entire dfrm table. The second argument, 1, means the operation is performed on each row. If we want to apply the function to each column, this argument would be 2 instead. The function's argument x stands for one whole row (or column) at a time, and the body of the function follows next. Therefore

dfrm$var2=apply(dfrm, 1, function(x) sum(x))

means adding all the elements of each row to get var2. In our case it would not work, since we have factors in our table. One way of doing it would be to select only the specific columns of dfrm which we want to work with. This is how we may do it:
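As a minimal sketch of that idea (not necessarily the author's own code), the example below builds a shortened version of dfrm, restricts apply to the numeric columns by name, and sums each row. The num_cols vector and the col_means result are illustrative additions.

```r
dfrm <- data.frame(Weight      = c(129, 78, 89),
                   Sleeptime   = c(10, 7, 7),
                   Gender      = c("M", "F", "F"),
                   Performance = c(76, 82, 86))

num_cols <- c("Weight", "Sleeptime", "Performance")   # columns chosen by hand

# The 1 means "apply the function to each row"
dfrm$var2 <- apply(dfrm[, num_cols], 1, function(x) sum(x))

# The 2 means "apply the function to each column"
col_means <- apply(dfrm[, num_cols], 2, mean)

print(dfrm)
print(col_means)
```

For plain row sums, rowSums(dfrm[, num_cols]) gives the same result with less typing; apply is more useful when the per-row function is something custom.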
