
e-PGPathshala
Subject: Computer Science
Paper: Data Analytics
Module No 39: CS/DA/39 - Graphics and Data Visualization in R
Quadrant 1 – e-text

1.1 Introduction

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. R provides a wide variety of statistical and graphical techniques, and is highly extensible. This chapter gives an overview of the graphics and visualization support in R.

1.2 Learning Outcomes

• To understand R's support for graphics and visualization
• To learn various R commands for graphics and visualization of data

1.3 About R

R is a programming language and software environment for statistical analysis, graphical representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.

The primary R system is available from CRAN, the Comprehensive R Archive Network. CRAN also hosts many add-on packages that extend the functionality of R. There are over 4000 packages on CRAN, developed by users and programmers around the world.

R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems such as Linux, Windows and Mac. The language was named R partly after the first letter of the first names of its two authors (Robert Gentleman and Ross Ihaka), and partly as a play on the name of the Bell Labs language S.

The current R is the result of a collaborative effort with contributions from all over the world. It is highly extensible and flexible. R is an interpreted language; users typically access it through a command-line interpreter.

1.4 Graphics in R

R is a language and environment for statistical computing and graphics. One of the main reasons data analysts turn to R is its strong graphics capabilities. In R, graphs are typically created interactively.
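As a rough illustration of interactive creation, the sketch below draws a plot and then refines it with further commands, one at a time. It assumes the built-in mtcars data set, which is not used elsewhere in this module; the variable name fit and the title text are illustrative only.

```r
# Step 1: draw a basic scatter plot of weight against fuel economy.
plot(mtcars$wt, mtcars$mpg)

# Step 2: fit a straight line and add it to the existing plot.
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit)

# Step 3: add a title to the same plot.
title("Regression of MPG on Weight")
```

Each command modifies the plot that is already on the screen, so the graph can be built up and inspected step by step.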
A large number of R packages for graphics are available. The plot types they cover include density plots (histograms and kernel density plots), dot plots, bar charts (simple, stacked, grouped), line charts, pie charts (simple, annotated, 3D), boxplots (simple, notched, violin plots, bagplots) and scatterplots (simple, with fit lines, scatterplot matrices, high density plots, and 3D plots). R graphs can be saved in various formats such as pdf, png, jpg, wmf, tiff and ps.

1.5 Graphical Environments in R

An environment is just a place to store variables: a set of bindings between symbols and objects. An environment can be thought of as consisting of two things: a frame, which is a set of symbol-value pairs, and an enclosure, which is a pointer to an enclosing environment. When R looks up the value for a symbol, the frame is examined, and if a matching symbol is found its value is returned. If not, the enclosing environment is accessed and the process is repeated.

There are two graphical environments for R: low-level and high-level graphics. Here we discuss base graphics and grid graphics under low-level graphics, and lattice and ggplot2 under high-level graphics.

1.6.1 Base graphics

One of the most powerful features of R is its ability to produce a wide range of graphics to visualise data quickly and easily. Plots can be replicated, modified and even made publication-ready with just a handful of commands. Base graphics are the most commonly used and are a very powerful system for creating 2-D graphics.

Making the leap from chiefly graphical programmes, such as Excel and SigmaPlot, may seem tricky. However, with a basic knowledge of R, investing just a few hours could completely change your data visualisation and workflow.

Data types

R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.
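Before going through each type, the following minimal sketch shows how R itself reports these types. All object names (s, v, m, df, l) are made up for this illustration.

```r
# Create one small object of each basic type.
s  <- 5.3                                        # a "scalar" is a numeric vector of length 1
v  <- c("one", "two", "three")                   # character vector
m  <- matrix(1:6, nrow = 2)                      # 2 x 3 matrix of integers
df <- data.frame(id = 1:2, ok = c(TRUE, FALSE))  # data frame with columns of different modes
l  <- list(name = "Fred", values = v)            # list holding unrelated objects

# Ask R how it classifies each object.
is.vector(s)
length(s)
class(v)
is.matrix(m)
dim(m)
class(df)
class(l)
```

Note that R has no separate scalar type: a single number is simply a vector of length one.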
Vectors

    a <- c(1,2,5.3,6,-2,4)                      # numeric vector
    b <- c("one","two","three")                 # character vector
    c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)     # logical vector

Matrices

All columns in a matrix must have the same mode (numeric, character, etc.) and the same length. The general format is

    mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
                       dimnames=list(char_vector_rownames, char_vector_colnames))

byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the matrix should be filled by columns (the default). dimnames provides optional labels for the rows and columns.

    # generates 5 x 4 numeric matrix
    y <- matrix(1:20, nrow=5, ncol=4)

    # another example
    cells <- c(1,26,24,68)
    rnames <- c("R1", "R2")
    cnames <- c("C1", "C2")
    mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
                       dimnames=list(rnames, cnames))

Arrays

Arrays are similar to matrices but can have more than two dimensions. See help(array) for details.

Data Frames

A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.). This is similar to SAS and SPSS datasets.

    d <- c(1,2,3,4)
    e <- c("red", "white", "red", NA)
    f <- c(TRUE,TRUE,TRUE,FALSE)
    mydata <- data.frame(d,e,f)
    names(mydata) <- c("ID","Color","Passed")   # variable names

There are a variety of ways to identify the elements of a data frame.

Lists

A list is an ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name.

    # example of a list with 4 components -
    # a string, a numeric vector, a matrix, and a scalar
    w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)

    # example of a list containing two lists
    # (list1 and list2 are assumed to be lists that already exist)
    v <- c(list1,list2)

Factors

Tell R that a variable is nominal by making it a factor. The factor stores the nominal values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.

    # variable gender with 20 "male" entries and
    # 30 "female" entries
    gender <- c(rep("male",20), rep("female", 30))
    gender <- factor(gender)
    # stores gender as 20 2s and 30 1s and associates
    # 1=female, 2=male internally (alphabetically)
    # R now treats gender as a nominal variable
    summary(gender)

• Plotting functions

Even for beginners, it's pretty easy to get a basic plot out and learn something about a data set. The trouble starts, however, when you want to show the plot to someone else. That is not the documentation's "fault": the documentation is all there. Typing help(package="graphics") at the command line will put almost everything you need to know about base graphics within your reach. However, I think that until a person gets used to thinking of R as a system of interacting functions, it takes a bit of ingenuity for a beginner to figure out how to accomplish a basic "functional" task like producing a plot that you can show around. So, in the spirit of trying to make things a little easier, I have collected a few resources that might be helpful in mastering base graphics.

x-y plotting

The most used plotting function in R is the plot() function. It is a generic function, meaning that it has many methods, which are called according to the type of object passed to plot(). In the simplest case, we can pass in a single vector and we will get a scatter plot of magnitude vs index. More generally, we pass in two vectors and a scatter plot of the corresponding points is drawn. For example, the command plot(c(1,2),c(3,5)) would plot the points (1,3) and (2,5).

Here is a more concrete example, where we plot the sine function over the range from -pi to pi.
    x <- seq(-pi,pi,0.1)
    plot(x, sin(x))

Figure 1 : Sine function plot

Bar plots

A bar chart represents data as rectangular bars, with the length of each bar proportional to the value of the variable. R uses the function barplot() to create bar charts. R can draw both vertical and horizontal bars, and each bar can be given a different color. The basic syntax is:

    barplot(H, xlab, ylab, main, names.arg, col)

where:

• H is a vector or matrix containing the numeric values used in the bar chart.
• xlab is the label for the x axis.
• ylab is the label for the y axis.
• main is the title of the bar chart.
• names.arg is a vector of names appearing under each bar.
• col is used to give colors to the bars in the graph.

Example

A simple bar chart can be created using just the input vector; names for the bars are optional. The script below creates the bar chart and saves it in the current R working directory.

    # Create the data for the chart.
    H <- c(7,12,28,3,41)

    # Give the chart file a name.
    png(file = "barchart.png")

    # Plot the bar chart.
    barplot(H)

    # Save the file.
    dev.off()

When we execute the above code, it produces the following result:

Figure 2 : Bar plot

Box plot

The box plot of an observation variable is a graphical representation based on its quartiles, as well as its smallest and largest values. It attempts to give a visual impression of the shape of the data distribution.

Problem

Find the box plot of the eruption duration in the data set faithful.

Solution

We apply the boxplot function to produce the box plot of eruptions.

    duration = faithful$eruptions         # the eruption durations
    boxplot(duration, horizontal=TRUE)    # horizontal box plot

Answer

The box plot of the eruption duration is:

Figure 3 : Box plot

Histogram

A histogram consists of parallel vertical bars that graphically show the frequency distribution of a quantitative variable.
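A minimal sketch of a histogram, assuming the same built-in faithful data set used for the box plot above and base R's hist() function. The title and axis label are illustrative choices, not taken from the original text.

```r
# Eruption durations (in minutes) from the built-in faithful data set.
duration <- faithful$eruptions

# Draw the histogram; hist() also returns the bin breaks and counts.
h <- hist(duration,
          main = "Eruption duration",
          xlab = "Duration (minutes)")

# Number of observations falling in each bin.
h$counts
```

Because hist() returns the computed bins as well as drawing them, the counts can be checked against the plot: they always add up to the number of observations.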