
e-PGPathshala Subject : Computer Science Paper: Analytics Module No 39: CS/DA/39 - Graphics and Visualization in R

Quadrant 1 – e-text

1.1 Introduction

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. R provides a wide variety of statistical and graphical techniques, and is highly extensible. This chapter gives an overview of graphics and visualization support of R.

1.2 Learning Outcomes

• To understand R's support for graphics and visualization

• To learn various R commands for graphics and visualization of data

1.3 About R

R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.

The primary R system is available from CRAN, the Comprehensive R Archive Network. CRAN also hosts many add-on packages that extend the functionality of R. There are over 4000 packages on CRAN, developed by users and programmers around the world.

R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems like Linux, Windows and Mac.

The language was named R partly after the first letter of the first names of its two authors (Robert Gentleman and Ross Ihaka) and partly as a play on the name of the Bell Labs language S.

The current R is the result of a collaborative effort with contributions from all over the world. It is highly extensible and flexible. R is an interpreted language; users typically access it through a command-line interpreter.

1.4 Graphics in R

R is a language and environment for statistical computing and graphics. One of the main reasons data analysts turn to R is its strong graphics capabilities. In R, graphs are typically created interactively, and a large number of R packages for graphics are available. Supported graph types include density plots (including kernel density plots), dot plots, bar charts (simple, stacked, grouped), line charts, pie charts (simple, annotated, 3D), boxplots (simple, notched, violin plots, bagplots) and scatterplots (simple, with fit lines, scatterplot matrices, high density plots, and 3D plots). R graphs can be saved in various formats such as pdf, png, jpg, wmf, tiff and ps.

1.5 Graphical Environments in R

An environment is just a place to store variables: a set of bindings between symbols and objects. An environment consists of two things: a frame, which is a set of symbol-value pairs, and an enclosure, which is a pointer to an enclosing environment. When R looks up the value of a symbol, the frame is examined first, and if the symbol is found its value is returned. If not, the enclosing environment is accessed and the process is repeated. R has two kinds of graphical environments: low-level graphics and high-level graphics.

Here we discuss two low-level systems, base graphics and grid graphics, and two high-level systems, lattice and ggplot2.

1.6.1 Base graphics

One of the most powerful features of R is its ability to produce a wide range of graphics to quickly and easily visualise data. Plots can be replicated, modified and even made publishable with just a handful of commands. Base graphics are the most commonly used system and are very powerful for creating 2-D graphics. Making the leap from chiefly graphical programmes, such as Excel and SigmaPlot, may seem tricky. However, with a basic knowledge of R, investing just a few hours can considerably improve data visualisation and workflow.

 Data types

R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.

Vectors

a <- c(1,2,5.3,6,-2,4) # numeric vector

b <- c("one","two","three") # character vector

c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) # logical vector

Matrices

All columns in a matrix must have the same mode (numeric, character, etc.) and the same length. The general format is

mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE, dimnames=list(char_vector_rownames, char_vector_colnames))

byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the matrix should be filled by columns (the default). dimnames provides optional labels for the rows and columns.

# generates 5 x 4 numeric matrix

y<-matrix(1:20, nrow=5,ncol=4)

# another example

cells <- c(1,26,24,68)

rnames <- c("R1", "R2")

cnames <- c("C1", "C2")

mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,

dimnames=list(rnames, cnames))
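Rows, columns and individual elements of a matrix are selected with subscripts. The following is a minimal sketch that rebuilds the 5 x 4 matrix y from the example above so that it runs on its own:

```r
# Rebuild the 5 x 4 matrix from the example above (filled by columns)
y <- matrix(1:20, nrow = 5, ncol = 4)

col4  <- y[, 4]        # the 4th column, returned as a vector
row3  <- y[3, ]        # the 3rd row, returned as a vector
block <- y[2:4, 1:3]   # rows 2 to 4 of columns 1 to 3, still a matrix

print(col4)
print(row3)
print(block)
```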

Arrays

Arrays are similar to matrices but can have more than two dimensions. See help(array) for details.
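As a small illustration of an array with more than two dimensions, the sketch below builds a 2 x 3 x 4 array. The object name arr and the chosen dimensions are only examples:

```r
# A three-dimensional array: 2 rows, 3 columns, 4 layers
arr <- array(1:24, dim = c(2, 3, 4))

dim(arr)                # the three dimensions
one   <- arr[1, 2, 3]   # element in row 1, column 2, layer 3
layer <- arr[, , 1]     # the first layer, which is an ordinary 2 x 3 matrix

print(one)
print(layer)
```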

Data Frames

A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.). This is similar to SAS and SPSS datasets.

d <- c(1,2,3,4)

e <- c("red", "white", "red", NA)

f <- c(TRUE,TRUE,TRUE,FALSE)

mydata <- data.frame(d,e,f)

names(mydata) <- c("ID","Color","Passed") # variable names

There are a variety of ways to identify the elements of a data frame, for example by column position, by column name, or with the $ operator.
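Some common ways of identifying the elements of a data frame can be sketched as follows. The data frame mydata is rebuilt here from the example above so that the code is self-contained:

```r
# Rebuild the example data frame
d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
mydata <- data.frame(d, e, f)
names(mydata) <- c("ID", "Color", "Passed")

by_pos  <- mydata[2:3]                 # columns 2 and 3, by position
by_name <- mydata[c("ID", "Passed")]   # columns selected by name
colors  <- mydata$Color                # a single column as a vector
passed  <- mydata[mydata$Passed, ]     # only the rows where Passed is TRUE

print(passed)
```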

Lists

An ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name.

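# example of a list with 4 components -

# a string, a numeric vector, a matrix, and a scalar

w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)

# example of combining two lists into one

# (list1 and list2 are illustrative lists, defined here so the line runs)

list1 <- list(name="Fred", age=5.3)

list2 <- list(mynumbers=a)

v <- c(list1,list2)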

Factors

Tell R that a variable is nominal by making it a factor. The factor stores the nominal values as a vector of integers in the range [1 ... k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.

# variable gender with 20 "male" entries and 30 "female" entries

gender <- c(rep("male",20), rep("female", 30))

gender <- factor(gender)

# stores gender as 20 2s and 30 1s and associates

# 1=female, 2=male internally (alphabetically)

# R now treats gender as a nominal variable

summary(gender)

• Plotting functions:

Even for beginners, it is fairly easy to produce a basic plot and learn something about a data set. The trouble starts when you want to show the plot to someone else. That is not the fault of the documentation, which is all there: typing help(package="graphics") at the command line puts almost everything you need to know about base graphics within reach. However, until a person gets used to thinking of R as a system of interacting functions, it takes some ingenuity to accomplish a basic task such as producing a plot that can be shown to others. The most commonly used base plotting functions are therefore described below.

x-y plotting:

The most used plotting function in R programming is the plot() function. It is a generic function, meaning that it has many methods which are called according to the type of object passed to plot(). In the simplest case, we can pass in a vector and we will get a scatter plot of magnitude vs index. More generally, we pass in two vectors and a scatter plot of these points is plotted. For example, the command plot(c(1,2),c(3,5)) would plot the points (1,3) and (2,5).

Here is a more concrete example where we plot a sine function over the range -pi to pi.

x <- seq(-pi,pi,0.1)

plot(x, sin(x))

Figure 1 : sine function plot

Bar plots

A bar chart represents data as rectangular bars, with the length of each bar proportional to the value of the variable. R uses the function barplot() to create bar charts. R can draw both vertical and horizontal bars, and each bar can be given a different color. The basic syntax is

barplot(H, xlab, ylab, main, names.arg, col)

where:

• H is a vector or matrix containing numeric values used in the bar chart.

• xlab is the label for the x axis.

• ylab is the label for the y axis.

• main is the title of the bar chart.

• names.arg is a vector of names appearing under each bar.

• col is used to give colors to the bars in the graph.
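The parameters listed above can be combined in a single call. The sketch below is only an illustration: the month names, axis labels and title are hypothetical values chosen for the example, not data from this module.

```r
# Values to plot and one label per bar (labels are hypothetical)
H <- c(7, 12, 28, 3, 41)
M <- c("Mar", "Apr", "May", "Jun", "Jul")

# barplot() draws the chart and returns the x positions of the bar midpoints
mids <- barplot(H,
                names.arg = M,          # names under each bar
                xlab = "Month",         # x axis label
                ylab = "Revenue",       # y axis label
                col = "blue",           # bar colour
                main = "Revenue chart") # chart title

print(mids)
```

Adding horiz = TRUE to the call draws the same bars horizontally.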

Example

A simple bar chart is created using just the input vector.

The script below creates the bar chart and saves it in the current R working directory.

# Create the data for the chart.

H <- c(7,12,28,3,41)

# Give the chart file a name.

png(file = "barchart.png")

# Plot the bar chart.

barplot(H)

# Save the file.

dev.off()

When we execute the above code, it produces the following result

Figure 2 : Bar plot

Box plot

The box plot of an observation variable is a graphical representation based on its quartiles, as well as its smallest and largest values. It attempts to provide a visual shape of the data distribution.

Problem

Find the box plot of the eruption duration in the data set faithful.

Solution

We apply the boxplot function to produce the box plot of eruptions.

> duration = faithful$eruptions # the eruption durations

> boxplot(duration, horizontal=TRUE) # horizontal box plot

Answer

The box plot of the eruption duration is:

Figure 3 : Box plot
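The quantities a box plot is built from can also be inspected numerically. The following sketch uses fivenum() and the value returned by boxplot() when plotting is switched off. The variable names fn and b are only illustrative.

```r
duration <- faithful$eruptions

# Minimum, lower hinge, median, upper hinge and maximum of the data
fn <- fivenum(duration)
print(fn)

# With plot = FALSE, boxplot() only returns the statistics it would draw
b <- boxplot(duration, plot = FALSE)
print(b$stats)   # whisker ends, box edges and median, as a 5 x 1 matrix
print(b$out)     # points that would be drawn individually as outliers
```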

Histogram

A histogram consists of parallel vertical bars that graphically show the distribution of a quantitative variable. The area of each bar is equal to the frequency of items found in each class.

Example

In the data set faithful, the histogram of the eruptions variable is a collection of parallel vertical bars showing the number of eruptions classified according to their durations.

Problem

Find the histogram of the eruption durations in faithful.

Solution

We apply the hist function to produce the histogram of the eruptions variable.

> duration = faithful$eruptions

> hist(duration, # apply the hist function

+ right=FALSE) # intervals closed on the left

Answer

The histogram of the eruption durations is:

Figure 4 : Histogram
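The classes and frequencies behind a histogram can be examined without drawing it. The sketch below, with the illustrative name h, shows the class boundaries, the count in each class, and that the bar areas on the density scale add up to one:

```r
duration <- faithful$eruptions

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(duration, right = FALSE, plot = FALSE)

print(h$breaks)   # class boundaries
print(h$counts)   # number of eruptions falling in each class

# On the density scale, area = density * class width, and the areas sum to 1
areas <- h$density * diff(h$breaks)
print(sum(areas))
```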

Pie chart

A pie chart of a qualitative data sample consists of pizza wedges that show the frequency distribution graphically.

Example

In the data set painters, the pie chart of the School variable is a collection of pizza wedges showing the proportion of painters in each school.

Problem

Find the pie chart of the painter schools in the data set painters.

Solution

We first apply the table function to produce the frequency distribution of School.

> library(MASS) # load the MASS package

> school = painters$School # the painter schools

> school.freq = table(school) # apply the table function

Then we apply the pie function to produce its pie chart.

> pie(school.freq) # apply the pie function

Answer

The pie chart of the school variable is:

Figure 5 : Pie chart

Scatter plot

Scatterplots show many points plotted in the Cartesian plane. Each point represents the values of two variables. One variable is chosen in the horizontal axis and another in the vertical axis.

The simple scatterplot is created using the plot() function.

plot(x, y, main, xlab, ylab, xlim, ylim, axes)

Following is the description of the parameters used:

• x is the data set whose values are the horizontal coordinates.

• y is the data set whose values are the vertical coordinates.

• main is the title of the graph.

• xlab is the label on the horizontal axis.

• ylab is the label on the vertical axis.

• xlim gives the limits of the values of x used for plotting.

• ylim gives the limits of the values of y used for plotting.

• axes indicates whether both axes should be drawn on the plot.

The script below creates a scatterplot of the relation between wt (weight) and mpg (miles per gallon).

# Get the input values.

input <- mtcars[,c('wt','mpg')]

# Give the chart file a name.

png(file = "scatterplot.png")

# Plot the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.

plot(x = input$wt, y = input$mpg,

xlab = "Weight",

ylab = "Mileage",

xlim = c(2.5,5),

ylim = c(15,30),

main = "Weight vs Mileage"

)

# Save the file.

dev.off()

When we execute the above code, it produces the following result:

Figure 6 : Examples of scatter plots: (a) scatter plot, (b) scatter plot pair, (c) scatter plot with labels
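Panels (b) and (c) of the figure are not accompanied by code. One possible way to draw plots of these two kinds is sketched below. The choice of the mtcars variables and of the row names as labels is an assumption made for illustration and may differ from what the figure shows.

```r
# Variables used for the scatter plot matrix (an illustrative choice)
vars <- c("wt", "mpg", "disp", "cyl")

# (b) Scatter plot pairs: every variable plotted against every other one
pairs(mtcars[, vars], main = "Scatterplot matrix")

# (c) Scatter plot with a text label next to each point
labs <- rownames(mtcars)   # car names used as point labels
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight", ylab = "Mileage", main = "Weight vs Mileage")
text(mtcars$wt, mtcars$mpg, labels = labs, pos = 4, cex = 0.6)
```

Here pos = 4 places each label to the right of its point, and cex shrinks the text so that neighbouring labels overlap less.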

1.6.2 Grid Graphics Environment

R is not only a powerful platform for generating statistical plots; it is also a general-purpose graphics system. Beyond a wide variety of standard plots, R can produce an essentially unlimited variety of graphical images. This section introduces the grid graphics package, a low-level graphics system that provides full access to the graphics facilities in R. Knowledge of this system allows you to make arbitrary modifications to plots produced by packages such as lattice and ggplot2, and to produce a range of data visualizations that is limited only by your imagination.

grid is a low-level graphics system which provides a great deal of control and flexibility in the appearance and arrangement of graphical output. grid does not provide high-level functions which create complete plots. Three concepts are basic to the grid package: viewports, units, and graphical parameters.

In grid there can be any number of graphics regions. A graphics region is referred to as a viewport and is created using the viewport() function. A viewport can be positioned anywhere on a graphics device (page, window, ...), it can be rotated, and drawing can be clipped to it.

> viewport(x = 0.5, y = 0.5, width = 0.5, height = 0.25, angle=45)

The selection of which coordinate system to use within the current viewport is made using the unit() function. The unit() function creates an object which is a combination of a value and a coordinate system (plus some extra information for certain coordinate systems).

grid provides an alternative method for positioning viewports within each other based on layouts. A layout may be specified for any viewport. Any viewport pushed immediately after a viewport containing a layout may specify its location with respect to that layout. For example, a viewport can be pushed with a layout of 4 rows and 5 columns, and then another viewport can be pushed which occupies the second and third columns of the third row of the layout.

grid provides a standard set of graphical primitives: lines, text, points, rectangles, polygons, and circles. There are also two higher level components: x- and y-axes.
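The layout example described above (a 4 x 5 layout with a child viewport in the second and third columns of the third row) can be sketched as follows. The colours, the text label and the variables w_parent and w_child are illustrative additions used to compare the two viewport widths.

```r
library(grid)

grid.newpage()

# Parent viewport carrying a layout of 4 rows and 5 columns
pushViewport(viewport(layout = grid.layout(nrow = 4, ncol = 5)))
grid.rect(gp = gpar(col = "grey"))
w_parent <- convertWidth(unit(1, "npc"), "inches", valueOnly = TRUE)

# Child viewport occupying columns 2 and 3 of row 3 of that layout
pushViewport(viewport(layout.pos.row = 3, layout.pos.col = 2:3))
grid.rect(gp = gpar(fill = "lightblue"))
grid.text("row 3, columns 2-3")
w_child <- convertWidth(unit(1, "npc"), "inches", valueOnly = TRUE)

# Return to the top level by popping both viewports
popViewport(2)

# Fraction of the parent's width covered by the child viewport
print(w_child / w_parent)
```

Because the default layout columns all have equal width, the child spans two of the five columns of its parent.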

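grid.text - Text; can specify angle of rotation.

grid.rect - Rectangles.

grid.circle - Circles.

grid.polygon - Polygons.

grid.points - Points; can specify type of plotting symbol.

grid.lines - Lines through a series of points.

grid.segments - Separate line segments.

grid.grill - Convenience function for drawing grid lines.

grid.move.to, grid.line.to - Set the current location and draw a line from it to a new location.

grid.xaxis - Top or bottom axis.

grid.yaxis - Left or right axis.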

grid graphical primitives
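A few of these primitives, together with unit() objects, can be combined as in the sketch below. The positions, colours and the name label_x are arbitrary choices made for illustration.

```r
library(grid)

grid.newpage()

# A viewport covering the central 80% of the page
pushViewport(viewport(x = 0.5, y = 0.5, width = 0.8, height = 0.8))

grid.rect(gp = gpar(lty = "dashed"))                       # viewport border
grid.circle(x = 0.5, y = 0.5, r = 0.2, gp = gpar(fill = "grey90"))
grid.lines(x = c(0.1, 0.9), y = c(0.1, 0.9), gp = gpar(col = "red"))
grid.points(x = unit(c(0.25, 0.75), "npc"),
            y = unit(c(0.75, 0.25), "npc"), pch = 16)      # pch picks the symbol

# Units from different coordinate systems can be combined:
# two lines of text in from the right-hand edge of the viewport
label_x <- unit(1, "npc") - unit(2, "lines")
grid.text("rotated text", x = label_x, y = unit(0.5, "npc"), rot = 90)

# Axes on the default 0-1 scale of the viewport
grid.xaxis(at = c(0, 0.5, 1))
grid.yaxis(at = c(0, 0.5, 1))

popViewport()
```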

1.6.3 Lattice

The lattice add-on package is an implementation of Trellis graphics for R. It is a powerful and elegant high-level data visualization system with an emphasis on multivariate data. It is designed to meet most typical graphics needs with minimal tuning, but can also be easily extended to handle most nonstandard requirements.

The lattice user interface primarily consists of several ‘high-level’ generic functions (for example xyplot(), bwplot(), histogram() and densityplot()), each designed to create a particular type of display by default. Although the functions produce different output, they share many common features, reflected in several common arguments that affect the resulting displays in similar ways. These arguments are extensively (sometimes only) documented in the help page for xyplot, which also includes a discussion of the important topics of conditioning and control of the Trellis layout. Features specific to other high-level functions are documented in their respective help pages.

The lattice package is based on the grid graphics engine and requires the grid add-on package. One consequence of this is that it is not (readily) compatible with traditional R graphics tools. In particular, changing par() settings usually has no effect on lattice plots; lattice provides its own interface for querying and modifying an extensive set of graphical and non-graphical settings.
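The lattice settings interface mentioned above can be sketched with trellis.par.get() and trellis.par.set(). The setting changed here (plot.symbol) and the chosen values are only an example:

```r
library(lattice)

# Query the current symbol settings (a list with elements such as pch and col)
old <- trellis.par.get("plot.symbol")

# Change them: filled circles in dark green
trellis.par.set(plot.symbol = list(pch = 16, col = "darkgreen"))
new <- trellis.par.get("plot.symbol")

# Plots drawn from now on pick up the modified settings
p <- xyplot(mpg ~ wt, data = mtcars)
print(p)

# Restore the previous settings
trellis.par.set(plot.symbol = old)
```

Calling show.settings() displays the settings that are currently in effect for the active device.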

Lattice plot example

> library(lattice)

> p1 <- xyplot(1:8 ~ 1:8 | rep(LETTERS[1:4], each=2), as.table=TRUE)

> plot(p1)

Figure 6 : Lattice plot

Lattice plot: Density plot

> data(Chem97, package = "mlmRev")

> densityplot(~ gcsescore | factor(score),data= Chem97, groups = gender,

+ plot.points = FALSE, auto.key = TRUE)

Figure 7 : Density plot

1.6.4 Graphics with ggplot2

The ggplot2 package, created by Hadley Wickham, offers a powerful graphics language for creating elegant and complex plots. Its popularity in the R community has grown rapidly in recent years. Originally based on Leland Wilkinson's The Grammar of Graphics, ggplot2 allows you to create graphs that represent both univariate and multivariate numerical and categorical data in a straightforward manner. Grouping can be represented by color, symbol, size, and transparency. The creation of trellis plots is relatively simple. Mastering the ggplot2 language can be challenging, so there is a helper function called qplot() (for quick plot) that hides much of this complexity when creating standard graphs.

qplot()

The qplot() function can be used to create the most common graph types. While it does not expose ggplot's full power, it can create a very wide range of useful plots. The format is:

qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=,

formula=, facets=, xlim=, ylim=, xlab=, ylab=, main=, ...)

Figure 8 : ggplot
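As a hedged illustration of the ideas above (grouping by color and trellis-style panels), the sketch below uses the full ggplot() interface rather than qplot(), because recent ggplot2 releases mark qplot() as deprecated. The choice of the mtcars variables and labels is an assumption made for the example, and the code only runs if the ggplot2 add-on package is installed.

```r
# ggplot2 is an add-on package, so check that it is available first
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
    geom_point() +                 # scatter plot layer
    facet_wrap(~ gear) +           # one panel per number of gears
    labs(x = "Weight", y = "Mileage",
         colour = "Cylinders", title = "Weight vs Mileage")

  print(p)                         # draws the plot on the current device
}
```

Each part of the plot (data, aesthetic mapping, geometry, facets, labels) is added as a separate layer with +, which is the grammar-of-graphics style that qplot() hides.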

1.7 Saving Graphics to Files

We often want to save a plot to a file in R and then import this graphics file into another document. Much of the time, however, we may simply want to use R graphics interactively to explore our data.

To save a plot to an image file, you have to do three things in sequence:

1. Open a graphics device. The default graphics device in R is your computer screen. To save a plot to an image file, you need to tell R to open a new type of device, in this case a graphics file of a specific type, such as PNG, PDF, or JPEG. The R function to create a PNG device is png(). Similarly, you create a PDF device with pdf() and a JPEG device with jpeg().

2. Create the plot.

3. Close the graphics device. You do this with the dev.off() function.

Put this in action by saving a plot of faithful to the home folder on your computer. First set your working directory to your home folder (or to any other folder you prefer). If you use Linux, you’ll be familiar with using “~/” as the shortcut to your home folder, but this also works on Windows and Mac:

> setwd("~/")

> getwd()

[1] "C:/Users/Andrie"

Next, write the three lines of code to save a plot to file:

> png(filename="faithful.png")

> plot(faithful)

> dev.off()

Now you can check your file system to see whether the file faithful.png exists. The result is a graphics file of type PNG that you can insert into a presentation, document, or website.

• After the pdf() command, all graphs are redirected to the file test.pdf until dev.off() is called. This works similarly for all common formats: jpeg, png, ps, tiff, ...

> pdf("test.pdf")

> plot(1:10, 1:10)

> dev.off()

• Scalable Vector Graphics (SVG) files, which can be edited in vector graphics programs such as Inkscape, are generated with svg():

> svg("test.svg")

> plot(1:10, 1:10)

> dev.off()

Summary

• Data visualization for social networks is complex when compared with other domains.

• A variety of successful tools and techniques have been adopted in the history of social networks to effectively visualize