An Introduction to the SAS System

Dileep K. Panda
Directorate of Water Management, Bhubaneswar-751023
[email protected]

Introduction

SAS – originally an acronym for Statistical Analysis System – is one of the most widely used statistical packages in academia and industry. The SAS software was developed in the late 1960s at North Carolina State University, and the SAS Institute was formed in 1976. The SAS System is a collection of products, available from the SAS Institute in North Carolina. SAS software is a combination of a statistical package, a database management system, and a high-level programming language. SAS is an integrated system of software solutions that performs the following tasks:

• Data entry, retrieval, and management
• Report writing and graphics design
• Statistical and mathematical analysis
• Business forecasting and decision support
• Operations research and project management
• Applications development

At the core of the SAS System is the Base SAS software. The Base SAS software includes a fourth-generation programming language and ready-to-use programs called procedures. These integrated procedures handle data manipulation, information storage and retrieval, statistical analysis, and report writing. Additional components offer capabilities for data entry, retrieval, and management; report writing and graphics; statistical and mathematical analysis; business planning, forecasting, and decision support; operations research and project management; quality improvement; and applications development. In general, the Base SAS software provides:

• A data management facility
• A programming language
• Data analysis and reporting utilities

Learning to use Base SAS enables you to work with these features of SAS. It also prepares you to learn other SAS products, because all SAS products follow the same basic rules. The SAS products include

• SAS Foundation 9.2
• SAS Enterprise Guide 4.2
• SAS OLAP Cube Studio 4.2
• SAS Information Map Studio 4.2
• SAS Enterprise Miner 6.1
• SAS IML Studio
• SAS BI Server
• The JMP software (not part of Base SAS)

The majority of the enhancements included in SAS 9.2 are a direct response to requests from users. SAS 9.2 provides more analytical and graphical capabilities through its component products, such as:

• Base SAS – data management and basic procedures
• SAS/STAT – statistical analysis
• SAS/GRAPH – presentation-quality graphics
• SAS/Genetics – genetics data analysis
• SAS/OR – operations research
• SAS/ETS – econometrics and time series analysis
• SAS/IML – Interactive Matrix Language
• SAS/SQL – Structured Query Language
• Other specialized products for access to databases and for connectivity between different machines running SAS

With the enhancements to SAS Data Integration Studio, you can create and debug jobs more efficiently. In the SAS Business Intelligence applications, all of the user interfaces have been updated to increase productivity and usability. SAS 9.2 includes significant performance improvements through better use of the underlying hardware and software. The improvements include more widespread use of 64-bit computing, as well as an increased ability to utilize and manage grids. Support for multi-threaded processing has been extended across the platform, from SAS procedures to SAS servers to grid enablement. In addition, new in-database features substantially improve performance by moving analytic tasks closer to your data.

The SAS Windowing Environment

The SAS windowing environment is composed of windows that enable you to accomplish specific tasks:

Program Editor enables you to enter, edit, and submit SAS programs.

Enhanced Program Editor (available only in the Windows operating environment) also enables you to enter, edit, and submit SAS programs, and provides a number of useful editing features such as colour coding and syntax checking of the SAS language. The initial editor window title is "Editor - Untitled" until you open a file or save the contents of the editor to a file; the window title then changes to reflect that filename. When the content of the editor is modified, an asterisk is added to the title.

Log displays messages about your SAS session and any SAS programs you submit.

Output enables you to browse output from SAS programs that you submit. By default, the Output window is positioned behind the other windows; when you create output, the Output window automatically moves to the front of your display.

Results helps you navigate and manage output from the SAS programs that you submit. You can view, save, or print individual items of output.

The window on the left side of the display is the SAS Explorer window, which you can use to assign and locate SAS libraries, files, and other items.

The window at the top right is the Log window; it contains the SAS log for the session. The window at the bottom right is the Program Editor window. This window provides an editor in which you edit your SAS programs.

• Invoke the SAS System and include a SAS program into your session.
• Submit a program and browse the results.

• Navigate the SAS windowing environment.
• Use the Program Editor for submitting SAS programs (new or existing).
• Multiple windows can be opened with different SAS programs.
• The Log window gives information about the processing of SAS programs, including errors, warnings, time taken, etc.
• The Output window shows the output produced by SAS programs.

Data Management Facility

SAS organizes data into a rectangular form or table that is called a SAS data set. The following shows a SAS data set: the data describe participants in a 16-week weight program at a health and fitness club. The data for each participant include an identification number, name, team name, and weight (in U.S. pounds) at the beginning and end of the program. The rectangular form of a SAS data set is shown below.

In a SAS data set, each row represents information about an individual entity and is called an observation. Each column represents the same type of information and is called a variable. Each separate piece of information is a data value. In a SAS data set, an observation contains all the data values for an entity; a variable contains the same type of data value for all entities.

A SAS program is a sequence of steps that the user submits for execution. DATA steps are typically used to create SAS data sets. PROC steps are typically used to process SAS data sets (that is, generate reports and graphs, edit data, and sort data). To build a SAS data set with Base SAS, you write a program that uses statements in the SAS programming language. A SAS program that begins with a DATA statement and typically creates a SAS data set or a report is called a DATA step. The following SAS program creates a SAS data set named WEIGHT_CLUB from the health club data:

data weight_club;                                                (1)
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;  (2)
   Loss=StartWeight-EndWeight;                                   (3)
   datalines;                                                    (4)
1023 David Shaw          red    189 165                          (5)
1049 Amelia Serrano      yellow 145 124
1219 Alan Nance          red    210 192
1246 Ravi Sinha          yellow 194 177
1078 Ashley McKnight     red    127 118
;                                                                (6)
run;

The following list corresponds to the numbered items in the preceding program:

1. The DATA statement tells SAS to begin building a SAS data set named WEIGHT_CLUB.

2. The INPUT statement identifies the fields to be read from the input data and names the SAS variables to be created from them (IdNumber, Name, Team, StartWeight, and EndWeight).

3. The third statement is an assignment statement. It calculates the weight each person lost and assigns the result to a new variable, Loss.

4. The DATALINES statement indicates that data lines follow.

5. The data lines follow the DATALINES statement. This approach to processing raw data is useful when you have only a few lines of data. (Later sections show ways to access larger amounts of data that are stored in files.)

6. The semicolon signals the end of the raw data, and is a step boundary. It tells SAS that the preceding statements are ready for execution.

By default, the data set WEIGHT_CLUB is temporary; that is, it exists only for the current job or session.

Programming Language

Elements of the SAS Language

The statements that created the data set WEIGHT_CLUB are part of the SAS programming language. The SAS language contains statements, expressions, functions and CALL routines, options, formats, and informats - elements that many programming languages share. However, the way you use the elements of the SAS language depends on certain programming rules. The most important rules are listed in the next two sections.

Rules for SAS Statements

The conventions that are shown in the programs in this documentation, such as indenting of subordinate statements, extra spacing, and blank lines, are for the purpose of clarity and ease of use. They are not required by SAS. There are only a few rules for writing SAS statements:

• SAS statements end with a semicolon.
• You can enter SAS statements in lowercase, uppercase, or a mixture of the two.
• You can begin SAS statements in any column of a line and write several statements on the same line.
• You can begin a statement on one line and continue it on another line, but you cannot split a word between two lines.
• Words in SAS statements are separated by blanks or by special characters (such as the equal sign and the minus sign in the calculation of the Loss variable in the WEIGHT_CLUB example).

In summary, SAS statements:
• usually begin with a keyword (DATA, PROC, etc.)
• always end with a semicolon (;)
• are case insensitive
• use one or more blank spaces to separate words
• are free format – they can start and stop anywhere
• can span multiple lines
• can share a line with other statements

SAS names are used for SAS data set names, variable names, and other items. The rules for naming SAS data sets and variables are:

• A SAS name can contain from one to 32 characters.
• The first character must be a letter or an underscore (_).
• Subsequent characters must be letters, numbers, or underscores.
• Blanks cannot appear in SAS names.
• Names can be in uppercase, lowercase, or mixed case, and are not case sensitive.
• Spaces between words are not allowed, nor are other special characters such as #, /, and (.

The common mistakes encountered are:
• Missing semicolon (;)
• Missing close quote (' or ")
• Invalid options
• Misspelled keywords

The execution of a particular statement can be blocked (commented out) by putting the statement within /* ... */.

Special Rules for Variable Names

For variable names only, SAS remembers the combination of uppercase and lowercase letters that you use when you create the variable name. Internally, the case of letters does not matter. "CAT," "cat," and "Cat" all represent the same variable. But for presentation purposes, SAS remembers the initial case of each letter and uses it to represent the variable name when printing it.

Correct names: Area5, areaha, area_ha

Invalid names: Area (ha), area ha, area@ha, 5area, area#5, area-ha

A SAS program is constructed from two building blocks:
o DATA step
o PROC step

/* For example */

/* DATA step */
data popul;
   population = 353455256;
   area = 73613.88;
   density = population/area;
run;

/* PROC step */
proc print data=popul;
run;

DATA Statement

The DATA step creates the data sets that are used in a SAS program's analysis and reporting procedures. A typical SAS program starts with a DATA step to create a SAS data set and then passes it on to a PROC step for analysis. Understanding the basic structure, functioning, and components of the DATA step is fundamental to learning how to create your own SAS data sets: in particular, what a SAS data set is and why it is needed, how the DATA step works, and what information you have to supply to SAS so that it can construct a SAS data set for you. SAS enables you to solve problems by providing methods to analyze or process your data in some way. You first need to get the data into a form that SAS can recognize and process. After the data is in that form, you can analyze it and generate reports. The following figure shows this process in the simplest case.

From Raw Data to Final Analysis

You begin with raw data, that is, a collection of data that has not yet been processed by SAS. You use a set of statements known as a DATA step to get your data into a SAS data set. Then you can further process your data with additional DATA step programming or with SAS procedures. In its simplest form, the process can be represented by three components: SAS takes raw data as input, runs it through the DATA step, and creates a SAS data set.

Reading data inline

Reading data inline means that the raw data is part of the SAS program: you enter the actual data lines inside the Program Editor. The following DATA step creates a data set named POPULATION with data for 7 states of eastern India. The input variables are state, total_population, rural, urban, geographical_area, net_cul, and cropping_intensity. The INPUT statement defines the variables to be read in each line of data. The DATALINES statement indicates to SAS that the DATA step statements are complete and that the next lines contain real data. Notice that the lines of data do not end in a semicolon.

Instead of the DATALINES statement you can also use the CARDS statement; the two are equivalent. CARDS stems from the days when reading data meant feeding punch cards.
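A minimal sketch of the DATA step just described, using the variable names listed above (the data lines are borrowed from the EXAMPLE section later in this chapter, shortened to two states for brevity):

data population;
   input state $ total_population rural urban geographical_area net_cul cropping_intensity;
   datalines;
Assam 26638407 23000000 3638407 7843.8 2751.18 145.3
Orissa 36804660 31283961 5520699 15571 6165 152
;
run;

proc print data=population;
run;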

The execution of a particular statement can be blocked by enclosing it within /* ... */. No output will emerge from the statement /* proc print data=population; run; */, and the colour of the statement changes from blue to green. When SAS encounters a data error, a note that describes the error is printed in the SAS log, and the input record being read is displayed in the SAS log (the contents of the input buffer). For example, changing the variable name total_population to total-population results in an error, as is evident from the log window below.

In order to create a SAS data set from a raw data file:

• Start a DATA step and name the SAS data set being created (DATA statement).
• Identify the location of the raw data file to read (INFILE statement).
• Describe how to read the data fields from the raw data file (INPUT statement).

1. DATA dataset-name;
2. INFILE 'file-specification';
3. INPUT variable-1 ... variable-n <@ | @@>;
4. RUN;

General form of the INPUT statement:

• names the SAS variables
• identifies the variables as character or numeric
• specifies the locations of the fields in the raw data, which can be given as column, formatted, list, or named input

The data set to be created has a two-level name – libref.dset, where libref is the library name and dset is the name of the data set to be created. If libref is not provided, the data set is created in the WORK library (WORK is a reserved libref that SAS automatically assigns to a temporary SAS data library), which is temporary. To create a permanent SAS data set, you must indicate a SAS data library other than WORK. Use a LIBNAME statement to assign a libref to a SAS data library on your operating environment's file system. The libref functions as a shorthand way of referring to a SAS data library. Here is the form of the LIBNAME statement:

LIBNAME libref 'your-data-library';

where libref is a shortcut name to where your SAS files are stored. The libref must be a valid SAS name: it must begin with a letter or an underscore, it can contain uppercase and lowercase letters, numbers, or underscores, and it has a maximum length of 8 characters. The 'your-data-library' must be the physical name of your SAS data library – the name that is recognized by the operating environment.
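As a concrete sketch (the folder C:\NAIP-Training is the one used in the import examples later in this chapter; substitute your own path):

libname naip 'C:\NAIP-Training';   /* assign the libref NAIP to the folder */

data naip.population;              /* two-level name: stored permanently in NAIP */
   set work.population;            /* assumes WORK.POPULATION was created earlier */
run;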

An example of the INFILE statement is:

infile 'c:\workshop\winsas\prog1\dfwlax.dat';

Creating a permanent data set

By default, all data sets created during a SAS session are temporary and are deleted when you close SAS. While this is not a real problem for small data sets that you can recreate on the fly in the program, it is inconvenient if you work with large data sets or if your data sets require a lot of manipulation before they are in the form required by a certain procedure. A permanent data set, on the other hand, is a data set that is not deleted when SAS is exited; it is available directly when a new SAS session is started. Using the menus, a permanent library named NAIP has been created.

File import through SAS Import Wizard

• Open the wizard by clicking on the File menu and selecting Import Data. The Import Wizard should appear as given below.
• Select the type of data you want to import.
• Browse for the file, open it, and then click Next.
• Choose the Library and Member (the member is the file, and you have to name it).
• The Import Wizard will then ask you if you would like to create a file containing the PROC IMPORT code. Use this only if you are going to repeatedly import the same data; it allows you to simply submit the PROC IMPORT step in the Program Editor without having to go through the Import Wizard again.
• Click on Finish.

Importing a text file from the Desktop

PROC IMPORT OUT= WORK.STATE1
            DATAFILE= "C:\Documents and Settings\DK PANDA\Desktop\State_report2.txt"
            DBMS=TAB REPLACE;
     GETNAMES=YES;
     DATAROW=2;
RUN;

proc print data=work.state1;
run;

Importing an Excel file from the NAIP-Training folder on the C drive:

PROC IMPORT OUT= WORK.rice
            DATAFILE= "C:\NAIP-Training\naip_rice.xls"
            DBMS=EXCEL REPLACE;
     RANGE="ana$";
     GETNAMES=YES;
     MIXED=YES;
     SCANTEXT=YES;
     USEDATE=YES;
     SCANTIME=YES;
RUN;

proc print data=rice;
run;

Creating a data set using a SAS program

You can also use data that is already stored in a SAS data set as input to a new data set. To read data from an existing SAS data set, you must specify the existing data set's name in one of these statements:
• SET statement
• MERGE statement
• MODIFY statement
• UPDATE statement

Further, the KEEP=, DROP=, and RENAME= options are used to select the variables for analysis. The operators available to create and modify variables are:

Arithmetic operators:
   **   exponentiation   (x**2)
   *    multiplication   (x*y)
   /    division         (x/y)
   +    addition         (x+y)
   -    subtraction      (x-y)

Comparison operators:  = (EQ), ^= (NE), > (GT), >= (GE), < (LT), <= (LE)

Logical operators:  & (AND), | (OR), ^ (NOT)

Some numerical functions:  mean(arglist), var(arglist), std(arglist), sum(arglist), min(arglist), max(arglist), abs(arg), mod(num,div), log(arg), exp(arg), sqrt(arg), int(arg), ceil(arg), floor(arg), rannor(seed), ranuni(seed), ranbin(seed,n,p), ranpoi(seed,m)
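A small sketch showing a few of these operators and functions in assignment statements (the data set and variable names here are illustrative only):

data demo;
   input x y;
   ratio   = x / y;             /* arithmetic operator */
   avg     = mean(x, y);        /* average of the two values */
   biggest = max(x, y);
   root    = sqrt(abs(x - y));  /* nested functions */
   flag    = (x > y);           /* comparison: 1 if true, 0 if false */
   datalines;
12 4
7 9
;
run;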

EXAMPLE

data population;
   input state $ population rural urban area net_cul crop_intensity;
   datalines;
Assam 26638407 23000000 3638407 7843.8 2751.18 145.3
Bihar 82878796 74251113 8627683 9359.56 5603.2 142
Chhattisgarh 20796000 17268061 3527939 13780 5480 115
East_UP 73299842 64486715 8813127 10402 6328.46 151.44
Jharkhand 32816380 26909400 5906980 7970 1807.9 117.6
Orissa 36804660 31283961 5520699 15571 6165 152
W_Bengal 80221171 75734690 4486481 8687.52 5427.6 177
;
run;

/* proc print data=population(drop=rural urban); run; */

data population1;
   set population (drop= net_cul crop_intensity);
   density=population/area;
   rural_p=rural/population;
proc print data=population1;
run;

/*****************************************************************/
data population2;
   set population (keep= state area net_cul crop_intensity);
   net_cul_p=net_cul/area;
proc print data=population2;
run;

/*****************************************************************/
data population3 (rename=(area=geographical_area));
   /* set population1 population2; */
   merge population1 population2;
run;
/* proc print data=population3; run; */
proc sort data=population3;
   by population;
run;
proc print data=population3;
run;

/*RICE experiment creating library NAIP*/

PROC IMPORT OUT= NAIP.RICE_exa
            DATAFILE= "C:\NAIP-Training\naip_rice.xls"
            DBMS=EXCEL REPLACE;
     RANGE="ana$";
     GETNAMES=YES;
     MIXED=NO;
     SCANTEXT=YES;
     USEDATE=YES;
     SCANTIME=YES;
RUN;

proc print data=naip.rice_exa;
   title 'RICE EXPERIMENT';
run;

/* SET statement, MERGE statement, MODIFY statement, UPDATE statement */
data naip.rice_exa1;
   set naip.rice_exa;
   /* if year=2002; if manure='fmt'; */
   if manure='FMT' then yield1=yield+2;
   else if manure='SMT' then yield1=yield;
   /* yield1=yield/ebt; manure eq "fmt" */
run;
proc print data=naip.rice_exa1;
run;

PROC statements

PROC MEANS produces descriptive statistics (mean, standard deviation, minimum, maximum, etc.) for numeric variables in a data set. PROC MEANS can be used for: describing continuous data where the average has meaning; describing means across groups; searching for possible outliers or incorrectly coded values; and performing a single-sample t-test.

The syntax of the PROC MEANS statement is:

PROC MEANS <options> <statistic-keyword(s)>;
   <statements>;

Statistical options that may be requested (the default statistics are N, MEAN, STD, MIN, and MAX):

N – number of observations
NMISS – number of missing observations
MEAN – arithmetic average
STD – standard deviation
MIN – minimum (smallest)
MAX – maximum (largest)
RANGE – range
SUM – sum of observations
VAR – variance
USS – uncorrected sum of squares
T – Student's t value for testing H0: mean = 0
MEDIAN – 50th percentile
P1 – 1st percentile
P5 – 5th percentile
P10 – 10th percentile
P90 – 90th percentile
P95 – 95th percentile
P99 – 99th percentile
Q1 – 1st quartile
Q3 – 3rd quartile
QRANGE – interquartile range

Other commonly used options available in PROC MEANS include:

• DATA=      specify the data set to use
• NOPRINT    do not print output
• MAXDEC=n   use n decimal places to print output

Commonly used statements with PROC MEANS include:

• BY variable-list – statistics are reported for groups in separate tables
• CLASS variable-list – statistics are reported by groups in a single table
• VAR variable-list – specifies which numeric variables to use
• OUTPUT OUT=dataset-name – statistics are output to a SAS data file
• FREQ variable – specifies a variable that represents a count of observations
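A minimal sketch contrasting the CLASS and BY statements (BY requires the data to be sorted first; the data set and variables are those of the aquifer example that follows):

proc means data=aquifer mean std maxdec=2;
   class block;          /* groups reported together in a single table */
   var Depth Discharge;
run;

proc sort data=aquifer;
   by block;
run;

proc means data=aquifer mean std maxdec=2;
   by block;             /* a separate table for each value of block */
   var Depth Discharge;
run;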

Uses of PROC MEANS, UNIVARIATE, PLOT and FREQ

data aquifer;
   input block $ Depth Discharge Thickness Transmissivity Storativity;
   datalines;
jaleswar 60.90 11.50 22.00 2.18 6.30
jaleswar 48.70 11.28 17.06 3.03 2.20
jaleswar 88.40 31.00 30.48 2.97 3.06
jaleswar 50.00 7.75 20.50 2.88 9.30
jaleswar 94.50 27.50 33.80 3.07 4.50
bhograi 42.00 12.17 17.00 3.43 5.40
bhograi 40.00 17.91 19.00 3.53 3.20
bhograi 91.50 20.90 31.08 2.53 3.80
bhograi 88.30 24.50 27.43 3.30 2.70
baliapal 64.00 22.43 24.38 2.75 1.10
baliapal 64.30 34.16 21.34 3.03 9.20
baliapal 82.30 28.98 21.53 2.84 6.50
basta 92.50 53.50 56.00 3.43 1.20
basta 67.00 25.00 20.60 2.12 1.30
basta 35.00 14.81 18.50 3.28 1.60
basta 79.24 11.50 18.46 2.09 7.10
basta 30.00 4.28 13.50 3.27 1.58
;
proc print data=aquifer;
run;

proc means maxdec=2 mean stderr median;
   var Depth Discharge Thickness Transmissivity Storativity;
   class block;
   output out=stat;
run;

PROC EXPORT DATA= WORK.STAT
            OUTFILE= "C:\NAIP-Training\aquifer_stat1.xlsx"
            DBMS=EXCEL REPLACE;
     SHEET="dkp";
RUN;

proc univariate data=aquifer;
   var Depth Discharge Thickness Transmissivity Storativity;
   class block;
   output out=Pctls pctlpts=20 40 pctlpre=PreTest_ PostTest_ pctlname=P20 P40;
run;

ods pdf file="C:\NAIP-Training\dkk";
proc plot data=aquifer;
   plot discharge*depth='*'
        thickness*depth='o' / overlay box;
run;
ods pdf close;

ods rtf file="C:\NAIP-Training\dkk1";

PROC FREQ data=aquifer;
   tables discharge;
run;
ods rtf close;

/* To check the normality of the distribution of the variables */
proc univariate data=aquifer;
   var Depth Discharge;
   histogram;
   qqplot / normal (mu=est sigma=est);
run;
quit;

PROC UNIVARIATE gives a more extensive list of statistics, including tests of normality, stem-and-leaf plots, box plots, etc.

proc univariate data=aquifer;
   var Depth Discharge Thickness Transmissivity Storativity;
   class block;
   output out=Pctls pctlpts=20 40 pctlpre=PreTest_ PostTest_ pctlname=P20 P40;
run;

Dealing with SAS OUTPUT

• SAS produces text output by default.
• Use the ODS statement to produce other forms of output such as PDF, RTF, HTML, etc.
• The general form of the ODS statement is:
     ODS PDF FILE = 'filename';
     ODS RTF FILE = 'filename';
• At the end, close the destination, e.g.:
     ODS PDF CLOSE;
• More than one destination can be used at the same time.

Example

PROC IMPORT OUT= WORK.rice
            DATAFILE= "C:\NAIP-Training\naip_rice.xls"
            DBMS=EXCEL REPLACE;
     RANGE="ana$";
     GETNAMES=YES;
     MIXED=YES;
     SCANTEXT=YES;
     USEDATE=YES;
     SCANTIME=YES;
RUN;

ODS PDF FILE = "C:\NAIP-Training\dkk";
proc print data=rice;
run;
ODS PDF CLOSE;
quit;

Statistical Analysis using Enterprise Guide: A Mouse-driven Approach

Dileep Kr. Panda & Saswat Kr. Sahoo
DWM, Bhubaneswar-751023
[email protected]

Introduction

SAS Enterprise Guide is hard to describe and difficult to pigeonhole: a very powerful tool, the "Excel" of SAS, and the closest thing there is to an Integrated Development Environment (IDE) for SAS. Since the release of SAS 9, Enterprise Guide has been the new face of SAS for general-purpose use in a variety of situations.

SAS is one of the best known and most widely used statistical packages in the world. Analyses using SAS are conducted by writing a program in the SAS language, running the program, and inspecting the results. Using SAS requires both a knowledge of programming concepts in general and of the SAS language in particular. One also needs to know what to do when things don't go smoothly; i.e., knowing about error messages, their meanings, and solutions.

SAS Enterprise Guide is a Windows interface to SAS whereby statistical analyses can be specified and run using normal windowing point-and-click operations, and hence without the need for programming or any knowledge of the SAS programming language. As such, SAS Enterprise Guide is ideal for those who wish to use SAS to analyze their data but do not have the time, or perhaps inclination, to undertake the considerable amount of learning involved in the programming approach. For example, those who have used SAS in the past but are a bit "rusty" in their programming may prefer SAS Enterprise Guide. Then again, those who would like to become proficient SAS programmers could start with SAS Enterprise Guide and examine the programs it produces.

It should be borne in mind that SAS Enterprise Guide is not an alternative to SAS; rather, it is an addition which allows an alternative way of working. SAS itself needs to be present, or at least available. The need for SAS to be present is because SAS Enterprise Guide works by translating the user's point-and-click operations into a SAS program. SAS Enterprise Guide then uses SAS to run that program and captures the output for the user. The computer on which SAS runs is referred to as the SAS Server. Usually the SAS Server will be the same computer, referred to as the Local Computer, but it need not be. We assume that both SAS and SAS Enterprise Guide have already been set up. Enterprise Guide provides the following features:

• Very powerful data management capabilities, including a sophisticated query builder.
• Data sampling, ranking, transposing, and even creating and editing data.
• An intuitive, visual, customizable interface.
• Ready-to-use tasks for analysis and reporting.
• A code editing facility.

SAS Enterprise Guide also connects to a SAS Metadata Repository, where information about objects is stored.

To Start Enterprise Guide

To open SAS Enterprise Guide, click the Start menu → SAS Enterprise Guide 4.2, or click the SAS EG icon on the desktop. When Enterprise Guide opens, a welcome window appears on the screen containing a few options, such as selecting a new project or opening an existing project. A project serves as a collection of:
• data sources
• SAS programs and logs
• tasks and queries
• results
• informational notes for documentation

You can control the contents, sequencing, and updating of a project.

Any project in SAS Enterprise Guide is saved with the extension .egp.

Useful Pull-down Menus

Like most Windows programs, SAS Enterprise Guide has a toolbar and a menu bar with pull-down menus that you can use to access many of the features of the program. The toolbar contains buttons for the more commonly used procedures. To see what each button does, hold the mouse over the button for a moment and a description of what the button does will appear. The following is a summary of the main pull-down menus and their functions:

Menu functions:

• File – Open new projects and save projects, data, code, reports, and process flows. Import and export data. Print the process flow. Schedule and run projects.
• Edit – Modify or copy text; search and replace data; expand or collapse data.
• View – Customize the look of the SAS Enterprise Guide window by selecting to view the toolbars, Project Designer, project flow, Task List, Task Status, and "What is" window.
• Tasks –
   – Describe: list data, generate summary statistics of the data, distribution analysis, one-way frequencies; wizards for summary statistics are available.
   – ANOVA: t test, one-way ANOVA, nonparametric, linear model, and mixed model statistics can be obtained from this part of the submenu.
   – Graph: create charts and graphs such as scatter plots, bar charts, line graphs, pie charts, donut charts, and box plots; wizards for bar charts, pie charts, and line plots are available.
   – Analyze: perform statistical procedures to produce descriptive and inferential statistics.
• Add-In – The Add-In Manager enables you to add and remove add-in tasks.
• Tools – Combine multiple reports into one; set the style of reports; schedule and order tasks.
• Help – Get help on SAS Enterprise Guide procedures; the Getting Started tutorial; connections to the online SAS Enterprise Guide resources.

New: opens a new project, data, program, etc.

Open: opens an already existing project, data set, program, report, OLAP cube, information map, etc.

Close Project: closes the project presently opened.

Save Project: saves the project currently in the process flow.

Data: this submenu contains (i) Filter and Sort, (ii) Query Builder, (iii) Append Table, (iv) Sort Data, (v) Create Format, (vi) Transpose, (vii) Random Sample, and (viii) Rank.

Storing SAS Data Sets: Libraries

The SAS data sets created so far have been left with default names and locations. Some data set labels were altered to make the process flow easier to read. In most cases, it is not necessary to alter names and locations. When you want to control where project data sets are stored, use libraries. Essentially, a library is a folder where SAS data sets are stored. Rather than refer to the folder explicitly, the folder is assigned an alias: the library name. For example, the data sets created by the Import Data task were automatically given names like SASUSER.IMP. The part of the name before the period, SASUSER, is the library name and is an alias for c:\My SAS Files\9.2 on our system. To store data sets in a particular folder:

1. Assign a library name for that folder using the Assign Library wizard (Tools → Assign Library).
2. Type in a name, which should follow the rules for data set names but be eight characters or less; e.g., saseg.
3. Add a description if required.
4. When prompted, browse to the path of the folder; e.g., c:\saseg\project.
5. Continue through the wizard accepting defaults, and an Assign Library icon should be added to the process flow. This needs to be run before the library can be used in the project, so it is best to set up libraries at the beginning of the project.

Having set up the library, any data set that is given a name beginning with the library name – for example, project.water – will be stored in the assigned folder (here, D:\libraries\project).

All SAS data sets are stored in a library. If a data set name is not prefixed with a library name, it has the implicit library name of WORK which, like SASUSER, is one of the libraries assigned automatically by SAS Enterprise Guide. However, WORK is a temporary library which means that data sets stored in it will be deleted and removed from the project when SAS Enterprise Guide is closed, although the option to move the data sets to another library is offered at that point.
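The same effect can be achieved in code; a minimal sketch reusing the project libref and water data set named above (the WORK.WATER input table is hypothetical):

data project.water;      /* permanent copy, stored in the project library */
   set work.water;       /* a temporary data set from the current session */
run;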

Data Access in SAS Enterprise Guide

Enterprise Guide provides the facility to access data kept in the following formats:
• SAS data sets (files containing data that are logically arranged in a form that SAS can understand)
• Data tables from databases (DB2 and Oracle, using licensed SAS database engines)
• Local data in other formats (such as Excel, Access, text, HTML, Lotus, etc.)
• OLAP cubes (with a connection to an OLAP server)

SAS Enterprise Guide can open and run tasks on various types of data, but if you want to edit data, it must be opened as a SAS data set. SAS Enterprise Guide enables you to import many data files from different sources to create SAS data sets.

The SAS Enterprise Guide Interface

When SAS Enterprise Guide starts, it first attempts to connect to SAS servers that it knows about. In most cases, connecting to SAS servers simply means that it finds that SAS is installed on the same computer. SAS Enterprise Guide then offers to open one of the projects that have recently been opened or to create a new project as shown below.

Projects

A project is the way in which SAS Enterprise Guide stores statistical analyses and their results: it records which data sets were used, what analyses were run, and what the results were.

o A project is a set of data, tasks, code, and results.
o All work in Enterprise Guide must be done within a project.
o Projects can be saved, run, scheduled, and shared.

Thus, a project is a piece of statistical analysis in the same way that a document is a piece of writing. In terms of scope, a project might be the user's approach to answering one particular question of interest. It should not be so large or diffuse that it becomes difficult to manage.

The most familiar elements of the interface are the menu bar and toolbar at the top of the window. There are four windows open and visible:
• the Project Explorer window
• the Project Designer window
• the Task Status window
• the Task List window

Project Designer window: a container for one or more process flows for the project. A process flow is a relational view of the objects in the project. As you add data, run tasks, and generate results, an icon for each object is added to the process flow, and any relationship between the objects is shown with an arrow.

Task List: offers alternative, sometimes quicker, ways to access features of SAS Enterprise Guide.

Task Status window: shows what is happening while SAS Enterprise Guide is using SAS to run a program.

Project Explorer window: offers an alternative view of the project to that presented in the Project Designer window. It tends to show more detail, which can be useful in some cases.

It shows everything in a project as a collapsible list: data, tasks, queries, code, logs, results, notes, documents, and reports.

The Workspace and Process Flow

Within the Project Designer window, we can see an element labeled Process Flow, which is another concept central to SAS Enterprise Guide. Essentially, a process flow is a diagram consisting of icons that represent data sets, tasks, and outputs, with arrows joining them to indicate how they relate to each other. The general term tasks includes not only statistical analyses but also data manipulation.

The Task List

The Task List is hidden by default, but it can be displayed by clicking the Task List button in the resource pane in the lower left corner of the workspace. We can use tasks to do everything from manipulating data to running specific analytical procedures to creating reports. Many tasks are also available as wizards, which can provide a quick and easy way to use some of the tasks; tasks and wizards can be chosen from the Task List or from the menus. The views of the tasks are as follows:
(i) Tasks by Category: lists individual tasks grouped by type.
(ii) Tasks by Name: lists individual tasks alphabetically; this view also lists the SAS procedure related to each task.

Local and Remote Computers

When we open data in SAS Enterprise Guide, we must select whether to look for the data on the local computer, on a SAS server, or in SAS folders. On clicking Local Computer, you can browse the directory structure of the local computer and open any type of data file that SAS Enterprise Guide can read. On clicking Servers, Enterprise Guide looks for the data on the server. On the server there are icons that we can select for Libraries and Files. Libraries are shortcut names for directory locations that SAS knows about. Some libraries are defined by SAS, and some are defined by SAS Enterprise Guide. Libraries contain only SAS data sets.

Navigating data stored on a Local Computer

To navigate the data stored on your personal computer, select Local Computer, and navigate to the location where the data is stored.

Editing Tables in the Data Grid

When you read a data set into a project, the Data Grid's default behaviour is to display the data automatically in read-only mode. SAS data sets can only be modified if the user has the appropriate authority and if the data set resides on a server where SAS is installed. To modify the data, the following actions are available in update mode:

• Change the format of a variable or column
• Edit data values
• Create new columns and add rows
• Change the name of a column
• Apply labels and formats to columns
• Sort by multiple columns in ascending or descending order

Change the format of a variable

To change the format of the variables, click on the Modify Task, which opens a wizard displaying all the columns with their names, labels, types, and formats. By clicking on Modify and selecting any single column or all columns, you can modify the attributes. Click Finish to make the change effective.

Edit data values

To edit data values in SAS data sets, click on the Edit option on the menu bar and uncheck the Protect Data option (the default is checked). Double-click on any data value and modify the value as required.

Create/delete columns and add rows

To add or append new rows in SAS data sets, click on the Edit option on the menu bar and uncheck the Protect Data option (the default is checked). Select the last row of the data set, right-click, and then select Insert a row, Delete rows, or Append a row. To add a new column, go to Edit, select the Column option, and then Insert a new column.

Splitting Data Sets: Using Filters

Filters are used to produce subsets of the observations in a data set. We might want to form a subset of the observations in order to discard observations that have errors, or because we wish to focus our analysis on one particular group of observations.

1. Click on the data set to make it the active data set.
2. Select Data → Filter and Query.
3. Click on Select Data, and then click on the Filter Data tab.
4. State is the variable we want to filter on, so we drag and drop it into the Filter Data pane. The Edit Filter window pops up.
5. Suppose the value of state that we want to select is in a list. We could simply type that into the value box, but it is safer to use the drop-down button and select Get Values. The reason for preferring Get Values is that filters which use character variables are case sensitive: Orissa is different from orissa, so if both occurred in the data set, the filter would need to include both. Using Get Values gives us the correct spelling and case, as well as alerting us to any misspellings that there might be in the data set. In our example here, the situation is straightforward, and the Query Builder window should look as shown below. A more complex filter can be constructed by clicking the New Filter button and selecting New Advanced Filter, which brings up the Advanced Expression Editor seen earlier.
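Behind the scenes, the Query Builder turns such a filter into SAS code. A rough sketch of the kind of code it generates, using the POPULATION data set from the first chapter (the actual generated code will differ in detail):

proc sql;
   create table work.orissa_only as
   select *
   from work.population
   where state = 'Orissa';   /* character comparisons are case sensitive */
quit;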

Concatenating and Merging Data Sets: Appends and Joins

You might need to join tables to accomplish the following:
o Create a report with columns from more than one table.
o Compute a new column when the input columns are located in separate tables.
o Add information from a lookup table into a base table.
o Identify values of a column that do not occur in other tables.

Append and Join

Where two or more data sets contain the same variables (or mostly the same) but different observations, they can be combined into a single data set – a concatenation, or append – by selecting Data, right-clicking on the table in the Query Builder, and specifying the table(s) to be concatenated with the active data set. Where two data sets contain mostly the same observations but different variables, they can be combined into a data set with all the variables using a join. Joins are yet another function of the Filter and Query task. We will illustrate a join using the data sets:

1. Make the imported data set the active data set.
2. Select Data → Filter and Query.
3. Click Add Tables and select the Join Tables option to access the Tables and Joins window. This window enables you to verify or change the criteria used to join multiple tables.
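In code terms, an append corresponds to the SET statement and a join to the MERGE statement introduced earlier; a minimal sketch with hypothetical tables a and b that share the key variable state:

/* Append: same variables, different observations */
data combined;
   set a b;
run;

/* Join: same observations, different variables, matched on a key */
proc sort data=a; by state; run;
proc sort data=b; by state; run;
data joined;
   merge a b;
   by state;
run;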

Sorting the Data

You can use the Filter and Sort window to modify a single table in SAS Enterprise Guide. When you use the Filter and Sort window, you create an output table that you can use in another query or task. Any changes that you make to the data are not applied to the original data source. If you need to create a more complex query that joins multiple tables together, creates computed columns, or uses prompts, then you need to use the Query Builder. You can use the Filter and Sort window to modify data in the following ways:
• Select specific variables to include in the output table
• Filter the data based on criteria that you specify
• Sort the data

You can also choose to validate and preview the modifications that you are making.

To select the variables to include in the output table

1. Open the data that you want to use and click Filter and Sort on the workspace toolbar. The Filter and Sort window opens. (You can also open the Filter and Sort window by selecting Tasks → Data → Filter and Sort.)
2. Click the Variables tab if it is not already selected.

Rank

Each column assigned to this role will be ranked. You must assign at least one variable to this role. By default, the rankings column is given the name rank_column-name, where column-name is the name of the original column. To specify a new name, type the name in the Rank column name text box. By default, the output table contains the original columns as well as the ranked columns. If you want to replace the original columns with the ranked columns, then clear the Include ranking values check box.

Rank by: When you assign one or more columns to this role, the input table is sorted by the selected column or columns, and rankings are calculated within each group.
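The Rank task is a point-and-click front end to ranking; the same idea can be expressed in code with PROC RANK. A minimal sketch using the aquifer data from the first chapter (variable names as defined there):

proc sort data=aquifer;
   by block;
run;

proc rank data=aquifer out=aquifer_ranked;
   by block;                /* rankings calculated within each block */
   var Discharge;
   ranks rank_Discharge;    /* name of the new rankings column */
run;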

Using prompts in a query

Prompts in a query require the user to select or enter a value when the query runs. In a query, prompts can be used as the comparison value in a conditional expression for a filter or a recoded column. The value that the user selects or enters for the prompt is compared with the value of the column that is used in the conditional expression. You can also use a prompt to supply a value for a title or footnote.

To create a new prompt by using the Prompt Manager in the Query Builder

1. In the Query Builder, click Prompt Manager. The Prompt Manager window opens.
2. To create a new prompt, click Add. The Add New Prompt dialog box opens. For more information, see Creating a Prompt.

To associate an existing prompt with a query

1. In the Query Builder, select Options → Options for This Query. The Query Options dialog box opens.
2. Select Prompts in the selection pane, and then click Add. The Select Prompts dialog box opens.
3. Select the prompt that you want to associate with the query and click OK.
4. Click OK again to close the dialog box.

When you use the prompt in a query, the prompt should be referenced as &prompt-name.

To remove the association between a prompt and a query, select the prompt that you want to remove on the Prompts page of the Query Options dialog box. Then click Remove. You might need to edit or delete any filters, computed columns, titles, or footnotes that use that prompt.

Graphs

SAS Enterprise Guide also makes the powerful graphics facilities of SAS much easier to use. Some of these graphics facilities are available within analysis tasks, and others are accessed from the Tasks → Graph menu. A wide range of plots and charts are described in later chapters; rather than describe the graph tasks here, the interested reader is referred to the index. One point to note, however, is that the graphs produced depend both on the format of the results and on the graph format. Both formats are specified under Tools → Options → Results → Results General and Tools → Options → Results → Graph. One major difference is that, when the output format is RTF, the graphs are included in the same file as the textual output and tables; when HTML output is chosen, each graph appears in a separate file with its own icon in the process flow.

[Figure: a bar graph indicating rainfall across various states.]

[Figure: a pie chart showing agricultural growth.]
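Charts like these can also be produced in code with SAS/GRAPH's PROC GCHART. A rough sketch, assuming a data set RAINFALL with variables state and rain (both hypothetical here):

proc gchart data=rainfall;
   vbar state / sumvar=rain;   /* vertical bar chart of rainfall by state */
run;

proc gchart data=rainfall;
   pie state / sumvar=rain;    /* pie chart of the same summary */
run;
quit;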

Running Parts of the Process Flow

So far, we have described running individual tasks. It is also possible to run a branch of the process flow, or the whole process flow. If we right-click on any task within a process flow, we have the option to run that task or to run the branch from that task. The branch is everything to the right of the task that is directly or indirectly connected to it by the arrows. To run the whole process flow, right-click on its tab and select Run.

Statistical Analysis Tasks

Once data in a SAS data set have been added to a project, whether directly or by importing raw data, the analysis can begin. One point to bear in mind is that not all tasks that might be considered as analysis are under the Task menu. Several are accessed from the Describe menu, and some of the tasks under the Data menu could form part of an analysis. A typical analysis task consists of a number of panes, each of which allows some aspect of the analysis or set of options to be specified.

The Task Roles pane, which is selected, is where the variables that are to be used in the analysis are selected and their roles in the analysis specified. The available variables are listed in the central section, and they can be dragged from there to the specific roles in the right-hand section. The available roles vary depending on the task, but some of the most common are included here:

The Dependent variable is the response variable, the one whose values we are modeling. The numeric icon to the left indicates that only numeric variables can be assigned this role, and the (Limit: 1) to the right indicates that only one response variable can be included in the model. Quantitative variables are also numeric; the dashed line around the role shows that it has been selected (clicked on), and a description of the role appears in the box below, explaining that these are continuous explanatory variables. There are no variables assigned to this role. Classification variables are discrete explanatory variables. They can be numeric or character; if they are numeric, classification variables will tend to have relatively few distinct values. Group analysis by variables are also discrete (numeric or character) variables which define groups in the data. When a variable is assigned this role, the analysis is repeated for each group defined by the variable. For example, if a variable gender, with values male and female, were assigned this role, the analysis would be repeated for males and females separately. Frequency count variables are used with grouped data, where each observation represents a number of individuals.

Capability analysis of: ph, ec

Simple Statistics

Variable    N   Mean      Std Dev   Median    Minimum   Maximum
ph         30   7.45900   0.31845   7.37000   6.83000    8.16000
ec         30   0.70167   0.44885   0.50000   0.10000    1.90000
sar        30   3.11533   2.85073   1.31000   0.58000   10.03000
no3        30   1.77367   1.72040   1.54500   0.03000    4.36000
k          30   3.39633   4.16165   1.08000   0.29000   13.44000
fe         30   0.03370   0.02919   0.03550   0          0.08200

Scatter Plot

One-Way Analysis of Variance

ANOVA

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2   0.98592000       0.49296000    6.81      0.0040
Error             27   1.95495000       0.07240556
Corrected Total   29   2.94087000

Means with the same letter are not significantly different.

Tukey Grouping     Mean    N   block
        A         7.6790   10  bhogorai
     B  A         7.4630   10  jaleswar
     B            7.2350   10  basta

Bon Grouping       Mean    N   block
        A         7.6790   10  bhogorai
     B  A         7.4630   10  jaleswar
     B            7.2350   10  basta

Summary of Forward Selection

Step   Variable Entered   Number Vars In   Partial R-Square   Model R-Square   C(p)     F Value   Pr > F
1      Thickness          1                0.6939             0.6939           0.4560   34.01     <.0001
2      Depth              2                0.0139             0.7078           1.8449    0.67     0.4281
3      Transmissivity     3                0.0192             0.7270           3.0014    0.91     0.3566

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3   1687.64092       562.54697     11.54     0.0006
Error             13    633.68539        48.74503
Corrected Total   16   2321.32631

Parameter Estimates

Variable         DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept         1   -20.38198            15.95416         -1.28     0.2238     0
Depth             1     0.16668             0.13727          1.21     0.2462     2.97731
Thickness         1     0.71928             0.29755          2.42     0.0311     2.82815
Transmissivity    1     4.47197             4.67852          0.96     0.3566     1.53157

Collinearity Diagnostics

                                         ---------- Proportion of Variation ----------
Number   Eigenvalue   Condition Index   Intercept    Depth     Thickness   Transmissivity
1        3.84597       1.00000          0.00074169   0.00207   0.00306     0.00097199
2        0.10885       5.94411          0.01979      0.05690   0.13473     0.04203
3        0.03941       9.87868          0.01800      0.36405   0.48075     0.03332
4        0.00577      25.81353          0.96147      0.57699   0.38146     0.92368

SPLIT PLOT ANALYSIS

Linear Models

The GLM Procedure

Class Level Information

Class   Levels   Values
rep     3        1 2 3
ms      6        1 2 3 4 5 6
mt      4        1 2 3 4

Number of Observations Read   72
Number of Observations Used   72

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             35   192163043.2      5490372.7     15.71     <.0001
Error             36    12584873.2       349579.8
Corrected Total   71   204747916.3

R-Square   Coeff Var   Root MSE   yield Mean
0.938535   10.79144    591.2527   5478.903

Source    DF   Type I SS     Mean Square   F Value   Pr > F
ms(rep)   17   32931455.07    1937144.42    5.54     <.0001
rep        0          0.00    .              .       .
ms         0          0.00    .              .       .
mt         3   89888101.15   29962700.38   85.71     <.0001
ms*mt     15   69343486.93    4622899.13   13.22     <.0001

Source    DF   Type III SS   Mean Square   F Value   Pr > F
ms(rep)   10    1419678.81     141967.88    0.41     0.9348
rep        2    1082576.69     541288.35    1.55     0.2264
ms         5   30429199.57    6085839.91   17.41     <.0001
mt         3   89888101.15   29962700.38   85.71     <.0001
ms*mt     15   69343486.93    4622899.13   13.22     <.0001

Parameter       Estimate          Standard Error   t Value   Pr > |t|
Intercept       1663.416667 B     418.0788264       3.98     0.0003
ms(rep) 1 1     2753.083333 B     591.2527465       4.66     <.0001
ms(rep) 2 1     3144.416667 B     591.2527465       5.32     <.0001
ms(rep) 3 1     3000.833333 B     591.2527465       5.08     <.0001
ms(rep) 4 1     1829.916667 B     591.2527465       3.09     0.0038
ms(rep) 5 1      437.750000 B     591.2527465       0.74     0.4639
ms(rep) 6 1       51.000000 B     418.0788264       0.12     0.9036
ms(rep) 1 2     3066.583333 B     591.2527465       5.19     <.0001
ms(rep) 2 2     3082.416667 B     591.2527465       5.21     <.0001
ms(rep) 3 2     3286.083333 B     591.2527465       5.56     <.0001
ms(rep) 4 2     2526.666667 B     591.2527465       4.27     0.0001
ms(rep) 5 2      337.250000 B     591.2527465       0.57     0.5720
ms(rep) 6 2      600.750000 B     418.0788264       1.44     0.1594
ms(rep) 1 3     2634.083333 B     591.2527465       4.46     <.0001
ms(rep) 2 3     3230.916667 B     591.2527465       5.46     <.0001
ms(rep) 3 3     3158.833333 B     591.2527465       5.34     <.0001
ms(rep) 4 3     2101.166667 B     591.2527465       3.55     0.0011
ms(rep) 5 3      374.750000 B     591.2527465       0.63     0.5302
ms(rep) 6 3        0.000000 B     .                 .        .
rep 1              0.000000 B     .                 .        .
rep 2              0.000000 B     .                 .        .
rep 3              0.000000 B     .                 .        .
ms 1               0.000000 B     .                 .        .
ms 2               0.000000 B     .                 .        .
ms 3               0.000000 B     .                 .        .
ms 4               0.000000 B     .                 .        .
ms 5               0.000000 B     .                 .        .
ms 6               0.000000 B     .                 .        .
mt 1            6820.000000 B     482.7558459      14.13     <.0001
mt 2            4659.666667 B     482.7558459       9.65     <.0001
mt 3            4184.666667 B     482.7558459       8.67     <.0001
mt 4               0.000000 B     .                 .        .
ms*mt 1 1      -7048.666667 B     682.7198646     -10.32     <.0001
ms*mt 1 2      -4835.000000 B     682.7198646      -7.08     <.0001
ms*mt 1 3      -5482.666667 B     682.7198646      -8.03     <.0001
ms*mt 1 4          0.000000 B     .                 .        .
ms*mt 2 1      -5964.000000 B     682.7198646      -8.74     <.0001
ms*mt 2 2      -3493.666667 B     682.7198646      -5.12     <.0001

ms*mt 2 3 -3558.000000 B 682.7198646 -5.21 <.0001

ms*mt 2 4 0.000000 B . . .

ms*mt 3 1 -5232.000000 B 682.7198646 -7.66 <.0001

ms*mt 3 2 -3212.666667 B 682.7198646 -4.71 <.0001

ms*mt 3 3 -3002.666667 B 682.7198646 -4.40 <.0001

ms*mt 3 4 0.000000 B . . .

ms*mt 4 1 -3903.333333 B 682.7198646 -5.72 <.0001

ms*mt 4 2 -1580.666667 B 682.7198646 -2.32 0.0264

ms*mt 4 3 -1986.666667 B 682.7198646 -2.91 0.0062

ms*mt 4 4 0.000000 B . . .

ms*mt 5 1 -1303.333333 B 682.7198646 -1.91 0.0643

ms*mt 5 2 244.333333 B 682.7198646 0.36 0.7225

ms*mt 5 3 456.000000 B 682.7198646 0.67 0.5084

ms*mt 5 4 0.000000 B . . .

ms*mt 6 1 0.000000 B . . .

ms*mt 6 2 0.000000 B . . .

ms*mt 6 3 0.000000 B . . .

ms*mt 6 4 0.000000 B . . .

Note: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

Level of rep    N    yield Mean   Std Dev
1              24    5369.70833   1756.97921
2              24    5650.16667   1751.89772
3              24    5416.83333   1642.83085

Duncan's Multiple Range Test for yield

Note: This test controls the Type I comparisonwise error rate, not the experimentwise error rate.

Alpha                      0.05
Error Degrees of Freedom   10
Error Mean Square          141967.9

Number of Means   2       3       4       5       6
Critical Range    342.7   358.2   367.2   373.0   376.9

Means with the same letter are not significantly different.

Duncan Grouping     Mean    N   ms
        A         5866.3   12   3
        A         5864.4   12   4
     B  A         5812.0   12   5
     B  A         5796.8   12   6
     B            5478.2   12   2
        C         4055.8   12   1

Level of ms   Level of mt   N   yield Mean   Std Dev
1             1             3   4252.66667    349.54447
1             2             3   4306.00000    884.42750
1             3             3   3183.33333    262.45254
1             4             3   4481.33333    355.00047
2             1             3   5672.00000    670.13133
2             2             3   5982.00000    470.42109
2             3             3   5442.66667    625.74542
2             4             3   4816.00000    326.50881
3             1             3   6400.00000    314.47734
3             2             3   6259.00000    336.98220
3             3             3   5994.00000    260.57628
3             4             3   4812.00000    831.54555
4             1             3   6732.66667    300.48184
4             2             3   6895.00000    297.20868
4             3             3   6014.00000    311.42896
4             4             3   3816.00000   1141.45696
5             1             3   7563.33333    279.17259
5             2             3   6950.66667    633.41719
5             3             3   6687.33333    380.63281
5             4             3   2046.66667    680.15390
6             1             3   8700.66667    215.46539
6             2             3   6540.33333    741.58906
6             3             3   6065.33333    916.87367
6             4             3   1880.66667    449.09836


Means of yield by ms and rep:

ms    rep    N     Mean          Std Dev

1 1 4 3991.00000 404.50958

2 1 4 5470.00000 738.87662

3 1 4 5718.50000 787.93464

4 1 4 5541.75000 1925.36860

5 1 4 5866.50000 2978.79903

6 1 4 5630.50000 2566.48313

1 2 4 4304.50000 988.54422

2 2 4 5408.00000 648.76704

3 2 4 6003.75000 333.75877

4 2 4 6238.50000 965.33707

5 2 4 5766.00000 2601.84294

6 2 4 6180.25000 3287.21345

1 3 4 3872.00000 708.90150

2 3 4 5556.50000 726.97799

3 3 4 5876.50000 1195.02901

4 3 4 5813.00000 1455.62862

5 3 4 5803.50000 2093.41722

6 3 4 5579.50000 2792.39891


Number of Means 2 3 4

Critical Range 399.7 420.2 433.6

Means with the same letter are not significantly different.

Duncan Grouping Mean N mt

A 6553.6 18 1

A

A 6155.5 18 2

B 5564.4 18 3

C 3642.1 18 4


Linear, Non-linear regression and response surface analysis using SAS

Dileep K. Panda Directorate of Water Management Bhubaneswar-751023 [email protected]

Multiple linear regression is one of the statistical tools used for discovering relationships between variables. It is used to find the linear model that best predicts the dependent variable from the independent variables. A data set with p independent variables has 2^p possible subset models to consider, since each of the p variables is either included in or excluded from the model, not counting interaction terms. The following assumptions should hold when performing a multiple linear regression analysis:

1. Linear functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model (no omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlying observations in the dataset

o Linear functional form
    Does not detect curvilinear relationships
o Independent observations
    Representative samples
    Autocorrelation inflates the t, r, and F statistics and warps the significance tests
o Normality of the residuals
    Permits proper significance testing
o Equality of variance
    Heteroscedasticity precludes generalization and external validity
    This also warps the significance tests

o Multicollinearity prevents proper parameter estimation. It may also preclude computation of the parameter estimates completely if it is serious enough.

57

An Introduction to the SAS System

o Outliers may bias the results. If outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates. The model can be expressed as

Yi = b0 + b1X1i + b2X2i + … + bkXki

where k = the number of independent variables

Yi = the value of the dependent variable for case i
b0 = the intercept (if all X's are zero, the expected value of Y is b0)
bj = the slope (for every one-unit increase in Xj, Y is expected to change by bj units, given that the other independent variables are held constant)
Xi = the value of the independent variable for case i

[Figure: path diagram of a linear regression analysis, with predictors X1, X2, X3 and an error term pointing to Y]

Assumptions of Linear Regression

 Homoscedasticity – the variance of the error terms is constant for each value of X. To check this, look at the plot(s) of the residuals versus the X value(s). You do not want to see a fanning effect in the residuals.

 Linearity – the relationship between each X and Y is linear. To check this, again look at the plot(s) of the residuals versus the X value(s). You don't want to see a clustering of positive residuals or a clustering of negative residuals.

 Normally Distributed Error Terms – the error terms follow the normal distribution.

 Independence of Error Terms – successive residuals are not correlated. If they are correlated, it is known as autocorrelation. Use the Durbin-Watson statistic to check this.

Examining data prior to modelling

Before you begin modelling, it is recommended that you plot your data. By examining these initial plots, you can quickly assess whether the data have linear relationships or whether interactions are present. The linear regression model may not be directly applicable to certain data. Non-linearity may be detected from scatter plots, or may be known through the underlying theory of the product or process, or from past experience. Transformations on either the predictor variable, X, or the response variable, Y, may often be sufficient to make the linear regression model appropriate for the transformed data. If it is known that the data follow the logarithmic distribution, then a logarithmic transformation on Y (i.e., Y* = log(Y)) might be useful. For data following the Poisson distribution, a square root transformation (Y* = sqrt(Y)) is generally applicable. Transformations on Y may also be applied based on the type of scatter plot obtained from the data. Figure 4.17 shows a few such examples. For the scatter plot of Figure (a), a square root transformation (Y* = sqrt(Y)) is applicable, while for Figure (b) a logarithmic transformation (Y* = log(Y)) may be applied. For Figure (c), the reciprocal transformation (Y* = 1/Y) is applicable. At times it may be helpful to introduce a constant into the transformation of Y. For example, if Y can be negative and the logarithmic transformation on Y seems applicable, a suitable constant, k, may be chosen to make all observed Y positive. Thus the transformation in this case would be Y* = log(k + Y).

SAS CODES

/* RAW DATA EXAMINATION */
proc plot data=dataplot;
   plot Y*X1 Y*X2 Y*X3;
run;
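If the plots suggest one of the transformations described above, it can be created in a short DATA step before refitting the model. This is a minimal sketch; the new variable names are arbitrary, and the output data set name is an assumption:

data dataplot_t;
   set dataplot;
   y_log  = log(Y);    /* logarithmic transformation, Y* = log(Y)  */
   y_sqrt = sqrt(Y);   /* square root transformation, Y* = sqrt(Y) */
   y_inv  = 1/Y;       /* reciprocal transformation,  Y* = 1/Y     */
run;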

Creating the Model

The REG procedure can be used to build and test the assumptions of the data we propose to model.

R-square and Adj-Rsq

You want these numbers to be as high as possible. If your model has a lot of variables, use Adj-Rsq, because a model with more variables will have a higher R-square than a similar


model with fewer variables. Adj-Rsq takes the number of variables in your model into account. An R-square of 0.7 or higher is generally accepted as good.

Root MSE

The RMSE needs to be small compared to that of other candidate models; the value of Root MSE depends on the scale of the Y values used for modeling. As a guideline, you want each of the variables in your model to have a Type III SS p-value of 0.05 or less. Other approaches to finding good models are a small PRESS statistic (reported in PROC REG as Predicted Resid SS (Press)) or a Cp statistic of about p-1, where p is the number of parameters in your model. Cp can also be found using PROC REG.

Test of Assumptions

Plots of residuals against the predictor variable or against the fitted values are helpful not only for judging whether a linear regression function is appropriate but also for examining whether the variance of the error terms is constant.

Here, the residuals show no systematic tendencies of positive and negative values suggesting the appropriateness of linear regression.

The curvilinear nature of errors indicates a departure from the linear regression model, thus requiring a curvilinear regression function.


Tests for Normality of residuals

 Kolmogorov-Smirnov test
 Shapiro-Wilk test

If the normal probability plot of the errors looks linear, with the points falling along the diagonal, then the errors are normally distributed.

Testing for Autocorrelation

Autocorrelation occurs when an error term is related to a previous error term. This situation can arise with time series data such as monthly sales. The Durbin-Watson statistic can be used to check whether autocorrelation exists; it tests for first-order correlation of the error terms and is requested with the DW option in PROC REG. The Durbin-Watson statistic ranges from 0 to 4.0. Generally, a D-W statistic of about 2.0 indicates that the errors are independent.
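A minimal sketch of requesting the statistic (the data set and variable names are assumptions):

proc reg data=dataplot;
   model Y = X1 X2 X3 / dw;   /* DW prints the Durbin-Watson statistic */
run;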

proc reg;
   model Y = X1 X2 X3;
   output out=resdat r=resid p=pred;
run;
data check;
   set resdat;
run;
proc univariate normal plot;
   var resid;
   title 'Test of Normality of Residuals';
run;

The PROC REG provides nine methods of model selection

NONE stands for no selection. This method is the default and uses the full model given in the MODEL statement to fit the linear regression.

FORWARD stands for forward selection. This method starts with no variables in the model and adds variables one by one to the model. At each step, the variable added is the one that


maximizes the fit of the model. You can also specify groups of variables to treat as a unit during the selection process. An option enables you to specify the criterion for inclusion.

BACKWARD stands for backward elimination. This method starts with a full model and eliminates variables one by one from the model. At each step, the variable with the smallest contribution to the model is deleted. You can also specify groups of variables to treat as a unit during the selection process. An option enables you to specify the criterion for exclusion.

STEPWISE stands for stepwise regression, forward and backward. This method is a modification of the forward-selection method in that variables already in the model do not necessarily stay there. You can also specify groups of variables to treat as a unit during the selection process. Again, options enable you to specify criteria for entry into the model and for remaining in the model.

MAXR stands for maximum R2 improvement. This method tries to find the best one-variable model, the best two-variable model, and so on. The MAXR method differs from the STEPWISE method in that many more models are evaluated with MAXR, which considers all switches before making any switch. The STEPWISE method may remove the "worst" variable without considering what the "best" remaining variable might accomplish, whereas MAXR would consider what the "best" remaining variable might accomplish. Consequently, MAXR typically takes much longer to run than STEPWISE.

MINR stands for minimum R2 improvement. This method closely resembles MAXR, but the switch chosen is the one that produces the smallest increase in R2.

RSQUARE finds a specified number of models having the highest R2 in each of a range of model sizes.

CP finds a specified number of models with the lowest Cp within a range of model sizes.

ADJRSQ finds a specified number of models having the highest adjusted R2 within a range of model sizes.
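A selection method is requested with the SELECTION= option on the MODEL statement. A minimal sketch (the data set, variables, and significance levels are assumptions):

proc reg data=dataplot;
   /* slentry/slstay set the significance levels to enter and to stay */
   model Y = X1 X2 X3 X4 / selection=stepwise slentry=0.15 slstay=0.15;
run;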

Keywords of PROC REG

Keyword Statistic

COOKD. Cook's D influence statistics

COVRATIO. standard influence of observation on covariance of betas

DFFITS. standard influence of observation on predicted value

H. leverage

LCL. lower bound of 100(1-α)% confidence interval for individual prediction

LCLM. lower bound of 100(1-α)% confidence interval for the mean of the dependent variable

PREDICTED. | PRED. | P. predicted values

PRESS. residuals from refitting the model with current observation deleted

RESIDUAL. | R. residuals

RSTUDENT. studentized residuals with the current observation deleted

STDI. standard error of the individual predicted value

STDP. standard error of the mean predicted value

STDR. standard error of the residual

STUDENT. residuals divided by their standard errors

UCL. upper bound of 100(1-α)% confidence interval for individual prediction

UCLM. upper bound of 100(1-α)% confidence interval for the mean of the dependent variable

Collinearity

First, look at multicollinearity from a conventional viewpoint. The absence of multicollinearity is essential to a multiple regression model. In regression, when several predictors (regressors) are highly correlated, the problem is called multicollinearity or collinearity. When things are related, we say they are linearly dependent on each other, because you can nicely fit a straight regression line to pass through many data points of those variables. Collinearity simply means co-dependence. Collinearity is problematic when one's purpose is explanation rather than mere prediction. Collinearity makes it more difficult to achieve significance of the collinear parameters. But if such estimates are statistically significant, they are as reliable as any other variables in a model; and even if they are not significant, the sum of the coefficients is likely to be reliable. In this case, increasing the sample size is a viable remedy for collinearity when prediction instead of explanation is the goal. However, if the goal is explanation, measures other than increasing the sample size are needed.

 VIF

A statistic called the Variance Inflation Factor, VIF, can be used to test for multicollinearity. A cut-off of 10 can be used to test whether a regression function is unstable. If VIF > 10, you should search for causes of multicollinearity. If multicollinearity exists, then change the model by dropping a term, transforming a variable, or using ridge regression. To check the VIF statistic for each variable, PROC REG with the VIF option in the MODEL statement can be applied.

Understanding multicollinearity should go hand in hand with understanding variance inflation. Variance inflation is the consequence of multicollinearity; we may say multicollinearity is the symptom while variance inflation is the disease. In a regression model we expect a high variance explained (R-square): the higher the variance explained, the better the model. However, if collinearity exists, the variances, standard errors, and parameter estimates are probably all inflated. In other words, the high variance is not a result of good independent predictors, but of a mis-specified model that carries mutually dependent and thus redundant predictors.

In SAS, the VIF can be obtained as

PROC REG;
   MODEL Y = X1 X2 X3 X4 / VIF;
RUN;

To compute each VIF, the SAS System regresses every independent variable on all of the others, e.g.:

X1 = X2 X3 X4
X2 = X1 X3 X4
X3 = X1 X2 X4

A frequently used remedy for too many variables is stepwise regression; however, criteria such as "Maximum R-square," "Root mean square error," and "Mallows' Cp" are considered better alternatives.

Ridge regression

When multicollinearity occurs, the variances are large, and thus the estimates may be far from the true values. Ridge regression is an effective countermeasure because it allows better interpretation of the regression coefficients by imposing some bias on the regression coefficients and shrinking their variances.

The following is an example of performing ridge regression in SAS:

proc reg outest=ridge outvif outstb ridge=0 to 4.0 by .1;
   model Y = X1 X2 X3 X4;
   plot / ridgeplot;
run;
data new;
   set ridge;
   if _type_='RIDGESTB' or _type_='RIDGEVIF';
run;
proc sort;
   by _type_;
run;
proc gplot;
   by _type_;
   plot (X1 X2 X3 X4)*_RIDGE_ / overlay;
run;
data two;
   set ridge;
   if _type_='RIDGE';
run;
proc print;
run;

For a regression model which carries interaction terms, quadratic terms, or cubic terms, other remedies such as "centered-score regression" or "orthogonalization" may be necessary.

Testing for Outliers

Outliers are observations that exert a large influence on the overall outcome of a model or on a parameter's estimate. When examining outlier diagnostics, the size of the dataset is important in determining cut-offs. A data set for which 2(p/n)^(1/2) < 1 is considered large, where p is the number of terms in the model excluding the intercept and n is the sample size. The REG procedure can be used to view various outlier diagnostics: the INFLUENCE option requests a host of outlier diagnostic tests, and the R option is used to print Cook's D. The output prints statistics for each observation in the dataset.

Cook's D is a statistic that detects outlying observations by evaluating all the variables simultaneously. SAS prints a graph that makes it easy to spot outliers using Cook's D; a Cook's D value greater than 2 should be investigated. RSTUDENT is the studentized deleted residual; it checks whether the model is significantly different if an observation is removed. An RSTUDENT whose absolute value is larger than 2 should be investigated.

The DFFITS statistic also tests whether an observation is strongly influencing the model. Interpretation of DFFITS depends on the size of the dataset. If your dataset is small to medium size, you can use 1.0 as a cut-off point: DFFITS values greater than 1.0 warrant investigating. For large datasets, investigate observations where DFFITS > 2(p/n)^(1/2). DFFITS can also be evaluated by comparing the observations among themselves; observations whose DFFITS values are extreme in relation to the others should be investigated.

Outliers can be addressed by assigning weights, modifying the model (e.g., transforming variables), or deleting the observations (e.g., if a data entry error is suspected). Whatever approach is chosen to deal with outliers should be done for a reason and not just to get a better fit.
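These diagnostics can be requested together. A minimal sketch (the data set and variable names are assumptions):

proc reg data=dataplot;
   /* INFLUENCE requests DFFITS, RSTUDENT, leverage, etc.; R prints Cook's D */
   model Y = X1 X2 X3 / influence r;
run;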

Nonlinear Regression

A regression model is called nonlinear if the derivatives of the model with respect to the model parameters depend on one or more parameters. This definition is essential to distinguish nonlinear from curvilinear regression. A regression model is not necessarily nonlinear just because the graphed regression trend is curved. A polynomial model such as

Y = b0 + b1X^2 + b2X + e

appears curved when Y is plotted against X. It is, however, not a nonlinear model. To see this, take derivatives of Y with respect to the parameters b0, b1, and b2:

dY/db0 = 1, dY/db1 = X^2, dY/db2 = X

None of these derivatives depends on a model parameter, so the model is linear. In contrast, consider the log-logistic model

Y = d + (a - d)/(1 + exp{b log(X/g)}) + e

Take the derivative with respect to d, for example:

dY/dd = 1 - 1/(1 + exp{b log(X/g)})

This derivative involves other parameters, hence the model is nonlinear. Fitting a nonlinear regression model to data is slightly more involved than fitting a linear model, but nonlinear models have specific advantages.


 Nonlinear models are often derived on the basis of physical and/or biological considerations, e.g., from differential equations, and have justification within a quantitative conceptualization of the process of interest.

 The parameters of a nonlinear model usually have a direct interpretation in terms of the process under study. In the case of the log-logistic model above, for example, the response takes on a sigmoidal shape between d and a, and g is the value for which the response achieves (a + d)/2.

 Constraints can be built into a nonlinear model easily and are harder to enforce for linear models. If, e.g., the response achieves an asymptotic value as X grows, many nonlinear models have such behavior built in automatically.

Fitting Nonlinear Regressions

One of the disadvantages of nonlinear models is that the fitting process is iterative. To estimate the parameters of the model, you commence with a set of user-supplied starting values. The software then tries to improve the quality of the model fit to the data by adjusting the values of the parameters successively; the adjustment of all parameters is considered one iteration. In the next iteration, the program again attempts to improve the fit by modifying the parameters. Once an improvement is not possible, the fit is considered converged. Care must be exercised in choosing good starting values. The fact that the program cannot improve on the model fit between successive iterations may not indicate that the best parameter estimates have been found, but rather a lack of progress of the iterative algorithm. It is possible to send the algorithm off into regions of the parameter space from which it cannot escape, but that do not provide the best estimates. It is thus sensible to start the iterative process with different sets of starting values and to observe whether the program arrives at the same parameter estimates.

The NLIN procedure fits nonlinear regression models and estimates the parameters by nonlinear least squares or weighted nonlinear least squares. You specify the model with programming statements. This gives you great flexibility in modeling the relationship between the response variable and independent (regressor) variables. It does, however, require additional coding compared to model specifications in linear modeling procedures such as the REG, GLM, and MIXED procedures. You need to declare the parameters in your model and supply their initial values for the NLIN procedure. You do not need to specify derivatives of the model equation with respect to the parameters. Although facilities for specifying first and second derivatives exist in the NLIN procedure, it is not recommended that you specify derivatives this way. Obtaining derivatives from user-specified expressions predates the high-quality automatic differentiator that is now used by the NLIN procedure. Nonlinear least-squares estimation involves finding those values in the parameter space that minimize the (weighted) residual sum of squares. In a sense, this is a "distribution-free" estimation criterion since the distribution of the data does not need to be fully specified. Instead, the assumption of homoscedastic and uncorrelated model errors with zero mean is sufficient. You can relax the homoscedasticity assumption by using a weighted residual sum of squares criterion. The assumption of uncorrelated errors (independent observations) cannot be relaxed in the NLIN procedure. The primary assumptions for analyses with the NLIN procedure are

 The structure in the response variable can be decomposed additively into a mean function and an error component.


 The model errors are uncorrelated and have zero mean. Unless a weighted analysis is performed, the errors are also assumed to be homoscedastic (have equal variance).  The mean function consists of known regressor (independent) variables and unknown constants (the parameters).

Fitting nonlinear models can be a difficult undertaking. There is no closed-form solution for the parameter estimates, and the process is iterative. There can be one or more local minima in the residual sum of squares surface, and the process depends on the starting values supplied by the user. You can reduce the dependence on the starting values and reduce the chance to arrive at a local minimum by specifying a grid of starting values. The NLIN procedure then computes the residual sum of squares at each point on the grid and starts the iterative process from the point that yields the lowest sum of squares. Even in this case, however, convergence does not guarantee that a global minimum has been found.
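A grid of starting values is written directly in the PARMS statement; PROC NLIN evaluates the residual sum of squares at every grid point and iterates from the best one. A minimal sketch, reusing the logistic growth model fitted later in this section (the grid ranges are assumptions):

proc nlin data=WORK.EI_population;
   /* grid of starting values: NLIN starts from the best grid point */
   parms b1=0.05 to 0.25 by 0.05
         b2=10 to 50 by 10
         b3=1000 to 2000 by 500;
   model ap = b3/(1 + b2*exp(-b1*year1));
run;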

The NLIN procedure solves the nonlinear least squares problem by one of the following four algorithms:

 Steepest-descent or gradient method
 Newton method
 Modified Gauss-Newton method
 Marquardt method

These methods use derivatives or approximations to derivatives of the SSE with respect to the parameters to guide the search for the parameters producing the smallest SSE. Derivatives computed automatically by the NLIN procedure are analytic, unless the model contains functions for which an analytic derivative is not available.

Decadal population growth (in million) in eastern India and India

Decade Orissa AP TN Pondicherry India

1901 10.30 19.07 19.25 0.25 238.40

1911 11.38 21.45 20.90 0.26 252.09

1921 11.16 21.42 21.63 0.24 251.32

1931 12.49 24.20 23.47 0.26 278.98

1941 13.77 27.29 26.27 0.29 318.66

1951 14.65 31.12 30.12 0.32 361.09

1961 17.55 35.98 33.69 0.37 439.23

1971 21.94 49.98 41.20 0.47 548.16

1981 26.37 53.55 48.41 0.60 685.18


1991 31.66 66.51 55.86 0.81 838.58

2001 36.71 75.73 62.11 0.97 1027.02

SAS codes for fitting non-linear regression to the population of Andhra Pradesh (AP)

PROC IMPORT OUT=WORK.EI_population
     DATAFILE="C:\NAIP-Training\population.xls"
     DBMS=EXCEL REPLACE;
     RANGE="sheet2$";
     GETNAMES=YES;
     MIXED=NO;
     SCANTEXT=YES;
     USEDATE=YES;
     SCANTIME=YES;
RUN;
proc nlin data=WORK.EI_population;
   model ap=b3/(1+b2*exp(-b1*(year1)));
   /* year1 = 1 for decade 1901, 2 for 1911, 3 for 1921, and so on */
   /* initial values */
   parms b1=0.0012 b2=46 b3=1590;
   output out=pp1 p=predict r=resid;
run;
proc plot data=pp1;
   plot AP*Decade predict*Decade='p' / overlay;
   plot resid*Decade / vref=0;
run;
quit;


Response Surface

In statistics, response surface methodology explores the relationships between several explanatory variables and one or more response variables. Many industrial experiments are conducted to discover which values of given factor variables optimize a response. If each factor is measured at three or more values, a quadratic response surface can be estimated by least-squares regression. The predicted optimal value can be found from the estimated surface if the surface is shaped like a simple hill or a valley. If the estimated surface is more complicated, or if the predicted optimum is far from the region of experimentation, then the shape of the surface can be analyzed to indicate the directions in which new experiments


should be performed. The RSREG procedure uses the method of least squares to fit quadratic response surface regression models. Response surface models are a kind of general linear model in which attention focuses on the characteristics of the fitted response function and, in particular, on where optimum estimated response values occur. In addition to fitting a quadratic function, you can use the RSREG procedure to test for lack of fit, test for the significance of individual factors, analyze the canonical structure of the estimated response surface, compute the ridge of optimum response, and predict new values of the response.

The following statements are available in PROC RSREG.

PROC RSREG;
   MODEL responses = independents;
   RIDGE;
   WEIGHT variable;
   ID variables;
   BY variables;

The PROC RSREG and MODEL statements are required.

The MODEL statement lists the response (dependent) variables, followed by an equal sign, and then lists the independent variables, some of which may be covariates. The output options of the MODEL statement specify which statistics are written to the data set created with the OUT= option in the PROC RSREG statement. If none of the options are selected, the data set is created but contains no observations. The option keywords become values of the special variable _TYPE_ in the output data set:

o Analyze original data: NOCODE
o Fit model to first BY group only: BYOUT
o Declare covariates: COVAR=
o Request additional statistics: PRESS
o Request additional tests: LACKFIT
o Suppress displayed output: NOANOVA, NOOPTIMAL, NOPRINT
o Output statistics: ACTUAL, PREDICT, RESIDUAL, L95, U95, L95M, U95M, D

SAS code for fruit quality at different levels of temperature and days stored:

data fruit_quality;
   input Temp DaysStored Firmness Puncture TSS Acidity;
   datalines;
10 4 7625.00 397.00 12.26 0.25
10 10 7596.67 395.00 12.29 0.25
10 17 7576.67 390.00 12.32 0.24
10 24 7567.67 386.00 12.61 0.24
10 31 7428.00 380.67 13.00 0.24
10 38 7346.00 375.67 13.29 0.23
10 45 7215.00 364.67 13.38 0.23
10 52 7025.00 349.67 13.53 0.23
10 59 6907.67 342.00 13.78 0.22
15 4 7625.00 397.00 12.26 0.25
15 10 7600.33 394.00 12.32 0.25
15 17 7560.00 390.00 12.56 0.24
15 24 7529.67 387.00 12.81 0.24
15 31 7515.33 383.00 13.01 0.23
15 38 7431.67 376.00 13.30 0.23
15 45 7241.67 361.00 13.69 0.22
15 52 7060.67 346.33 14.01 0.22
15 59 6851.67 327.33 14.31 0.21
20 4 7625.00 397.00 12.26 0.25
20 10 7528.00 382.33 12.41 0.25
20 17 7410.00 353.00 12.82 0.24
20 24 7221.67 335.33 13.24 0.24
20 31 7084.67 318.00 13.63 0.24
20 38 6849.00 300.33 14.13 0.22
20 45 6621.00 285.67 14.75 0.21
20 52 6379.33 271.00 15.20 0.20
20 59 5742.33 240.00 16.36 0.18
25 4 7625.00 397.00 12.26 0.25
25 10 7417.33 364.00 12.76 0.24
25 17 7173.00 331.67 13.40 0.23
25 24 6826.67 308.00 14.23 0.22
25 31 6430.00 270.33 15.28 0.20
25 38 5786.67 197.67 16.45 0.18
;
run;
proc rsreg data=work.fruit_quality out=cob;
   model firmness puncture tss acidity = temp daysstored / nocode lackfit predict;
run;
/*proc reg data=work.fruit_quality;*/
/*model firmness puncture tss acidity = temp daysstored;*/
/*run;*/
quit;


APPLICATION OF GLM USING SAS

N. N. Jambhulkar C.R.R.I, Cuttack- 753006 [email protected]

In agricultural research, experiments are generally conducted in a completely randomized design, randomized complete block design, Latin square design, split-plot design, strip-plot design, etc. SAS has several procedures for the analysis of these designs, and PROC GLM is the most widely used. The 'GLM' in PROC GLM stands for 'General Linear Model'. The GLM procedure uses the method of least squares to fit general linear models. Among the statistical methods available in PROC GLM are regression, analysis of variance, analysis of covariance, multivariate analysis of variance, and partial correlation.

Statistical Assumptions for Using PROC GLM

Consider the general linear model

y = Xβ + ε

The basic statistical assumption underlying the least squares approach to general linear modeling is that the observed values of each dependent variable can be written as the sum of two parts: a fixed component Xβ, which is a linear function of the independent variables, and a random noise, or error, component ε. The errors for different observations are assumed to be uncorrelated with identical variances. Thus, this model can be written as

E(Y) = Xβ and Var(Y) = σ²I

The least squares approach provides estimates of the linear parameters that are unbiased and have minimum variance among linear estimators. Under the further assumption that the errors have a normal (or Gaussian) distribution, the least squares estimates are the maximum likelihood estimates and their distribution is known.

The general form of the GLM procedure is

proc glm options;
   class variables;
   model dependent = independent / options;
   lsmeans effects / options;
   means effects / options;
run;

In the above GLM procedure statement, "CLASS" specifies the classification variables for the analysis, and "MODEL" specifies the dependent and independent variables. "MEANS" computes the arithmetic means and standard deviations of all continuous variables in the model (both dependent and independent) for each effect listed in the MEANS statement. Only classification effects, that is, effects composed of classification variables, can be specified in the MEANS statement.


"LSMEANS" computes least-squares means for each effect listed in the LSMEANS statement. Only classification effects can be specified in the LSMEANS statement. In contrast to the MEANS statement, the LSMEANS statement performs multiple comparisons on interactions as well as main effects.

Single Factor Experiments

Example of CRD

Consider an experiment conducted in a CRD on chemical control of brown planthopper and stem borers in rice, with 7 treatments. The SAS statements are shown in Figure 1 below:

Figure 1
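The program in Figure 1 is an image in the original document. The following is a minimal sketch of an equivalent program; the data set and variable names (crd, trt, yield) are assumptions, not the author's exact code:

proc glm data=crd;
   class trt;                /* treatment is the only classification variable */
   model yield = trt;        /* one-way ANOVA model for a CRD                 */
   means trt / lsd duncan;   /* multiple comparisons of treatment means      */
run;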


The output of the above program is shown below:

Figure 2

Figure 2 shows the ANOVA for the model. The ANOVA table tests whether the means of all treatment groups are equal. The p-value is less than 0.05, so the null hypothesis is rejected, and we conclude that at least one treatment mean differs from the others. The R2 for this model is 0.73, indicating that the model accounts for approximately 73% of the variability in the dependent variable. There are four types of sums of squares available in the PROC GLM procedure. For a one-way analysis of variance, all types of sums of squares are the same. For a two-way analysis of variance, all types are the same if the data are balanced; the four types are generally not identical for unbalanced data. Type I and Type III sums of squares are produced by default in PROC GLM, and Type III sums of squares are the most commonly used for unbalanced-data ANOVA. The sum of squares is used to test the null hypothesis that the effect of the corresponding term in the model is insignificant; here the p-value is < 0.05, hence the null hypothesis is rejected.


Figure 3

Figure 3 gives the multiple comparison tests for the treatment means. Different tests are available for multiple comparisons, such as the least significant difference (LSD), Duncan's multiple range test (DMRT), Tukey's honestly significant difference, etc. The LSD value in the figure is 452.2. Treatment means with the same letter are not significantly different.

Example of RCBD

Consider an experiment conducted in an RCBD that has six treatments (rates of seeding of the rice variety IR8), each replicated four times. The SAS statements for this experiment are given in Figure 4 below:


Figure 4
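As with Figure 1, the program in Figure 4 is an image in the original. A minimal sketch of an equivalent program (the data set and variable names rcbd, rep, trt, yield are assumptions):

proc glm data=rcbd;
   class rep trt;            /* block (replication) and treatment factors */
   model yield = rep trt;    /* RCBD model: block effect plus treatment   */
   means trt / lsd;          /* compare treatment means                   */
run;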

The output of the above experiment is given in Figures 5 and 6 below; the interpretation of the results is the same as explained previously.

Figure 5

Figure 6


Two Factor Experiments

Example of split-plot design

Now consider a two-factor experiment conducted in a split-plot design, with six levels of nitrogen as the main plot treatment (represented as 'ms') and four rice varieties as the sub-plot treatment (represented as 'mt'). The SAS statements for this example are shown in Figure 7 below:

Figure 7
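Figure 7 likewise appears as an image in the original. One common way to code a split-plot analysis in PROC GLM is sketched below; the data set name and the exact model formulation are assumptions, not necessarily the author's statements (the MEANS and LSMEANS statements match those discussed next):

proc glm data=splitplot;
   class rep ms mt;
   model yield = rep ms rep*ms mt ms*mt;
   test h=ms e=rep*ms;        /* main plot factor tested against the rep*ms error */
   means ms / lsd e=rep*ms;   /* main plot comparisons using the whole-plot error */
   means mt ms*mt / lsd;      /* sub-plot means and treatment-combination means   */
   lsmeans ms*mt / pdiff;     /* pairwise comparisons of level combinations       */
run;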

One can use TUKEY or SCHEFFE in place of LSD for making all possible pairwise comparisons. The statement MEANS mt ms*mt/lsd; gives only means and standard deviations for the level combinations of ms and mt. For pairwise comparisons of the level combinations of ms and mt, the statement lsmeans ms*mt/pdiff; is used.


The output of the above example is given below:

Figure 8

In Figure 8, the p-value of the model is < 0.05, hence at least one treatment mean differs from another. In the sums of squares table, first consider the test for the interaction ms*mt. Its p-value is < 0.05, hence the null hypothesis is rejected and we conclude that there is an interaction between the ms and mt treatments. The p-values for ms and mt are both < 0.05, hence both the main plot treatments and the sub-plot treatments are significant. The multiple comparison tests for the main plot treatments (ms) and sub-plot treatments (mt) are given in Figures 9 and 10, respectively.


Figure 9

Figure 10

The LSMEANS statement in PROC GLM can be used to perform multiple comparisons on interactions. The table in Figure 11 presents the LSMEANS for the ms*mt combinations and assigns a number to each treatment combination.


Figure 11

In Figure 12, the table gives the p-values for tests of the null hypothesis that pairs of group means are equal. If a p-value in the table is < 0.05, then the corresponding two group means are significantly different.

Figure 12


Four Types of Sums of Squares

The four types of sums of squares available in PROC GLM are explained below. Consider the general linear model

y_ijk = μ + α_i + β_j + (αβ)_ij + ε_ijk

Effect   Type I             Type II            Type III           Type IV
A        R(A | μ)           R(A | μ, B)        R(A | μ, B, A*B)   R(A | μ, B, A*B)
B        R(B | μ, A)        R(B | μ, A)        R(B | μ, A, A*B)   R(B | μ, A, A*B)
A*B      R(A*B | μ, A, B)   R(A*B | μ, A, B)   R(A*B | μ, A, B)   R(A*B | μ, A, B)

R(.) represents the reduction in the sum of squared residuals when the corresponding factor is added to the model.

Type I sums of squares are usually called sequential sums of squares and have the unique property that the Type I sums of squares for all the factors in the model always add up to the model sum of squares. Each sum of squares is adjusted for the effects that appear before it in the model. For example, if the model statement is written as Y = A B A*B, then the sum of squares for A is the reduction in the unexplained variability adjusted for the mean, and the sum of squares for B is the reduction in the unexplained variability adjusted for the mean and A. In other words, the sum of squares for B tells you how much more variability is explained if you include B in a model that already contains A. In the case of Type I sums of squares, the order in which effects are written in the model statement determines which hypotheses are tested. Type I sums of squares are useful for nested models and polynomial models.

Type II sums of squares are adjusted for every other effect that does not contain the effect being tested. For example, suppose that the model statement is Y = A B A*B. The sum of squares for A*B is adjusted for all other terms in the model, because no other term contains both A and B. The sum of squares for A is adjusted only for B, because the term A*B contains A. Type II sums of squares may be useful when there is no significant interaction and the experimenter is interested in main effects.

Type III sums of squares are partial sums of squares. They are adjusted for all other effects in the model. Essentially, they measure the reduction in unexplained variability when the effect is entered into a model that already contains all other effects. Type III sums of squares are useful when you are interested in main effects adjusted for interaction effects. Although Type II sums of squares provide more powerful tests for main effects, Type III sums of squares should be used to protect against the presence of interactions.

Type IV sums of squares are also partial sums of squares that are adjusted for all other effects in the model. The adjustment rule differs from that of Type III sums of squares when there are empty cells in the data; Type IV is the same as Type III if there are no empty cells.
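All four types can be requested explicitly with MODEL statement options. A minimal sketch (the data set and variable names are assumptions):

proc glm data=twoway;
   class a b;
   model y = a b a*b / ss1 ss2 ss3 ss4;   /* print Type I-IV sums of squares */
run;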


ANALYSIS OF VARIANCE BY SAS-AN OVERVIEW

P. K. Meher Central Institute of Freshwater Aquaculture Kausalyaganga, Bhubaneswar [email protected]

LEARNING OBJECTIVES
 To be able to compare three or more means using one-way ANOVA with multiple comparisons
 To be able to perform a repeated measures (dependent samples) analysis of variance with multiple comparisons
 To be able to graph mean comparisons

How do we perform an analysis of variance (ANOVA) for several common designs? These procedures are used to compare means across groups or to compare three or more repeated measures (dependent samples). SAS provides three major procedures for performing analysis of variance: PROC ANOVA, PROC GLM, and PROC MIXED.

PROC ANOVA is a basic procedure that is useful for one-way ANOVA or for multi-way factorial designs with fixed factors and an equal number of observations per cell.

Here we use PROC ANOVA in our analysis of a one-way ANOVA. PROC GLM is a SAS procedure that is similar to, but more advanced than, PROC ANOVA. GLM stands for General Linear Model.

We use PROC GLM for the one-way repeated measures analysis because it involves techniques not supported by PROC ANOVA.

We can also use PROC MIXED (the newest of the three SAS ANOVA procedures) to analyze a model with both fixed and random factors and for a repeated measures design with a grouping factor.

COMPARING THREE OR MORE MEANS USING ONE-WAY ANALYSIS OF VARIANCE

A one-way analysis of variance is an extension of the independent group t-test where there are more than two groups. Assumptions for this test are similar to those for the t-test: data within groups are normally distributed with equal variances across groups. Another key assumption is that of independent samples. That is, not only do the observations within a group represent a random sample, but there is no matching or pairing of observations among groups. This is analogous to the independent samples requirement in the two-sample t-test. As with the t-test, the ANOVA is robust against moderate departures from the assumptions of normality and equal variance (especially for larger sample sizes). However, the assumption of independence is critical. The hypotheses for the comparison of independent groups are (k is the number of groups):

H0: μ1 = μ2 = … = μk : Means of all the groups are equal.


Ha: μi ≠ μj for some i ≠ j : At least two means are not equal.

The test statistic reported is an F test with k-1 and N-k degrees of freedom, where N is the number of subjects. A low p-value for the F-test is evidence for rejecting the null hypothesis. In other words, there is evidence that at least one pair of means is not equal. These tests can be performed in SAS using PROC ANOVA. The syntax is:

PROC ANOVA <options>;
   CLASS variable;
   MODEL dependentvar = independentvars;
   MEANS independentvars / type comparison <means options>;

Some of the common options for PROC ANOVA are:

DATA=dataset name   Specifies the data set to use
ORDER=option        Options are DATA, FORMATTED, FREQ, and INTERNAL

Some of the commonly used statements for PROC ANOVA are:

CLASS variable      This statement is required and specifies the grouping variable for the analysis.
MODEL statement     This statement specifies the dependent and independent variables for the analysis. More specifically, it takes the form MODEL dependentvariable=independentvariable; The dependent variable is the quantitative variable of interest (outcome variable), and the independent variable (one independent variable in the simple ANOVA case) is the grouping variable for the analysis (the variable listed in the CLASS statement).

Using the MEANS Statement

In one-way analysis of variance, typically there is a two-step procedure:

(1) Test H0: μ1 = μ2 = … = μk to determine whether any significant differences exist, and
(2) If H0 is rejected, run subsequent multiple comparison tests to determine which differences are significant.

Pairwise comparison of means can be performed using one of several multiple comparison tests specified using the MEANS statement, which has the following format:

MEANS independentvar / type comparison <means options>;

Where the comparison types are selected from the following (not an exhaustive list of SAS options):

BON            Bonferroni t-tests of differences
DUNCAN         Duncan's multiple range test
SCHEFFE        Scheffe multiple comparison procedure
SNK            Student-Newman-Keuls multiple range test
LSD            Fisher's Least Significant Difference test
TUKEY          Tukey's studentized range test
DUNNETT ('x')  Dunnett's test - compares to a single control, where 'x' is the category value of the control group

MEANS options are:

ALPHA=p value  Specifies the significance level for comparisons (default: 0.05)
CLDIFF         Requests that confidence limits be included in the output

For example, suppose we are comparing the time to relief of three headache medicines - brands 1, 2, and 3. The time-to-relief data are reported in minutes. For this experiment, fifteen subjects were randomly placed on one of the three medications. Which medicine (if any) is the most effective? The data for this example are as follows:

Brand 1   Brand 2   Brand 3
 24.5      28.4      26.1
 23.5      34.2      28.3
 26.4      29.5      24.3
 27.1      32.2      26.2
 29.9      30.1      27.8

EXAMPLE This example illustrates how to compare the means of three or more independent groups using SAS ANOVA. We also illustrate here a technique for performing multiple comparisons.

1. In the Editor window, open the file AANOVA2.SAS.

DATA ACHE;
INPUT BRAND RELIEF;
CARDS;
1 24.5
1 23.5
1 26.4
1 27.1
1 29.9
2 28.4
2 34.2
2 29.5
2 32.2
2 30.1
3 26.1
3 28.3
3 24.3
3 26.2
3 27.8
;
PROC ANOVA DATA=ACHE;
CLASS BRAND;
MODEL RELIEF=BRAND;
MEANS BRAND/TUKEY;
TITLE 'COMPARE RELIEF ACROSS MEDICINES - ANOVA EXAMPLE';
RUN;

Examine the PROC ANOVA statement:

 BRAND is the CLASS or grouping variable (containing three levels).
 The MODEL statement indicates that RELIEF is the dependent variable, whose means across groups are to be compared. The grouping factor is BRAND.
 The MEANS statement requests a multiple comparison test for BRAND using the Tukey method.

TABLE 1. ANOVA results

2. Run this analysis and observe the results. The analysis of variance (partial) results are shown in Table 1. The first analysis of variance table is a test of the full model, which in this case is the same as the test of the BRAND effect shown in the second table, because only one factor is in the model. This is a test of the hypothesis that all means are equal. The small p-value (p=0.009) provides evidence for rejecting the null hypothesis that the means are equal. If the p-value for the model is not significant (p > 0.05), end the analysis here and conclude that we fail to reject the null hypothesis that all means are equal. If, as in this case, the p-value is small, we can perform multiple comparisons to determine which means are different.


TABLE 2. Tukey multiple comparisons results

3. Observe the multiple comparison results, shown in Table 2.

This table graphically displays any significant mean differences using Tukey's multiple comparison test. Groups that are not significantly different from each other are included in the same Tukey Grouping. From Table 2 we see that there are two groupings, A and B. Notice that BRAND 2 is in a group by itself. This indicates that the mean for BRAND 2 (30.88) is significantly higher than (different from) the means of BRAND 1 (26.28) and BRAND 3 (26.54). Because BRAND 1 and BRAND 3 are in the same grouping, there is no significant difference between these two brands. Because a shorter time to relief is desirable, the conclusion would be that BRANDs 1 and 3 are preferable to BRAND 2.

4. Another method for displaying the Tukey results is provided when we include the option CLDIFF in the MEANS statement. Change the MEANS statement to read:

MEANS BRAND/TUKEY CLDIFF;

Run the program again and observe in Table 3 the Tukey comparison table showing simultaneous confidence limits on the difference between means.

TABLE 3. Simultaneous confidence limits


The asterisks (***) in the multiple comparisons table indicate paired comparisons that are significant at the 0.05 level. In this case, all comparisons except means 1 versus 3 are different. This indicates that mean time to relief for BRAND 3 is not significantly different from that of BRAND 1, but that the mean time to relief for BRAND 2 is significantly different from those for BRAND 1 and BRAND 3. The Simultaneous 95 percent Confidence Limits provide an estimate of how small or large the difference between the means is likely to be. We can use this information to assess the (clinical) importance of the difference. For example, the difference between BRANDs 2 and 3 could plausibly be as small as 0.69 and as large as 7.99. If such differences are determined to be of clinical significance, then the conclusions are the same as those obtained from Table 2—that is, BRANDs 1 and 3 are preferable to BRAND 2.

By changing ANOVA to GLM in this SAS code and re-running the program, we will see essentially the same results, so GLM could have been used for this one-way ANOVA problem. Some people choose to always use GLM where either GLM or ANOVA would apply. In the GLM output we will see results in tables labeled TYPE I and TYPE III sums of squares. In this example, the results in the two tables will be the same. In some more complex settings - for example, multi-way ANOVA designs with an unequal number of observations per cell - the TYPE I and TYPE III sums of squares will differ. In this case, the typical recommendation is to use the TYPE III sums of squares (see Elliott and Woodward, 1986).

COMPARING THREE OR MORE REPEATED MEASURES

Repeated measures are observations taken from the same or related subjects over time or in differing circumstances. Examples include weight loss or reaction to a drug over time. When there are two repeated measures, the analysis of the data becomes a paired t-test. When there are three or more repeated measures, the analysis is a repeated measures analysis of variance.


Assumptions for the repeated measures ANOVA are that the dependent variable is normally distributed and that the variances across the repeated measures are equal. As in the one-way ANOVA case, the test is robust against moderate departures from the normality and equal variance assumptions. As in the independent groups ANOVA procedure, we will usually perform the analysis in two steps. First, an analysis of variance will determine whether there is a difference in means across time. If a difference is found, then multiple comparisons can be performed to determine where the differences lie. The hypotheses being tested with repeated measures ANOVA are:

H0: There is no difference among the groups (repeated measures).
Ha: There is a difference among the groups.

For this analysis, the PROC GLM procedure will be used, because the complexity of this procedure is not supported in PROC ANOVA. The abbreviated syntax for PROC GLM is similar to that for PROC ANOVA:

PROC GLM <options>;
   CLASS variable;
   MODEL dependentvar = independentvars / options;
   MEANS independentvars / type comparison <means options>;

The CLASS, MODEL, and MEANS statements are essentially the same as for PROC ANOVA. These are not all of the options available in PROC GLM, but this list is sufficient for performing the analysis in this section. Also note that GLM gives TYPE I and TYPE III sums of squares.

The data in the following Hands-on Example are repeated measures of reaction times (OBS) of five persons after being treated with four drugs in randomized order. (These types of data may come from a crossover experimental design.) The data are as follows, where it is important to understand that, for example, the first row of results (i.e., 31, 29, 17, and 35) consists of results observed on Subject 1. The data must be entered into SAS in such a way that this relationship is identified. Note that in the SAS code to follow, each reading on the dependent variable (OBS) is identified with respect to its corresponding SUBJ and DRUG.

Subj   Drug1   Drug2   Drug3   Drug4
1      31      29      17      35
2      15      17      11      23
3      25      21      19      31
4      35      35      21      45
5      27      27      15      31

EXAMPLE This example illustrates how to compare three or more repeated measures (dependent samples) and perform pairwise comparisons using the DUNCAN option.

1. In the Editor window, open the file AGLM1.SAS.


DATA STUDY;
INPUT SUBJ DRUG OBS;
DATALINES;
1 1 31
1 2 29
1 3 17
1 4 35
2 1 15
...etc
5 3 15
5 4 31
;
run;
ODS HTML;
ODS GRAPHICS ON;
PROC GLM DATA=STUDY;
CLASS SUBJ DRUG;
MODEL OBS = SUBJ DRUG;
MEANS DRUG/DUNCAN;
TITLE 'Repeated Measures ANOVA';
RUN;
ODS HTML CLOSE;
ODS GRAPHICS OFF;

2. Run the program and observe the results. Several tables are included in the output. Table 4 shows the overall analysis of variance table and the "Type III SS" table. The test in the first table is an overall test to determine whether there are any significant differences across subjects or drugs. If this test is not significant, we can end our analysis and conclude that there is insufficient evidence to show a difference among subjects or drugs.

In this case p < 0.0001, so we continue to the Type III results table. In the Type III results table, the DRUG row reports a p - value of p < 0.0001. This is the test of the null hypothesis of interest, which is that there is no difference among the drugs. Because p < 0.05, we reject the null hypothesis and conclude that there is a difference among the drugs. Although SUBJ is included as a factor in the model statement, we will generally not be interested in a subject (SUBJ) effect.


TABLE 4. Analysis of variance results

3. The multiple comparison test results are shown in Table 5. This table is similar to the one discussed in the one-way ANOVA example shown in Table 2, except that in this example we have used the Duncan multiple range test rather than the Tukey test for multiple comparisons. The Duncan multiple range test for DRUG indicates that the time to relief for drug 3 is significantly lower than that for all other drugs, that there is no statistical difference between drugs 1 and 2, and that drug 4 has the highest time to relief of all drugs tested. Thus, on this basis, drug 3 would be the preferred drug.

4. Change the DUNCAN option to SNK (Student-Newman-Keuls multiple range test). Re-run the program and see whether the multiple comparison results have changed.

TABLE 5. Duncan's multiple comparison results


GOING DEEPER: GRAPHING MEAN COMPARISONS

A graphical comparison allows us to compare the groups visually. If the p-value is low in a t-test or ANOVA analysis, a visual analysis of the means can be used to illustrate the separation among the means. If the p-value is not significant, a graphical comparison will often show a fair amount of overlap among the groups. We use two types of graphs, dot plots and box plots, to illustrate this comparison and to examine assumptions.

EXAMPLE This example illustrates the plotting of means by group to accompany an ANOVA or t-test. This program uses the same ACHE data set that was used in AANOVA2.SAS in the previous section, "Using the MEANS Statement". In this example we illustrate the additional plots provided by ODS GRAPHICS along with an application of the GPLOT command.

1. In the Editor window, open the file AANOVA3.SAS.

ODS HTML;
ODS GRAPHICS ON;
TITLE 'GRAPHICAL ANOVA RESULTS - HEADACHE ANALYSIS';
PROC ANOVA DATA=ACHE;
CLASS BRAND;
MODEL RELIEF=BRAND;
MEANS BRAND/TUKEY;
TITLE 'COMPARE RELIEF ACROSS MEDICINES - ANOVA EXAMPLE';
RUN;
PROC GPLOT DATA=ACHE;
PLOT RELIEF * BRAND;
RUN;
QUIT;
ODS GRAPHICS OFF;
ODS HTML CLOSE;

Observe the two changes from AANOVA2.SAS:

 We have used the ODS GRAPHICS command, which produces side-by-side box-and-whiskers plots of RELIEF with BRAND as the grouping variable, as part of the one-way ANOVA output from PROC ANOVA.
 GPLOT produces a graphical plot (a scatterplot, which in this case results in side-by-side dot plots) with BRAND as the x-axis (horizontal axis) and RELIEF as the y-axis (vertical axis).

2. Using ODS GRAPHICS ON / ODS GRAPHICS OFF produces the box-and-whiskers plots seen in Figure 1. PROC GPLOT illustrates the time to relief as a series of points (by default shown as crosses) indicating individual observations for each BRAND in Figure 2. In both of these graphs we can see that times to relief for brands 1 and 3 are comparable and that the time to relief for brand 2 is higher than the others'. These graphs are plotted using default values; advanced plotting techniques can help make them more suitable for presentation. Note that side-by-side box-and-whiskers plots like those in Figure 1 could also have been created without using PROC ANOVA, with the commands:

PROC BOXPLOT DATA=ACHE;
PLOT RELIEF * BRAND;
RUN;


3. As a preview of some of the graphic enhancement options, open the SAS program file AANOV4A.SAS and observe the SYMBOL1 and AXIS1 statements. The SYMBOLn statement defines the symbol to be plotted (as a dot), and the AXIS1 statement (along with the /HAXIS=AXIS1 option) instructs SAS to offset the horizontal axis so that groups 1 and 3 are not as close to the sides of the graph. Run this program and compare the output to Figure 2.

FIGURE 1. Side-by-side boxplots produced by ODS GRAPHICS


FIGURE 2. Side-by-side dot plots produced by GPLOT

ODS HTML;
TITLE 'ANOVA RESULTS - HEADACHE ANALYSIS';
SYMBOL1 V=DOT;
AXIS1 OFFSET=(5);
PROC GPLOT DATA=ACHE;
PLOT RELIEF * BRAND / HAXIS=AXIS1;
RUN;
QUIT;
ODS HTML CLOSE;


Some Multivariate Techniques: Through SAS

Dharm Nath Jha Regional Centre, CIFRI, Allahabad [email protected]

Introduction

Biological, physical, behavioral, social, and educational phenomena are often determined by multiple factors. Therefore, any systematic attempt to understand them requires the examination of multiple dimensions that are usually interrelated. The subject of multivariate analysis deals with the statistical analysis of data collected on more than one variable. These variables may be correlated with each other. In practice, most data collection schemes or designed experiments result in multivariate data. A few examples of such situations are given below.

• During a survey of households, several measurements on each household are taken. These measurements, being taken on the same household, will be dependent. For example, the education level of the head of the household and the annual income of the family are related.

• In a designed experiment conducted in a research and development center, various factors are set up at desired levels and a number of response variables are measured for each of these treatment combinations. Due to dependence among responses, it may be more meaningful to analyze response variables simultaneously.

• During a production process, a number of different measurements such as the tensile strength, brittleness, diameter, etc. are taken on the same unit. Collectively such data are viewed as multivariate data.

Generally, data are analyzed by taking one variable at a time. But in all the above examples the collected data are multivariate in nature, and the inferences drawn by analyzing the data for each of the variables separately may be misleading. Therefore, the data on several variables should be analyzed using multivariate analytical techniques. Various statistical methods for describing and analyzing these multivariate data sets are Multivariate Analysis of Variance (MANOVA), Hotelling's T², Discriminant Analysis, Factor Analysis, Principal Component Analysis (PCA), Cluster Analysis, Canonical Correlation Analysis (CCA), etc. Here an attempt has been made to illustrate some multivariate techniques, namely PCA, Cluster Analysis and CCA, using the SAS software.

Principal Component Analysis

The purpose of principal component analysis is to derive a small number of linear combinations (principal components) of a set of variables that retain as much of the information in the original variables as possible. Often a small number of principal components can be used in place of the original variables for plotting, regression, clustering and so on. Principal component analysis can also be viewed as a technique to remove multicollinearity in the data. In this technique, we transform the original set of variables to a new set of uncorrelated random variables. These new variables are linear combinations of the original variables and are derived in decreasing order of importance, so that the first principal component accounts for as much as possible of the variation in the original data.
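In symbols, this is the standard eigenvalue formulation, stated here for reference: with $\Sigma$ the covariance matrix of $x = (x_1, \ldots, x_p)'$, the first principal component $z_1 = a_1'x$ solves

$$\max_{a_1}\ \mathrm{Var}(z_1) = a_1'\Sigma a_1 \quad \text{subject to } a_1'a_1 = 1 \;\;\Longrightarrow\;\; \Sigma a_1 = \lambda_1 a_1,$$

so $a_1$ is the eigenvector corresponding to the largest eigenvalue $\lambda_1$, and $\mathrm{Var}(z_1) = \lambda_1$. Subsequent components solve the same problem subject to being uncorrelated with the earlier ones.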



It is quite likely that the first few principal components account for most of the variability in the original data. If so, these few principal components can then replace the initial variables in subsequent analysis, thus reducing the effective dimensionality of the problem. An analysis of principal components often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result. However, Principal Components Analysis is more of a means to an end than an end in itself, because it frequently serves as an intermediate step in much larger investigations by reducing the dimensionality of the problem and providing easier interpretation. It is a mathematical technique which does not require the user to specify a statistical model or an assumption about the distribution of the original variates. It may also be mentioned that principal components are artificial variables, and often it is not possible to assign a physical meaning to them. Further, since Principal Components Analysis transforms the original set of variables to a new set of uncorrelated variables, it is worth stressing that if the original variables are already uncorrelated, there is no point in carrying out the analysis. It is important to note here that the principal components depend on the scale of measurement. The conventional way of getting rid of this problem is to use standardized variables with unit variances.

Example: Let us consider the following data on average minimum temperature (x1), average relative humidity at 8 hrs. (x2), average relative humidity at 14 hrs. (x3) and total rainfall in cm (x4), pertaining to Raipur district from 1970 to 1986, for the kharif season from 21st May to 7th Oct.

        x1     x2     x3      x4
       25.0    86     66    186.49
       24.9    84     66    124.34
       25.4    77     55     98.79
       24.4    82     62    118.88
       22.9    79     53     71.88
        7.7    86     60    111.96
       25.1    82     58     99.74
       24.9    83     63    115.20
       24.9    82     63    100.16
       24.9    78     56     62.38
       24.3    85     67    154.40
       24.6    79     61    112.71
       24.3    81     58     79.63
       24.6    81     61    125.59
       24.1    85     64     99.87
       24.5    84     63    143.56
       24.0    81     61    114.97
Mean   23.56  82.06  61.00  112.97
S.D.    4.13   2.75   3.97   30.06

with the variance-covariance matrix (only the upper triangle is shown, since the matrix is symmetric)

$$S = \begin{pmatrix} 17.02 & -4.12 & 1.54 & 5.14 \\ & 7.56 & 8.50 & 54.82 \\ & & 15.75 & 92.95 \\ & & & 903.87 \end{pmatrix}$$



Find the eigenvalues and eigenvectors of the above matrix and arrange the eigenvalues in decreasing order. The eigenvalues in decreasing order, with the corresponding eigenvectors, are

$$\lambda_1 = 916.902, \quad a_1 = (0.006,\ 0.061,\ 0.103,\ 0.993)$$
$$\lambda_2 = 18.375, \quad a_2 = (0.955,\ -0.296,\ 0.011,\ 0.012)$$
$$\lambda_3 = 7.870, \quad a_3 = (0.141,\ 0.485,\ 0.855,\ -0.119)$$
$$\lambda_4 = 1.056, \quad a_4 = (0.260,\ 0.820,\ -0.509,\ 0.001)$$

The principal components for these data will be

$$z_1 = 0.006x_1 + 0.061x_2 + 0.103x_3 + 0.993x_4$$
$$z_2 = 0.955x_1 - 0.296x_2 + 0.011x_3 + 0.012x_4$$
$$z_3 = 0.141x_1 + 0.485x_2 + 0.855x_3 - 0.119x_4$$
$$z_4 = 0.260x_1 + 0.820x_2 - 0.509x_3 + 0.001x_4$$

The variances of the principal components are the eigenvalues, i.e. $\mathrm{Var}(z_1) = 916.902$, $\mathrm{Var}(z_2) = 18.375$, $\mathrm{Var}(z_3) = 7.87$, $\mathrm{Var}(z_4) = 1.056$.

The total variation explained by the principal components is

$$\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 916.902 + 18.375 + 7.87 + 1.056 = 944.203$$

As such, it can be seen that the total variation explained by the principal components is the same as that explained by the original variables. It can also be shown, both mathematically and empirically, that the principal components are uncorrelated.
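A one-line justification of the uncorrelatedness, stated here for reference: for $i \neq j$,

$$\mathrm{Cov}(z_i, z_j) = a_i'\Sigma a_j = \lambda_j\, a_i'a_j = 0,$$

since eigenvectors of a symmetric matrix corresponding to distinct eigenvalues are orthogonal ($a_i'a_j = 0$).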

The proportion of total variation accounted for by the first principal component is

$$\frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4} = \frac{916.902}{944.203} = 0.97$$

Continuing, the first two components account for a proportion

$$\frac{\lambda_1 + \lambda_2}{\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4} = \frac{935.277}{944.203} = 0.99$$

of the total variance. Hence, in further analysis, the first one or two principal components z1 and z2 could replace the four variables, sacrificing only negligible information about the total variation in the system. The scores of the principal components can be obtained by substituting the values of the xi's in the equations for the zi's. For the above data, the first two principal component scores for the first observation, i.e. for the year 1970, can be worked out as

$$z_1 = 0.006(25.0) + 0.061(86) + 0.103(66) + 0.993(186.49) = 197.380$$
$$z_2 = 0.955(25.0) - 0.296(86) + 0.011(66) + 0.012(186.49) = 1.383$$

Similarly, for the year 1971,

$$z_1 = 0.006(24.9) + 0.061(84) + 0.103(66) + 0.993(124.34) = 135.54$$
$$z_2 = 0.955(24.9) - 0.296(84) + 0.011(66) + 0.012(124.34) = 1.134$$

Thus the whole data set with four variables can be converted to a new data set with two principal components, as sketched below.
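A minimal DATA step sketch of this conversion, assuming the raw data have been read into a data set named pca with variables x1-x4 (as in the SAS code given in the next section). Note that it reproduces the hand calculation above on the raw values, whereas PROC PRINCOMP centers the variables at their means before computing scores, so its scores differ from these by a constant shift:

data newdata;
  set pca;                                        /* pca: raw data with x1-x4   */
  /* score equations from the eigenvectors derived above */
  z1 = 0.006*x1 + 0.061*x2 + 0.103*x3 + 0.993*x4;
  z2 = 0.955*x1 - 0.296*x2 + 0.011*x3 + 0.012*x4;
  keep z1 z2;
run;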



PCA Through SAS

The following SAS steps may be used for performing principal component analysis.

The PROC PRINCOMP can be used for performing principal component analysis. Raw data, a correlation matrix, a covariance matrix or a sum of squares and cross products (SSCP) matrix can be used as input data. The data sets containing eigenvalues, eigenvectors, and standardized or unstandardized principal component scores can be created as output. The basic syntax of PROC PRINCOMP is as follows:

SAS Code

title 'PCA by using SAS';
ods graphics on;
data pca;
input x1 x2 x3 x4;
cards;
25.0 86 66 186.49
24.9 84 66 124.34
25.4 77 55 98.79
24.4 82 62 118.88
22.9 79 53 71.88
7.7 86 60 111.96
25.1 82 58 99.74
24.9 83 63 115.20
24.9 82 63 100.16
24.9 78 56 62.38
24.3 85 67 154.40
24.6 79 61 112.71
24.3 81 58 79.63
24.6 81 61 125.59
24.1 85 64 99.87
24.5 84 63 143.56
24.0 81 61 114.97
;
proc princomp cov out=new;
var x1 x2 x3 x4;
proc print data=new;
var x1 x2 x3 x4 prin1 prin2;
proc corr data=new;
var x1 x2 x3 x4 prin1 prin2;
run;

The PROC PRINCOMP and RUN statements are mandatory. However, the VAR statement, listing the numeric variables to be analysed, is usually used along with the PROC PRINCOMP statement. If the DATA= data set is TYPE=SSCP, the default set of variables does not include the intercept; therefore, INTERCEPT may also be included in the VAR statement. The following options are available in PROC PRINCOMP.



A. DATA SETS SPECIFICATION
1. DATA= SAS-data-set: names the SAS data set to be analysed. This data set can be an ordinary data set or a TYPE=CORR, COV, FACTOR, UCORR or UCOV data set.
2. OUT= SAS-data-set: creates an output data set containing the original data along with the principal component scores.
3. OUTSTAT= SAS-data-set: creates an output data set containing means, standard deviations, number of observations, correlations or covariances, eigenvalues and eigenvectors.

B. ANALYTICAL DETAILS SPECIFICATION
1. COV: computes the principal components from the covariance matrix. The default is computation of principal components from the correlation matrix.
2. N=: the non-negative integer equal to the number of principal components to be computed.
3. NOINT: omits the intercept from the model.
4. PREFIX= name: specifies a prefix for naming the principal components. The default is PRIN1, PRIN2, ... .
5. STANDARD (STD): standardizes the principal component scores to unit variance instead of the variance equal to the corresponding eigenvalue.
6. VARDEF=DF | N | WDF | WEIGHT: specifies the divisor used in calculating variances and standard deviations: DF (error degrees of freedom, the default), N (number of observations), WDF (sum of weights minus 1), or WEIGHT (sum of weights).
Besides these options, the NOPRINT option suppresses the output. The other statements used with PROC PRINCOMP are:

BY statement: obtains a separate analysis on observations in groups defined by the BY variables.

FREQ statement: names a variable that provides frequencies for each observation in the data set. Specifically, if n is the value of the FREQ variable for a given observation, then that observation is used n times.

PARTIAL statement: used to analyze a partial correlation or covariance matrix.

VAR statement: Lists the numeric variables to be analysed.

WEIGHT Statement: If we want to use relative weights for each observation in the input data set, place the weights in a variable in the data set and specify the name in a weight statement. This is often done when the variance associated with each observation is different and the values of the weight variable are proportional to reciprocals of the variances. The observation is used in the analysis only if the value of the WEIGHT statement variable is non-missing and greater than zero.

Other procedures closely related to PROC PRINCOMP are:
PROC PRINQUAL: performs a principal component analysis of qualitative data.
PROC CORRESP: performs correspondence analysis, which is a weighted principal component analysis of contingency tables.
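An illustrative sketch combining several of the options described above (the data set and variable names follow the pca example; the N=, PREFIX=, OUT= and OUTSTAT= choices are arbitrary, not prescriptions):

proc princomp data=pca cov n=2 prefix=pc out=pcscores outstat=pcstats;
  var x1-x4;              /* numeric variables to be analysed                  */
run;
/* pcscores: original data plus the scores pc1 and pc2                         */
/* pcstats : means, S.D.s, covariances, eigenvalues and eigenvectors           */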



SAS Output

Observations 17 Variables 4

Simple Statistics
               x1            x2            x3            x4
Mean  23.55882353   82.05882353   61.00000000   112.9735294
StD    4.12553918    2.74933147    3.96862697    30.0645310

Covariance Matrix
             x1           x2           x3           x4
x1   17.0200735   -4.1161765    1.5375000    5.1359669
x2   -4.1161765    7.5588235    8.5000000   54.8254044
x3    1.5375000    8.5000000   15.7500000   92.9456250
x4    5.1359669   54.8254044   92.9456250  903.8760243

Total Variance  944.20492132

Eigenvalues of the Covariance Matrix
      Eigenvalue   Difference   Proportion   Cumulative
1     916.903119   898.527591       0.9711       0.9711
2      18.375528    10.505165       0.0195       0.9905
3       7.870363     6.814450       0.0083       0.9989
4       1.055912                    0.0011       1.0000

Eigenvectors
          Prin1       Prin2       Prin3       Prin4
x1     0.005564    0.955110    0.141400    0.260269
x2     0.060795    -.295805    0.484926    0.820762
x3     0.102982    0.011430    0.854784    -.508536
x4     0.992808    0.011575    -.119152    0.001031



Obs    x1   x2   x3      x4      Prin1      Prin2
  1  25.0   86   66   186.49    73.7503     1.1188
  2  24.9   84   66   124.34    11.9251     0.8955
  3  25.4   77   55    98.79   -14.9967     3.0222
  4  24.4   82   62   118.88     5.9681     0.9006
  5  22.9   79   53    71.88   -41.8115    -0.2915
  6   7.7   86   60   111.96    -0.9579   -16.3359
  7  25.1   82   58    99.74   -13.4423     1.3019
  8  24.9   83   63   115.20     2.4811     1.0512
  9  24.9   82   63   100.16   -12.5115     1.1729
 10  24.9   78   56    62.38   -50.9839     1.8388
 11  24.3   85   67   154.40    41.9294     0.3860
 12  24.6   79   61   112.71    -0.4418     1.8962
 13  24.3   81   58    79.63   -33.4729     0.6009
 14  24.6   81   61   125.59    12.4672     1.4537
 15  24.1   85   64    99.87   -12.5185    -0.4705
 16  24.5   84   63   143.56    30.6957     0.7016
 17  24.0   81   61   114.97     1.9202     0.7577

6 Variables: x1 x2 x3 x4 Prin1 Prin2

Simple Statistics
Variable   N        Mean    Std Dev         Sum    Minimum    Maximum
x1        17    23.55882    4.12554   400.50000    7.70000   25.40000
x2        17    82.05882    2.74933        1395   77.00000   86.00000
x3        17    61.00000    3.96863        1037   53.00000   67.00000
x4        17   112.97353   30.06453        1921   62.38000  186.49000
Prin1     17           0   30.28041           0  -50.98386   73.75027
Prin2     17           0    4.28667           0  -16.33590    3.02220


Pearson Correlation Coefficients, N = 17
Prob > |r| under H0: Rho=0

             x1         x2         x3         x4      Prin1      Prin2
x1      1.00000   -0.36290    0.09391    0.04141    0.04084    0.99241
                    0.1522     0.7200     0.8746     0.8763     <.0001
x2     -0.36290    1.00000    0.77903    0.66329    0.66958   -0.46121
         0.1522                0.0002     0.0037     0.0033     0.0624
x3      0.09391    0.77903    1.00000    0.77899    0.78575    0.01235
         0.7200     0.0002                0.0002     0.0002     0.9625
x4      0.04141    0.66329    0.77899    1.00000    0.99994    0.00165
         0.8746     0.0037     0.0002               <.0001     0.9950
Prin1   0.04084    0.66958    0.78575    0.99994    1.00000    0.00000
         0.8763     0.0033     0.0002     <.0001                1.0000
Prin2   0.99241   -0.46121    0.01235    0.00165    0.00000    1.00000
         <.0001     0.0624     0.9625     0.9950     1.0000

Cluster Analysis

The basic aim of cluster analysis is to find "natural" or "real" groupings, if any, of a set of individuals (or objects, or points, or units, or whatever). This set of individuals may form a complete population or be a sample from a larger population. More formally, cluster analysis aims to allocate a set of individuals to a set of mutually exclusive, exhaustive groups such that individuals within a group are similar to one another while individuals in different groups are dissimilar. This set of groups is called a partition or cluster. Cluster analysis can also be used for summarizing the data rather than finding natural or real groupings.

Grouping or clustering is distinct from classification methods in the sense that classification pertains to a known number of groups, and the operational objective is to assign new observations to one of these groups. Cluster analysis is a more primitive technique in that no assumptions are made concerning the number of groups or the group structure.

Cluster analysis is used in diversified research fields. In biology, cluster analysis is used to identify diseases and their stages. For example, by examining patients who are diagnosed as depressed, one finds that there are several distinct subgroups of patients with different types of depression. In marketing, cluster analysis is used to identify persons with similar buying habits. By examining their characteristics it becomes possible to plan future marketing strategies more efficiently. Grouping is done on the basis of similarities or distances (dissimilarities). Some of these distance criteria are:

Euclidean distance: This is probably the most commonly chosen type of distance. It is the geometric distance in the multidimensional space and is computed as

$$d(x,y) = \left[\sum_{i=1}^{p}(x_i - y_i)^2\right]^{1/2} = \sqrt{(x-y)'(x-y)}$$

where x, y are the p-dimensional vectors of observations.

Note that Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data. This method has certain advantages (e.g., the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers). However, the distances can be greatly affected by differences in scale among the dimensions from which the distances are computed. For example, if one of the dimensions denotes a measured length in centimeters, and you then convert it to millimeters (by multiplying the values by 10), the resulting Euclidean or squared Euclidean distances (computed from multiple dimensions) can be greatly affected (i.e., biased by those dimensions which have a larger scale), and consequently, the results of cluster analyses may be very different. Generally, it is good practice to transform the dimensions so they have similar scales.

Squared Euclidean distance: This measure is used in order to place progressively greater weight on objects that are further apart. This distance is the square of the Euclidean distance.

Statistical distance: The statistical distance between the two p-dimensional vectors x and y is

$$d(x,y) = \sqrt{(x-y)'S^{-1}(x-y)}$$

where S is the sample variance-covariance matrix.

Many more distance measures are available in the literature.
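As an aside, SAS can compute such distance matrices directly with PROC DISTANCE. A minimal sketch, assuming the mammal-teeth data set (teeth) introduced later in this section; this procedure is an addition for illustration and is not part of the chapter's worked examples:

proc distance data=teeth method=euclid out=distmat;
  var interval(v1-v8);   /* v1-v8 treated as interval-scaled variables */
  id mammal;
run;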

The objective of cluster analysis is to group observations into clusters such that each cluster is as homogeneous as possible with respect to the clustering variables. The various steps in cluster analysis are:

(i) Select a measure of similarity.
(ii) Decide on the type of clustering technique to be used.
(iii) Choose the clustering method for the selected technique.
(iv) Decide on the number of clusters.
(v) Interpret the cluster solution.

No generalization about cluster analysis is possible, as a vast number of clustering methods have been developed in several different fields, with different definitions of clusters and similarities. There are many kinds of clusters, namely:
• Disjoint clusters, where every object appears in a single cluster.
• Hierarchical clusters, where one cluster can be completely contained in another cluster, but no other kind of overlap is permitted.
• Overlapping clusters.
• Fuzzy clusters, defined by a probability of membership of each object in each cluster.

Similarity Measures

A measure of closeness is required to form simple group structures from complex data sets. A great deal of subjectivity is involved in the choice of similarity measures. Important considerations are the nature of the variables (discrete, continuous or binary), the scales of measurement (nominal, ordinal, interval, ratio, etc.) and subject matter knowledge. If items are to be clustered, proximity is usually indicated by some sort of distance. Variables, however, are grouped on the basis of some measure of association, like the correlation coefficient. Some such distance measures were given above.


Hierarchical Agglomeration

Hierarchical clustering techniques begin with either a series of successive mergers or a series of successive divisions. Consider a natural process of grouping:
• Each unit is an entity to start with.
• Merge first those two units which are most similar (least dij); the merged pair now becomes an entity.
• Examine the mutual distances between the (n-1) entities.
• Merge those two that are most similar.
• Repeat the process and go on merging till all are merged to form one entity.
• At each stage of the agglomerative process, note the distance between the two merging entities.
• Choose the stage which shows a sudden jump in this distance (since it indicates that two very dissimilar entities are being merged). This choice can be subjective.

Distance Between Entities

As a large number of methods are available, it is not possible to enumerate them all here, but some of them are (see the PROC CLUSTER sketch after this list):
• Single linkage: works on the principle of smallest distance, or nearest neighbour.
• Complete linkage: works on the principle of the most distant, or farthest, neighbour dissimilarity.
• Average linkage: works on the principle of average distance, i.e. the average of the distances between a unit of one entity and a unit of the second entity.
• Centroid: assigns each item to the cluster having the nearest centroid (mean). The process has three steps: partition the items into k initial clusters; proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest, and recalculate the centroid for the cluster receiving the new item and the cluster losing the item; repeat this step until no more assignments take place.
• Ward's: forms clusters by maximizing within-cluster homogeneity; the within-group sum of squares is used as the measure of homogeneity.
• Two-stage density linkage: units are assigned to modal entities on the basis of densities (frequencies) of the kth nearest neighbours; modal entities are allowed to join later on.
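A minimal sketch of how these choices map onto PROC CLUSTER's METHOD= option (the data set and variable names follow the mammal example below; only the METHOD= value changes between runs):

proc cluster data=teeth method=ward outtree=tree;
  /* other METHOD= values: single, complete, average, centroid, twostage */
  var v1-v8;
  id mammal;
run;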

Cluster Analysis through SAS

The SAS procedures for clustering are oriented towards disjoint or hierarchical clusters computed from coordinate data, distances, or a correlation or covariance matrix. The following procedures are used for clustering:

CLUSTER: does hierarchical clustering of observations.
FASTCLUS: finds disjoint clusters of observations using a k-means method applied to coordinate data. Recommended for large data sets.


VARCLUS: is used for hierarchical as well as non-hierarchical clustering.

TREE: draws tree diagrams, or dendrograms, using output from the CLUSTER or VARCLUS procedures.

The TREE procedure is considered very important because it produces the dendrogram, using a data set created by the CLUSTER or VARCLUS procedure, and can also create output data sets giving the results of hierarchical clustering as a tree structure. The TREE procedure uses these output sets to print the diagram.

The following terminology relates to the TREE procedure:
Leaves: the objects that are clustered.
Root: the cluster containing all the objects.
Branch: a cluster containing at least two of the objects, but not all of them.
Node: a general term for leaves, branches and the root.
Parent and child: if A is the union of clusters B and C, then A is the parent and B and C are the children.

SAS Code Specifications

The TREE procedure is invoked by the statement PROC TREE, optionally followed by these statements:
NAME variables;
HEIGHT variables;
PARENT variables;
BY variables;
COPY variables;
FREQ variables;
ID variables;

If the data sets have been created by CLUSTER or VARCLUS, the only required statement is PROC TREE. The optional statements listed above are described after the PROC TREE statement.

PROC TREE

The PROC TREE statement starts the TREE procedure. The options that usually find a place in the PROC TREE statement are as given:

FUNCTION                              OPTIONS
Specify data sets                     DATA=  DOCK=  LEVEL=  NCLUSTERS=  OUT=
Specify cluster heights               HEIGHT=  DISSIMILAR=  SIMILAR=


Print horizontal trees                HORIZONTAL
Control the height axis               INC=  MAXHEIGHT=  MINHEIGHT=  NTICK=  PAGES=  POS=  SPACES=  TICKPOS=
Control characters printed in trees   FILLCHAR=  JOINCHAR=  LEAFCHAR=  TREECHAR=
Control sort order                    DESCENDING  SORT
Control output                        LIST  NOPRINT  PAGES

By default, the tree diagram is oriented with the height axis vertical and the object names at the top of the diagram. For a horizontal axis, the HORIZONTAL option can be used.

Example: The following data, along with the SAS code, relate to different kinds of teeth for a variety of mammals. The objective of the study is to identify suitable clusters of mammals based on the eight variables.

title1 'Cluster Analysis using mammal data';
data teeth;
input mammal $ v1 v2 v3 v4 v5 v6 v7 v8;
cards;
A 2 3 1 1 3 3 3 3
B 3 2 1 0 3 3 3 3
C 2 3 1 1 2 3 3 3
D 2 3 1 1 2 2 3 3
E 2 3 1 1 1 2 3 3
F 1 3 1 1 2 2 3 3
G 2 1 0 0 2 2 3 3
H 2 1 0 0 3 2 3 3
I 1 1 0 0 2 1 3 3
J 1 1 0 0 2 1 3 3
K 1 1 0 0 1 1 3 3
L 1 1 0 0 0 0 3 3
M 1 1 0 0 1 1 3 3
N 3 3 1 1 4 4 2 3
O 3 3 1 1 4 4 2 3
P 3 3 1 1 4 4 3 2


Q 3 3 1 1 4 4 1 2
R 3 3 1 1 3 3 1 2
S 3 3 1 1 4 4 1 2
T 3 3 1 1 3 3 1 2
U 3 3 1 1 4 3 1 2
V 3 2 1 1 3 3 1 2
W 3 3 1 1 3 2 1 1
X 3 3 1 1 3 2 1 1
Y 3 2 1 1 4 4 1 1
Z 3 2 1 1 4 4 1 1
AA 3 2 1 1 3 3 2 2
BB 2 1 1 1 4 4 1 1
CC 0 4 1 0 3 3 3 3
DD 0 4 1 0 3 3 3 3
EE 0 4 0 0 3 3 3 3
FF 0 4 0 0 3 3 3 3
;
proc cluster simple noeigen method=centroid rmsstd rsquare nonorm outtree=tree;
var v1-v8;
id mammal;
proc tree data=tree out=clus3 nclusters=3;
id mammal;
copy v1-v8;
proc sort; by cluster;
proc print; by cluster;
var mammal v1-v8;
title2 '3-cluster solution';
run;

SAS output

Variable      Mean   Std Dev  Skewness  Kurtosis  Bimodality
v1          2.0313    1.0920   -0.6993   -0.8885      0.6139
v2          2.4688    1.0155   -0.3040   -1.0806      0.4892
v3          0.7188    0.4568   -1.0216   -1.0246      0.8927
v4          0.6250    0.4919   -0.5421   -1.8244      0.8687
v5          2.8125    1.0607   -0.8124    0.2587      0.4647
v6          2.6875    1.0906   -0.5955   -0.2693      0.4449
v7          2.2188    0.9413   -0.4687   -1.7688      0.7894
v8          2.4375    0.7594   -0.9541   -0.5410      0.6889

Root-Mean-Square Total-Sample Standard Deviation = 0.898027


Cluster History
                                       RMS                       Cent
NCL  Clusters Joined   FREQ           STD    SPRSQ    RSQ        Dist   Tie
 31  I      J             2             0   0.0000   1.00           0    T
 30  K      M             2             0   0.0000   1.00           0    T
 29  N      O             2             0   0.0000   1.00           0    T
 28  Q      S             2             0   0.0000   1.00           0    T
 27  R      T             2             0   0.0000   1.00           0    T
 26  W      X             2             0   0.0000   1.00           0    T
 25  Y      Z             2             0   0.0000   1.00           0    T
 24  CC     DD            2             0   0.0000   1.00           0    T
 23  EE     FF            2             0   0.0000   1.00           0
 22  A      C             2        0.2500   0.0025   .998           1    T
 21  D      E             2        0.2500   0.0025   .995           1    T
 20  G      H             2        0.2500   0.0025   .993           1    T
 19  CL31   CL30          4        0.2041   0.0050   .988           1    T
 18  CL28   U             3        0.2041   0.0033   .984           1    T
 17  CL27   V             3        0.2041   0.0033   .981           1    T
 16  CL24   CL23          4        0.2041   0.0050   .976           1
 15  CL21   F             3        0.2887   0.0042   .972       1.118
 14  CL17   AA            4        0.2700   0.0054   .966      1.2019
 13  CL18   CL14          7        0.3363   0.0151   .951      1.3255
 12  CL22   CL15          5        0.3536   0.0108   .940      1.3437
 11  CL29   P             3        0.2887   0.0067   .934      1.4142    T
 10  CL25   BB            3        0.2887   0.0067   .927      1.4142
  9  CL11   CL13         10        0.4183   0.0292   .898      1.6673
  8  CL20   CL19          6        0.3708   0.0200   .878      1.7321
  7  CL9    CL10         13        0.4787   0.0403   .838      1.8696
  6  CL7    CL26         15        0.5129   0.0373   .800      2.0755
  5  CL12   B             6        0.4472   0.0200   .780      2.1909
  4  CL8    L             7        0.4564   0.0225   .758      2.2913
  3  CL5    CL16         10        0.6055   0.0870   .671      2.6926
  2  CL3    CL4          17        0.7676   0.1951   .476       3.078
  1  CL2    CL6          32        0.8980   0.4756   .000       3.455


[Dendrogram produced by PROC TREE: distance between cluster centroids (0 to 4) on the vertical axis, mammal on the horizontal axis.]

CLUSTER=1

Obs  mammal  v1  v2  v3  v4  v5  v6  v7  v8
  1  I        1   1   0   0   2   1   3   3
  2  J        1   1   0   0   2   1   3   3
  3  K        1   1   0   0   1   1   3   3
  4  M        1   1   0   0   1   1   3   3
  5  G        2   1   0   0   2   2   3   3
  6  H        2   1   0   0   3   2   3   3
  7  L        1   1   0   0   0   0   3   3

CLUSTER=2

Obs  mammal  v1  v2  v3  v4  v5  v6  v7  v8
  8  N        3   3   1   1   4   4   2   3
  9  O        3   3   1   1   4   4   2   3
 10  Q        3   3   1   1   4   4   1   2
 11  S        3   3   1   1   4   4   1   2
 12  R        3   3   1   1   3   3   1   2
 13  T        3   3   1   1   3   3   1   2
 14  W        3   3   1   1   3   2   1   1
 15  X        3   3   1   1   3   2   1   1
 16  Y        3   2   1   1   4   4   1   1
 17  Z        3   2   1   1   4   4   1   1
 18  U        3   3   1   1   4   3   1   2
 19  V        3   2   1   1   3   3   1   2
 20  AA       3   2   1   1   3   3   2   2
 21  P        3   3   1   1   4   4   3   2
 22  BB       2   1   1   1   4   4   1   1

CLUSTER=3

Obs  mammal  v1  v2  v3  v4  v5  v6  v7  v8
 23  CC       0   4   1   0   3   3   3   3
 24  DD       0   4   1   0   3   3   3   3
 25  EE       0   4   0   0   3   3   3   3
 26  FF       0   4   0   0   3   3   3   3
 27  A        2   3   1   1   3   3   3   3
 28  C        2   3   1   1   2   3   3   3
 29  D        2   3   1   1   2   2   3   3
 30  E        2   3   1   1   1   2   3   3
 31  F        1   3   1   1   2   2   3   3
 32  B        3   2   1   0   3   3   3   3

Canonical Correlation Analysis

Often in applied research, scientists encounter variables of large dimensions and are faced with the problems of understanding dependency structures, reducing dimensionality, constructing a subset of good predictors from the explanatory variables, and so on. Canonical Correlation Analysis (CCA) provides us with a tool to overcome these problems. However, its appeal and motivation differ between theoretical statisticians and social scientists.

Canonical correlation is a technique for analyzing the relationship between two sets of variables, one representing a set of independent variables, the other a set of dependent variables. Each set can contain several variables. Simple and multiple correlation are special cases of canonical correlation in which one or both sets contain a single variable. Whereas multiple correlation is used for many-to-one relationships, canonical correlation is used for many-to-many relationships. This analysis focuses on the correlation between a linear combination of the variables in one set and a linear combination of the variables in the second set. There may be more than one such linear correlation relating the two sets of variables, with each such correlation representing a different dimension by which the independent set of variables is related to the dependent set. The purpose of canonical correlation is to explain the relation of the two sets of variables, not to model the individual variables. The idea is first to determine the pair of linear combinations having the largest correlation. Next, we determine the pair of linear combinations having the largest correlation among all pairs uncorrelated with the initially selected pair. This process continues until the number of pairs of canonical variables equals the number of variables in the smaller group. The pairs of linear combinations are called the canonical variables, and their correlations are called canonical correlations. The canonical correlations measure the strength of association between the two sets of variables. The maximization aspect of the technique represents an attempt to concentrate a high-dimensional relationship between two sets of variables into a few pairs of canonical variables.

Analogous with ordinary correlation, the canonical correlation squared is the percent of variance in the dependent set explained by the independent set of variables along a given dimension (there may be more than one). In addition to asking how strong the relationship between two latent variables is, canonical correlation is useful in determining how many dimensions are needed to account for that relationship. Canonical correlation finds the linear combination of variables that produces the largest correlation with the second set of variables. This linear combination, or root, is extracted, and the process is repeated for the residual data, with the constraint that the second linear combination of variables must not correlate with the first one. The process is repeated until a successive linear combination is no longer significant.
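For reference, the first canonical correlation can be written as follows (a standard formulation; here $\Sigma_{xx}$, $\Sigma_{yy}$ and $\Sigma_{xy}$ denote the within-set and between-set covariance matrices):

$$\rho_1 = \max_{a,\,b}\ \mathrm{Corr}(a'x,\ b'y) = \max_{a,\,b}\ \frac{a'\Sigma_{xy}b}{\sqrt{a'\Sigma_{xx}a}\ \sqrt{b'\Sigma_{yy}b}},$$

with subsequent pairs $(a_k'x,\ b_k'y)$ maximizing the same ratio subject to being uncorrelated with all earlier pairs.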

CCA Through SAS

The PROC CANCORR procedure tests a series of hypotheses that each canonical correlation and all smaller correlations are zero in the population, using an F approximation. At least one of the two sets of variables should have an approximate multivariate normal distribution. PROC CANCORR can also perform partial canonical correlation, a multivariate generalization of ordinary partial correlation. Most commonly used parametric statistical methods, ranging from t-tests to multivariate analysis of covariance, are special cases of partial canonical correlation.

Example: A mean-corrected data set of 4 variables has been created for illustration; it is shown in the code between the cards statement and the terminating semicolon.

SAS Code

title "Canonical Correlation using hypothetical data";
data cancor;
input x1 x2 y1 y2;
cards;
1.051 -0.435 0.083 0.538
-0.419 -1.335 -1.347 -0.723
1.201 0.445 1.093 -0.112
0.661 0.415 0.673 -0.353
-1.819 -0.945 -0.817 -1.323
-0.899 0.375 -0.297 -0.433
3.001 1.495 1.723 2.418
-0.069 -2.625 -2.287 -1.063
-0.919 0.385 -0.547 0.808
-0.369 -0.265 -0.447 -0.543
-0.009 -0.515 0.943 -0.633
0.841 1.915 1.743 1.198
0.781 1.845 1.043 2.048
0.631 -0.495 0.413 -0.543
-1.679 -0.615 -1.567 -0.643
-0.229 -0.525 -0.777 -0.252
-0.709 -0.975 0.523 -0.713
-0.519 0.055 -0.357 0.078
0.051 0.715 0.133 0.328
0.221 0.245 0.403 0.238
-1.399 -0.645 -0.817 -1.133
0.651 0.385 1.063 -0.633
-0.469 -0.125 -0.557 -0.393
0.421 1.215 -0.017 1.838
;
proc cancorr all vname='y variables' wname='x variables'
  vprefix=response wprefix=explanatory;
var y1 y2;
with x1 x2;
run;

SAS Output

y variables      2
x variables      2
Observations    24

Means and Standard Deviations
Variable           Mean   Standard Deviation
y1          0.000083333             1.018454
y2         -0.000041667             1.011011
x1             0.000167             1.052049
x2            -0.000417             1.032847

Correlations Among the y variables
         y1       y2
y1   1.0000   0.5511
y2   0.5511   1.0000

Correlations Among the x variables
         x1       x2
x1   1.0000   0.5233
x2   0.5233   1.0000


Correlations Between the y variables and the x variables
         x1       x2
y1   0.7101   0.7551
y2   0.6604   0.8094

                                                              Eigenvalues of Inv(E)*H
                  Adjusted      Approximate   Squared            = CanRsq/(1-CanRsq)
    Canonical     Canonical     Standard      Canonical
    Correlation   Correlation   Error         Correlation   Eigenvalue  Difference  Proportion  Cumulative
1   0.961496      0.959656      0.015748      0.924475      12.2407     12.2282     0.9990      0.9990
2   0.111255      .             0.205933      0.012378       0.0125                 0.0010      1.0000

Test of H0: The canonical correlations in the current row and all that follow are zero
    Likelihood   Approximate
    Ratio        F Value       Num DF   Den DF   Pr > F
1   0.07458976   26.62         4        40       <.0001
2   0.98762226   0.26          1        21       0.6133

Multivariate Statistics and F Approximations, S=2 M=-0.5 N=9
Statistic                 Value         F Value   Num DF   Den DF   Pr > F
Wilks' Lambda             0.07458976    26.62     4        40       <.0001
Pillai's Trace            0.93685315    9.25      4        42       <.0001
Hotelling-Lawley Trace    12.25325493   60.39     4        23       <.0001
Roy's Greatest Root       12.24072206   128.53    2        21       <.0001
NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.

Raw Canonical Coefficients for the y variables
         response1       response2
y1    0.5399285948    -1.045524484
y2    0.5790856145     1.034294423

Raw Canonical Coefficients for the x variables
        explanatory1    explanatory2
x1    0.4245987011    -1.031441494
x2    0.6689973036    0.9183078485


Standardized Canonical Coefficients for the y variables
      response1   response2
y1       0.5499     -1.0648
y2       0.5855      1.0457

Standardized Canonical Coefficients for the x variables
      explanatory1   explanatory2
x1          0.4467        -1.0851
x2          0.6910         0.9485

Correlations Between the y variables and Their Canonical Variables
      response1   response2
y1       0.8725     -0.4885
y2       0.8885      0.4588

Correlations Between the x variables and Their Canonical Variables
      explanatory1   explanatory2
x1          0.8083        -0.5888
x2          0.9247         0.3807

Correlations Between the y variables and the Canonical Variables of the x variables
      explanatory1   explanatory2
y1          0.8390        -0.0544
y2          0.8543         0.0510

Correlations Between the x variables and the Canonical Variables of the y variables
      response1   response2
x1       0.7771     -0.0655
x2       0.8891      0.0424

Raw Variance of the y variables Explained by
             Their Own Canonical Variables    The Opposite Canonical Variables
Canonical
Variable                   Cumulative     Canonical                  Cumulative
Number      Proportion     Proportion     R-Square     Proportion    Proportion
1           0.7753         0.7753         0.9245       0.7167        0.7167
2           0.2247         1.0000         0.0124       0.0028        0.7195

Raw Variance of the x variables Explained by
             Their Own Canonical Variables    The Opposite Canonical Variables
Canonical
Variable                   Cumulative     Canonical                  Cumulative
Number      Proportion     Proportion     R-Square     Proportion    Proportion
1           0.7523         0.7523         0.9245       0.6955        0.6955
2           0.2477         1.0000         0.0124       0.0031        0.6986


Standardized Variance of the y variables Explained by
             Their Own Canonical Variables    The Opposite Canonical Variables
Canonical
Variable                   Cumulative     Canonical                  Cumulative
Number      Proportion     Proportion     R-Square     Proportion    Proportion
1           0.7754         0.7754         0.9245       0.7168        0.7168
2           0.2246         1.0000         0.0124       0.0028        0.7196

Standardized Variance of the x variables Explained by
             Their Own Canonical Variables    The Opposite Canonical Variables
Canonical
Variable                   Cumulative     Canonical                  Cumulative
Number      Proportion     Proportion     R-Square     Proportion    Proportion
1           0.7542         0.7542         0.9245       0.6972        0.6972
2           0.2458         1.0000         0.0124       0.0030        0.7003

Squared Multiple Correlations Between the y variables and the First M Canonical Variables of the x variables
        M=1       M=2
y1   0.7038    0.7068
y2   0.7298    0.7324

Squared Multiple Correlations Between the x variables and the First M Canonical Variables of the y variables
        M=1       M=2
x1   0.6039    0.6082
x2   0.7905    0.7923



The JMP Software: A menu-driven statistical tool

Dileep K. Panda Directorate of Water Management Bhubaneswar-751023 [email protected]

Introduction

The software JMP was developed by SAS Institute although it is not a part of the SAS System. However, portions of JMP were adapted from routines in the SAS System, particularly for linear algebra and probability calculations. The JMP provides a rich variety of statistical and graphical methods organized into a small number of interactive platforms. The JMP offers descriptive statistics and simple analyses for beginning statisticians and complex model fitting for advanced researchers. Standard statistical analysis and specialty platforms for design of experiments, statistical quality control, contour plotting, and survival analysis provide the tools you need to analyze data and see results quickly. The mission of JMP is to help researchers to analyze and make discoveries in their data. The JMP is easy to learn. Statistics are organized into logical areas with appropriate graphs and tables, which help you find patterns in data, identify outlying points, or fit models. Appropriate analyses are defined and performed for you, based on the types of variables you have and the roles they play. The JMP is software for interactive statistical graphics and includes:

 A spreadsheet for viewing, editing, entering, and manipulating data;  A broad range of graphical and statistical methods for data analysis;  An extensive design of experiments module;  Options to select and display subsets of the data;  Data management tools for sorting and combining tables;  A calculator for each table column to compute values;  A facility for grouping data and computing summary statistics;  Special plots, charts, and communication capability for quality improvement techniques;  Tools for moving analysis results between applications and for printing;  A scripting language for saving frequently used routines.


When JMP opens, you see the Tip of the Day window. This window provides tips about using JMP that you might not know. Some tips are basic introductory information, and others give hidden power features that you should learn after getting comfortable with the basics. Upon startup, the JMP Starter window is located behind the Tip of the Day window. Most of the commands found on the JMP Starter are a duplication of commands found in the main menu and toolbars. To open and close the JMP Starter, select View (Window on the Macintosh) > JMP Starter. You can also stop the JMP Starter from appearing upon startup by selecting File (JMP on the Macintosh) > Preferences and unchecking Initial JMP Starter Window.

To create a new data table, Select File > New > Data Table. This shows an empty data table with no rows and one numeric column, labeled Column 1. Move the cursor onto a cell. Click the cell. The cursor becomes an I-beam. There are several ways to fill a table with values such as

 Create new rows and columns and type or paste data into the data grid.  Construct a formula to calculate column values.  Import data from another application.  Copy values from another application and paste them into the table.  Use a measuring instrument to read external measures.  Drag columns from one table to another

If you want to import a file that is a JMP data table (.jmp), script (.jsl), journal (.jrn), or report (.jrp), then select File > Open and select the file type from the window that appears. If you have data that exist in a format other than a .jmp file, you can import them and save them as a JMP data table. The list below gives the file types you can import into JMP. The important ones are Microsoft Excel (.xls), Microsoft Excel 2007 (.xlsm, .xlsx, .x$sb) on Windows, Text (.txt), Text with comma-separated values (.csv), Data (.dat) files, and HTML (.htm, .html) on Windows. You can open Microsoft Excel files in JMP for Windows and Macintosh. You can open Office spreadsheets by selecting File > Open; on Windows, from the Files of type field, select Excel Files (*.XLS).

Table 1 Pumping test results of 17 bore wells in Balasore

Block     Well  Id   Depth   Discharge   Thickness   Transmissivity   Storativity
                     (m)     (L/Sec)     (m)         (m2 day-1)       (×10-3)
jaleswar  w1    1    60.90   11.50       22.00       2.18             6.30
jaleswar  w2    1    48.70   11.28       17.06       3.03             2.20
jaleswar  w3    1    88.40   31.00       30.48       2.97             3.06
jaleswar  w4    1    50.00    7.75       20.50       2.88             9.30
jaleswar  w5    1    94.50   27.50       33.80       3.07             4.50
bhograi   w6    2    42.00   12.17       17.00       3.43             5.40
bhograi   w7    2    40.00   17.91       19.00       3.53             3.20
bhograi   w8    2    91.50   20.90       31.08       2.53             3.80
bhograi   w9    2    88.30   24.50       27.43       3.30             2.70
baliapal  w10   3    64.00   22.43       24.38       2.75             1.10
baliapal  w11   3    64.30   34.16       21.34       3.03             9.20


baliapal  w12   3    82.30   28.98       21.53       2.84             6.50
basta     w13   4    92.50   53.50       56.00       3.43             1.20
basta     w14   4    67.00   25.00       20.60       2.12             1.30
basta     w15   4    35.00   14.81       18.50       3.28             1.60
basta     w16   4    79.24   11.50       18.46       2.09             7.10

To add new empty rows, select Rows > Add Rows. Then enter the number and location of rows you want to add. By default, new rows appear at the end of the table. Click in a cell anywhere below the last row in a table and begin typing; then press Enter (or Return) to automatically generate new rows up to and including the row with the value you typed. Double-click an empty row number area below the last row to add that many empty rows. To add new empty columns, double-click the empty space to the right of the last data table column and begin typing, or select Cols > New Column.

The JMP can find all cells whose values are the same as the one(s) you currently have highlighted. You can do this within one data table or throughout all open data tables. To select cells that contain the same values, highlight the cells that contain the value(s) you want to locate. To find all matching cells within the active data table, select Rows > Row Selection > Select Matching Cells. Or, right-click (Ctrl-click on the Macintosh) one of the highlighted row numbers and select Select Matching Cells. To find all matching cells across all open data tables, select Select All Matching Cells. The row(s) that contain the same values as the highlighted ones will highlight.

The Analyze Menu

Each Analyze command launches a platform. A platform is an interactive window you use to analyze data, work with points on plots, and save results. The reports in a JMP analysis are organized hierarchically. Methods unfold that suit the context of your data. Many results appear automatically and more are offered through drop-down menus. Choosing Distribution launches the Distribution platform, which describes a distribution of values with histograms and other graphical and textual reports:

 Continuous columns display a histogram and box plots. You can test the mean and standard deviation of the distribution and select from a variety of distribution fits. For continuous variables, capability analysis is available.  Nominal or ordinal columns are shown with a histogram of relative frequency for each level of the ordinal or nominal variable. You have the option to view a mosaic (stacked) bar chart as well as options to test probabilities.

The Distribution report has two menus for fitting distributions, one for continuous variables and one for discrete variables. When a distribution is selected, a table that shows the fitted parameter estimates is appended to the report. The Normal fitting option estimates the parameters of the normal distribution based on the analysis sample. The parameters for the normal distribution are μ (the mean), which defines the location of the distribution on the x- axis and σ (standard deviation), which defines the dispersion or spread of the distribution.


The standard normal distribution occurs when μ = 0 and σ = 1. The Parameter Estimates table for the normal distribution fit shows mu (the estimate of μ) and sigma (the estimate of σ), with upper and lower 95% confidence limits. The normal distribution is often used to model measures that are symmetric, with most of the values falling in the middle of the curve. You can choose Normal fitting for any set of data and test how well a normal distribution fits your data. This example shows the Density Curve option from the Fitted Normal popup menu. This option overlays the density curve on the histogram, using the parameter estimates from the data.
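For reference, the density being fitted is the familiar normal curve (a standard formula, included here for completeness):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where μ locates the curve on the x-axis and σ controls its spread.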

To select a point in a plot, click the point with the arrow cursor. This selects the point as well as the corresponding row in the current data table. To keep all points selected, press the Shift key while you click new points. A point's label appears when you place the cursor over the point, with or without clicking. All graphs and plots that represent the same data table are linked to each other and to the corresponding data table. When you click points in plots or bars of a graph, the corresponding rows highlight in the data table. You can also extend the selection of bars in a histogram by pressing Shift and then clicking them. For scrolling and scaling axes, the hand tool (also known as the grabber tool) provides a way to change the axes and view of a plot.


[Screenshot: distributions fitted to the sample, overlaid on the histogram: Normal(2.92528, 0.46171), LogNormal(1.06054, 0.16421), Weibull(3.11012, 8.35293), Extreme Value(1.13466, 0.11972)]

Fit Model

Fit Model lets you tailor an analysis using a model specific for your data. You select columns, assign roles, and build the model to fit in the Fit Model window.

Fit Model fits one or more y variables to a model of x variables. You select the kind of model appropriate to your data from the menu of fitting personalities given in the Fit Model window. The fitting personalities available depend on the kind of responses you select. The following list briefly describes the different fitting techniques:


 Standard Least Squares: Gives a least squares fit for a single continuous response, accompanied by leverage plots and an analysis of variance table.  Screening: Produces an exploratory screening analysis for single or multiple y columns with continuous values.  Stepwise: Gives a stepwise regression for a single continuous or categorical y and all types of effects.  Manova: Performs a multivariate analysis of variance for multiple continuous response columns. Manova displays a window that lets you fit multivariate models interactively.  Loglinear Variance: Is for a single continuous response and estimates parameters that optimize both a mean and a variance.  Nominal Logistic: Fits a single nominal response with nominal regression by maximum likelihood.  Ordinal Logistic: Fits a single ordinal response with ordinal cumulative logistic regression by maximum likelihood.  Proportional Hazard: Performs a proportional hazard (Cox) model fit for survival analysis of censored data with a single continuous response.  Parametric Survival: Tests the fit of an exponential, Weibull, or lognormal distribution.

One-way ANOVA (CRD)


Example of Hybrid-rice Factorial Experiment at DWM, Bhubaneswar

Repl  Var  Irri  Fert  Yield
1     1    1     1     2.13
1     1    1     2     3.18
1     1    1     3     4.97
1     1    1     4     5.65
1     1    2     1     2.55
1     1    2     2     2.85
1     1    2     3     4.3
1     1    2     4     4.65
1     1    3     1     2.95
1     1    3     2     3.18
1     1    3     3     3.88
1     1    3     4     3.15
1     2    1     1     1.9
1     2    1     2     2.82
1     2    1     3     3.35
1     2    1     4     3.05
1     2    2     1     1.88
1     2    2     2     2.62
1     2    2     3     3.15
1     2    2     4     2.85
1     2    3     1     2.17
1     2    3     2     3.68
1     2    3     3     4.01
1     2    3     4     3.48
2     1    1     1     2.33
2     1    1     2     3.63
2     1    1     3     4.82
2     1    1     4     5.53
2     1    2     1     2.35
2     1    2     2     3.99
2     1    2     3     4.25
2     1    2     4     4.85
2     1    3     1     2.94
2     1    3     2     3.55
2     1    3     3     4.25
2     1    3     4     3.22
2     2    1     1     1.99
2     2    1     2     2.85
2     2    1     3     3.78
2     2    1     4     3.15
2     2    2     1     2.1
2     2    2     2     2.88
2     2    2     3     3
2     2    2     4     3
2     2    3     1     2.55
2     2    3     2     3.54
2     2    3     3     3.57
2     2    3     4     2.9
3     1    1     1     2.23
3     1    1     2     3.4
3     1    1     3     4.89
3     1    1     4     5.59
3     1    2     1     2.45
3     1    2     2     3.42
3     1    2     3     4.27
3     1    2     4     4.65
3     1    3     1     2.94
3     1    3     2     3
3     1    3     3     3.18
3     1    3     4     4.06
3     2    1     1     2
3     2    1     2     2.83
3     2    1     3     3.56
3     2    1     4     3.1
3     2    2     1     2.1
3     2    2     2     2.95
3     2    2     3     3.08
3     2    2     4     2.92
3     2    3     1     2.36
3     2    3     2     3.11
3     2    3     3     3
3     2    3     4     3.76
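Since the other chapters of this manual work in Base SAS, a minimal SAS sketch of the corresponding factorial analysis may be useful for comparison with the JMP Fit Model output (RICE is an assumed data set name holding the table above; treating Repl as a block and fitting the full Var x Irri x Fert factorial are modelling choices, not prescriptions):

proc glm data=rice;
  class repl var irri fert;           /* all factors are categorical         */
  model yield = repl var|irri|fert;   /* repl as block; full factorial in
                                         var, irri and fert                  */
run;
quit;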



Split plot analysis


Multiple Regression

Time Series: explores, analyzes, and forecasts univariate time series. The Time Series platform also supports transfer function models. The launch window (role assignment window) requires that one or more continuous variables be assigned as the time series. Optionally, you can specify a time ID variable, which is used to label the time axis. If a time ID variable is specified, it must be continuous, sorted ascending, and evenly spaced with no missing values.

The analysis begins with a plot of the points in the time series. In addition, the platform displays graphs of the autocorrelations and partial autocorrelations of the series. These indicate how and to what degree each point in the series is correlated with earlier values in the series. You can interactively add:

 Variograms--characterizations of process disturbances
 AR coefficients--autoregressive coefficients
 Spectral density plots--period and frequency plots with white noise tests

These graphs can be used to identify the type of model appropriate for describing and predicting (forecasting) the evolution of the time series. The model types include:

 ARIMA--autoregressive integrated moving-average, often called Box-Jenkins models
 Seasonal ARIMA--ARIMA models with a seasonal component
 Smoothing Model--several forms of exponential smoothing and Winters Method

Multivariate Methods

Explores how multiple variables relate to each other and how points fit that relationship. This platform helps you see correlations between two or more response (y) variables, look for points that are outliers, and examine principal components to look for factors. The multivariate platform appears showing correlations and a scatterplot matrix. Options give: inverse, partial, nonparametric, and pairwise correlations with accompanying bar charts; a matrix of bivariate scatterplots with a plot for each pair of y variables; a Mahalanobis distance outlier plot; and a jackknifed multivariate distance outlier plot, where the distance for each point is calculated excluding the point itself. There are options with these plots to save the distance scores. You can also request principal components, standardized principal components, rotation of a specified number of components, and factor analysis information.


Cluster

Clusters rows of a JMP data table. Cluster can perform a hierarchical or a k-means clustering method. The hierarchical cluster platform displays results as a tree diagram of the clusters, called a dendrogram, followed by a plot of the distances between clusters. The dendrogram has a sliding cluster selector that lets you identify the rows in any size cluster. There are options to save the cluster number of each row. Hierarchical clustering uses these five clustering methods:
 Average linkage computes the distance between two clusters as the average distance between pairs of observations, one in each cluster.
 Centroid method computes the distance between two clusters as the squared Euclidean distance between their means.
 Ward's minimum variance method (the default) uses as the distance between two clusters the ANOVA sum of squares between the two clusters added over all the variables.
 Single linkage uses as the distance between two clusters the minimum distance between an observation in one cluster and an observation in the other cluster.
 Complete linkage uses as the distance between two clusters the maximum distance between an observation in one cluster and an observation in the other cluster.
The k-means clustering approach finds disjoint clusters on the basis of Euclidean distances computed from one or more quantitative variables. Every observation belongs to only one cluster; the clusters do not form a tree structure as with hierarchical clustering. You specify the number of clusters you want. The cluster platform also has options to do normal mixture clustering and SOMs (self-organizing maps). The chapter "Clustering" of the JMP Statistics and Graphics Guide describes the Cluster command in detail and shows clustering examples.
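For comparison, the Base SAS counterpart of JMP's k-means option is PROC FASTCLUS, introduced in the clustering chapter above. A minimal sketch on that chapter's teeth data (the choice of three clusters is arbitrary, for illustration only):

proc fastclus data=teeth maxclusters=3 out=kmout;
  var v1-v8;       /* k-means on the eight quantitative variables */
run;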

Principal Components: Derives a small number of independent linear combinations (principal components) of a set of variables that capture as much of the variability in the original variables as possible. JMP also offers several types of orthogonal and oblique factor-analysis-style rotations to help interpret the extracted components.

Discriminant: Provides a method of predicting the level of a one-way classification based on known values of the responses. The technique is based on how close a set of measurement variables is to the multivariate means of the levels being predicted. Optionally, you can do stepwise discriminant analysis. See the JMP Statistics and Graphics Guide for details.

PLS (Partial Least Squares): Fits models using the partial least squares (PLS) method that balances the two objectives of explaining response variation and explaining predictor variation. The PLS techniques work by extracting successive linear combinations of the predictors, called factors (also called components or latent vectors) that address one or both of these two goals. The PLS platform in JMP also enables you to select the number of extracted factors by cross validation, which involves fitting the model to part of the data and minimizing the prediction error for the unfitted part. See the JMP Statistics and Graphics Guide for details.



The Graph Menu

Graph menu commands produce windows that contain specialized graphs or plots with supporting tables and statistics.

Chart: The Chart command gives a chart for every numeric y variable specified, where the y's are statistics to chart. The x values are always treated as discrete values. By default, a vertical bar chart appears, but there are options to show horizontal bar charts, line charts, step charts, needle charts, point charts, or pie charts. You can specify up to two x variables for grouping on the chart itself. The first x is the group variable, and the second x is the level (subgroup) variable. If there is no x variable, then each row is a bar.

Overlay Plot: The Overlay Plot command gives an overlay of a single numeric or categorical x column and all specified numeric y variables. The axis can have either a linear or a log scale. Optionally, the plots for each y can be shown separately, with or without a common x-axis. By default, the values of the x variable are in ascending order, and the points are plotted in that order. You have the option of plotting the x values as they are encountered in the data table. Note: For scatterplots of two variables with regression fitting options, use the Fit Y by X command instead of Overlay Plot. The chapter "Overlay Plots" of the JMP Statistics and Graphics Guide describes the Overlay Plot command in detail and shows examples of plotting data.

Scatterplot 3D

Scatterplot 3D: Scatterplot 3D produces a three-dimensional spinnable display of values from any three numeric columns in the active data table. It also produces an approximation to higher dimensions through principal components, standardized principal components, rotated components, and biplots. There are options to save principal component scores, standardized scores, and rotated scores. The Scatterplot 3D platform also gives factor-analysis-style rotations of the principal components to form orthogonal combinations that correspond to directions of variable clusters in the space. The method used is called a varimax rotation, and is the same method that is traditionally used in factor analysis. See the JMP Statistics and Graphics Guide for details about the Scatterplot 3D command and examples of plotting data and computing principal components.

Contour Plot: The Contour Plot command constructs a contour plot for one or more response variables, y, for the values of two x variables. Contour Plot assumes the x values lie in a rectangular coordinate system, but the observed points do not have to form a grid. Some contour plot options are: Show or hide data points, show or hide triangulation and boundary, specify and label levels, show a line contour or fill areas.

Bubble Plot: A Bubble Plot is a scatter plot which represents its points as circles (bubbles). Optionally, the bubbles can be sized according to another column, colored by another column, aggregated across groups defined by one or more other columns, and dynamically indexed by a time column. With the opportunity to see up to five dimensions at once (x position, y position, size, color, and time), bubble plots can produce dramatic visualizations and make interesting discoveries easy. For details, see the JMP Statistics and Graphics Guide.

Parallel Plot: The Parallel Plot command draws a parallel coordinate plot, which shows connected line segments representing each row of a data table.

[Figure: Bubble Plot of Discharge by Depth]

Scatterplot Matrix: The Scatterplot Matrix command allows quick production of scatterplot matrices. These matrices are orderly collections of bivariate graphs, assembled so that comparisons among many variables can be conducted visually. In addition, the plots can be customized and decorated with other analytical quantities (such as density ellipses) to allow for further analysis. These matrices can be square, showing the same variables on both sides of the matrix, or triangular, showing only unique pairs of variables in either a lower or upper triangular fashion. In addition, you can specify that different variables be shown on the sides and bottom of the matrix, giving maximum flexibility for comparisons. For details, see the JMP Statistics and Graphics Guide.

Control Chart: The Control Chart menu has a sub-menu that creates dynamic plots of sample subgroups as they are received and recorded. Control charts are a graphical analytic tool used for statistical quality improvement. Control charts can be broadly classified according to the type of data analyzed: Control charts for variables are used when the quality characteristic to be analyzed is measured on a continuous scale. Control charts for attributes are used when the quality characteristic is measured by counting the number of nonconformities (defects) in an item or by counting the number of nonconforming (defective) items in a sample. The concepts underlying the control chart are that the natural variability in any process can be quantified with a set of control limits, and that variation exceeding these limits signals a special cause of variation. In industry, control charts are commonly used for studying the variation in output from a manufacturing process. They are typically used to distinguish variation due to special causes from variation due to common causes.

The control chart platform offers the following kinds of charts (a SAS/QC sketch of two common charts follows the list):

 Mean, range, and standard deviation
 Individual measurement and moving range (run chart, XBar chart, and IR)
 p-chart, np-chart, c-chart, and u-chart
 UWMA and EWMA
 CUSUM
 Presummarized
 Levey-Jennings
 Multivariate control charts
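These charts correspond to procedures in SAS/QC: PROC SHEWHART for the classical charts, PROC CUSUM for cumulative sum charts, and PROC MACONTROL for UWMA and EWMA charts. A minimal sketch, with hypothetical data set and variable names:

   /* Mean and range charts for a continuous quality characteristic,
      subgrouped by batch */
   proc shewhart data=work.process;
      xrchart weight * batch;
   run;

   /* p-chart for the fraction of nonconforming items, assuming a
      constant subgroup sample size of 50 */
   proc shewhart data=work.defects;
      pchart defective * lot / subgroupn=50;
   run;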

Pareto Plot: The Pareto Plot command creates a bar chart (Pareto chart) that displays the severity (frequency) of problems in a quality-related process or operation. Pareto plots compare quality-related measures or counts in a process or operation. The defining characteristic of Pareto plots is that the bars are in descending order of values, which visually emphasizes the most important measures or frequencies. Pareto Plot uses a single y variable, called a process variable, and gives a simple Pareto plot when you do not specify an x (classification) variable, a one-way comparative Pareto plot when you specify a single x variable, and a two-way comparative plot when there are two x variables. The Pareto Plot command does not distinguish between numeric and character variables or between modeling types. All values are treated as discrete, and bars represent either counts or percentages.
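SAS/QC offers the equivalent PROC PARETO, where a CLASS statement yields the comparative plots described above. A minimal sketch with hypothetical data set and variable names:

   /* Simple Pareto chart of defect counts by cause */
   proc pareto data=work.failures;
      vbar cause / scale=count;
   run;

   /* One-way comparative Pareto chart: one cell per shift */
   proc pareto data=work.failures;
      class shift;
      vbar cause / scale=count;
   run;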

Capability: Capability analysis, used in quality control, measures the conformance of a process to given specification limits. Using these limits, you can compare a current process to specific tolerances and maintain consistency in production. Graphical tools such as the goalpost plot and box plot give you quick visual ways of observing within-specification behavior.

Profiler: The Profiler is available for tables with columns whose values are computed from model prediction formulas. Usually, a profiler plot results when you run a Standard Least Squares analysis and request it. However, if you save the prediction equation from the analysis, you can access the prediction profile later from the Graph menu and examine the model using the response column with the saved prediction formula. The prediction profiler displays prediction traces for each x variable. A prediction trace is the predicted response as one variable is changed while the others are held constant at their current values. The prediction profiler is thus a way of changing one variable at a time and examining the effect on the predicted response. The profiler is interactive: as you vary the value of an x variable, it re-computes the low and high values shown on the x-axis for each factor (marking its current value) and the current predicted value of each y variable at the current values of the x variables. Lines and markers within the prediction plots show how the predicted value changes when you change the current value of an individual x variable, and error bars above and below each marker show the 95% confidence interval for the predicted values. Prediction profiles are useful in multiple-response models to help judge which factor values can optimize a complex set of criteria.
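The corresponding Base SAS tool is SAS/QC's PROC CAPABILITY, where the SPEC statement supplies the specification limits and capability indices such as Cp and Cpk are reported. A minimal sketch, assuming a hypothetical variable width with hypothetical limits:

   /* Capability analysis against lower and upper specification limits */
   proc capability data=work.parts;
      spec lsl=9.8 usl=10.2;
      histogram width / normal;   /* histogram with fitted normal curve */
   run;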

Contour Profiler: The Contour Profiler command works in the same way as the Profiler command. It is usually accessed from the Fit Model platform when a model has multiple responses. However, if you save the prediction formulas for the responses, you can access the Contour Profiler at a later time from the Graph menu and specify the columns with the prediction equations as the response columns.

Surface Plot: The Surface Plot command plots surfaces and points in three dimensions based on formulas or data.
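A static three-dimensional surface can be drawn in Base SAS with SAS/GRAPH's PROC G3D (the interactive rotation of JMP's Surface Plot has no direct Base SAS equivalent). A minimal sketch, reusing the hypothetical gridded data from the contour plot example above:

   /* Surface of yield over the (east, north) grid */
   proc g3d data=work.grid;
      plot north*east = yield;
   run;
   quit;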
