
Introduction to Applied Statistical Computing

Math 3210
Dr. Zeng
Department of Mathematics
California State University, Bakersfield

Outline

• Introduction to Applied Statistical Computing
• Course Introduction
• SAS Introduction
• R Introduction
• Difference between SAS and R
• SAS Installation Guide
• R Installation Guide

A Data Driven World

• We live in a data-driven world: data shape everything from daily life to every scientific field (e.g., medical research, DNA microarray data, agriculture, social networks, banking, economics and finance, political polls, etc.).
• Data are more than facts and figures; they are the lifeblood of all business.
• Statisticians help us turn large amounts of data into useful knowledge.
• Statistical programming involves doing computations to aid in statistical analysis.

Statistical Computing

• Statistics is the study of the collection, organization, analysis, and interpretation of data.
• In statistical modeling, it is often necessary to use computer software to handle large data sets and to obtain useful results.
• For example, data must be summarized and displayed; models must be fit to data, and the results displayed.
• In statistical research, computer software is also used for statistical simulation, which is a way to model random events so that simulated outcomes closely match real-world outcomes.
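The simulation idea in the last bullet can be illustrated with a minimal R sketch: simulate rolls of a fair six-sided die and compare the simulated results with the theoretical ones (mean 3.5, each face with probability 1/6). The seed, the number of rolls, and the variable names are illustrative assumptions, not part of the course material.

```r
# Simulate 10,000 rolls of a fair six-sided die (illustrative values).
set.seed(3210)                       # fix the seed so the run is reproducible
n_rolls <- 10000
rolls <- sample(1:6, size = n_rolls, replace = TRUE)

# Compare the simulated results with the theoretical ones.
sim_mean <- mean(rolls)              # theoretical mean is 3.5
sim_freq <- table(rolls) / n_rolls   # theoretical frequency of each face is 1/6

print(sim_mean)
print(round(sim_freq, 3))
```

With more rolls, the simulated mean and frequencies should settle closer to the theoretical values.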

Statistical Software

• Point-and-click interface (menu driven): SPSS
• Command syntax: SAS, R, S-Plus
• Both: JMP
• General: Excel, MATLAB, etc.

Why use Command Line?

• Menu-based interfaces are very convenient when applied to a limited set of commands, from a few to one or two hundred.
• Programming allows you to create a new program that no one has written before.
• Learning how to use one command-line interface will give you some insight into how a menu-driven interface is implemented.

Course Introduction

• This course covers the foundations of two of the most popular statistical programming languages, SAS and R. Specifically, you will learn:

  – SAS data management and exploratory data analysis
  – Elementary R programming and R data management
  – Using SAS for multivariate statistical analysis

• Course Project

  – Explore a research question using a large dataset
  – Use inferential techniques that you learned before or here
  – Introduce and use new SAS steps and R functions

Roadmap of Statistical Courses

• After completing this course, you are encouraged to take the following courses in order to better understand and practice a series of applied statistical modeling methods in SAS and R:

  – Math 4210: Applied Regression Analysis
  – Math 4220: Statistical Design of Experiments

• The courses below also offer theoretical foundations for advanced classes:

  – Math 3200: Probability
  – Math 4200: Mathematical Statistics

EXPLORING SAS

SAS: History

• The roots of SAS (Statistical Analysis System) software reach back to the 1970s at North Carolina State University, where it started out as a software package for statistical analysis, but SAS didn’t stop there.
• By the mid-1980s SAS had already branched out into graphics, online data entry, and compilers for the C programming language.
• In the 1990s the SAS family tree grew to include tools for visualizing data, administering data warehouses, and building interfaces to the World Wide Web.
• In the new century, SAS has continued to grow with products designed for cleansing messy data, discovering and developing drugs, and detecting money laundering.

SAS: Overview

• Major statistical software in many industries
• Multiple add-ons and extensions available, including integration of the SQL programming language and integration with JMP
• Extensive online help manuals and forums
• Used by many statisticians and computer scientists for data mining, data analysis, and development of statistical methodology
• Not a case-sensitive language
• Offers various certifications, which many employers value highly
• Common fields:
  – Statistical science
  – Sociology
  – Manufacturing
  – Pharmaceutical science
  – Agriculture
  – Computer science
  – Quantitative finance
  – Engineering

Who uses SAS?

SAS is used at more than 75,000 sites in 135 countries, including 93 of the top 100 companies on the 2014 Fortune Global 500® list.

• Bank of America
• Google
• Twitter
• Netflix
• DIRECTV
• US Census Bureau
• USDA National Agricultural Statistics Service
• HP
• Kelley Blue Book
• Many more banks and IT companies
• SAS customer stories: https://www.sas.com/en_us/customers.html

SAS: pros and cons

Pros:
• Widely used in both industry and academia
• High-performance architecture that supports computationally-intensive algorithms
• Flexible and customizable analyses and graphics
• Great for:
  – Data manipulation, editing, and coding
  – Data mining
  – Graphical analysis
  – Data summary
  – Exploratory analysis
  – Simulations
  – Forecasting
  – Survival analysis
  – Linear and nonlinear modeling
  – Quality assessment and improvement
  – Data warehousing
  – Multivariate analysis
  – Nonparametric methods
  – Hypothesis testing
  – Categorical analysis
  – Time series analysis
  – Sample size calculation/power analysis
  – Design of experiments
  – Optimization

Cons:
• Scripting programming language
• Expensive
• Some versions are not 100% compatible
• Not as useful for:
  – Simple analysis and manipulation

SAS: usage

• Data can be read in through a command or imported through menu-driven prompts
• Variables and functions can be created and renamed
• Multiple data sets can be handled at once and are stored in various workspaces (“libraries”)
• Four types of commands: DATA step (read & edit data); Procedure steps (run built-in functions); macros (create and run own function); ODS statements (set output settings, styles, etc.)
• Editor window is used to write and save commands
• Log window reads commands and displays any errors or comments
• Output window displays some output created by commands
• Results viewer window displays most output, including graphs
• Can save only commands, only data, or whole project

EXPLORING R

R: History

• R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which John Chambers is a member.
• In the 1980s, a newly developed statistical programming package named S-Plus was in widespread use among statisticians of all kinds.
• Ross Ihaka and Robert Gentleman chose to write a reduced version of S-Plus for teaching purposes, and what was more natural than choosing the immediately preceding letter? Their initials may also have played a role.
• The project was conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000.

R: Overview

• Free, open-source software; similar to S-Plus
• Multiple add-ons and extensions available, including integration with LaTeX (a document preparation system) via RStudio, and with Excel via RExcel
• Extensive online help manuals and forums
• Used by many statisticians and computer scientists for data mining, data analysis, and development of statistical methodology
• Case-sensitive language
• Common fields:
  – Statistical science
  – Computational biology
  – Computer science
  – Quantitative finance
  – Engineering

Who uses R?

R is used by more than 2 million data scientists and statisticians worldwide.

• Facebook uses R to understand how its users interact with the service (data visualization)
• The New York Times uses R as the basis for interactive data analysis features that forecast upcoming elections
• Oracle has its own enterprise version of R

R: usage

• Data can be read in through code or created
• Variables and functions can be created and renamed
• Multiple data sets can be handled at once
• Editor window is used to write and save commands
• Console window reads commands and displays output, which is best saved by copying and pasting into a word processing document
• Graphs are output in a separate window, which is overwritten for each new graph unless otherwise indicated in commands
• Workspaces can be saved, meaning data sets and variables do not need to be recreated (especially useful if data creation and manipulation take a long time to run)
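A few of the usage points above can be sketched in R as follows: creating a data set in code, defining a variable and a function, sending a graph to a file so that it is not overwritten, and saving and restoring workspace objects. The data values, the helper function `center`, and the file names are all made up for illustration; files are written to a temporary folder.

```r
# Create a small data set in code (values are made up for illustration).
scores <- data.frame(
  student = c("A", "B", "C", "D"),
  exam    = c(82, 91, 77, 88)
)

# Create a new variable and a user-defined function.
scores$passed <- scores$exam >= 80
center <- function(x) x - mean(x)       # hypothetical helper: subtract the mean
scores$centered <- center(scores$exam)

# Send a graph to a PDF file instead of the plot window.
plot_file <- file.path(tempdir(), "exam_scores.pdf")
pdf(plot_file)
barplot(scores$exam, names.arg = scores$student, ylab = "Exam score")
dev.off()

# Save selected workspace objects, remove them, and load them back.
ws_file <- file.path(tempdir(), "scores.RData")
save(scores, center, file = ws_file)
rm(scores, center)
load(ws_file)
print(scores)
```

After `load()`, the data frame and the function are available again without being recreated.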

R: pros and cons

Pros:
• Widely used in both industry and academia
• Flexible and customizable analyses and graphics
• Great for:
  – Data manipulation, editing, and coding
  – Data mining
  – Simulations
  – Survival analysis
  – Linear and nonlinear modeling
  – Data warehousing
  – Multivariate analysis
  – Nonparametric methods
  – Hypothesis testing
  – Categorical analysis
  – Time series analysis
  – Sample size calculation/power analysis
  – Optimization

Cons:
• Scripting programming language
• Mediocre graphics
• Not as useful for:
  – Graphical analysis
  – Data summary
  – Exploratory analysis
  – Quality assessment and improvement
  – Design of experiments

Installation: SAS and SAS Studio

SAS Enterprise licenses are very expensive, but there are two ways to access SAS for free:

1. SAS OnDemand for Academics: a cloud-based system. Free for academics.
2. SAS Studio: a web-based system. Create and interact with SAS anywhere, anytime for free.

SAS OnDemand for Academics is available in the following locations on campus:

1. Science III Room 239
2. WSL Lab 16
3. Science III Math Major Study Room

Instructions for free access to SAS Studio and SAS Enterprise Guide:

1. Register for SAS OnDemand for Academics. Create your account by following these instructions: http://support.sas.com/software/products/ondemand-academics/manuals/EnterpriseGuideStudent.pdf
2. Use the course enrollment link to enroll in this course (you will first need to enter the login information created in step 1): https://odamid.oda.sas.com/SASODAControlCenter/enroll.html?enroll=936defbf-9458-4f4d-80f2-147d3b2f618f
3. Download SAS Enterprise Guide to your PC.
4. Click on SAS Studio to open the SAS Studio web page and use it immediately!

[Screenshot: SAS OnDemand for Academics]

[Screenshot: SAS Studio]

Installation: R and RStudio

R versus RStudio:

• R is a statistical programming language.
• RStudio is an IDE (Integrated Development Environment) for using R.
• An IDE is a software application that provides comprehensive facilities to computer programmers for software development. An IDE normally consists of a source code editor, build automation tools, and a debugger.
• RStudio makes R easier to use. It includes a code editor and debugging and visualization tools.
• RStudio cannot be used unless R is installed first.
• In this class, we will generally use RStudio to run R code.

Instructions for installing R:

1. R can be downloaded from https://cloud.r-project.org/
2. Download R for Windows, Linux, or Mac.

Instructions for installing RStudio:

1. Make sure that R is already properly installed.
2. Download the “Open Source Edition” of “RStudio Desktop” from the following website and follow the instructions to install it on your computer: https://www.rstudio.com/products/rstudio/download/#download
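Once both installations are finished, one simple way to confirm that R itself is working is to run a few base R commands in the RStudio console (or in plain R). This is only a suggested check, not part of the official installation instructions; the variable name `r_version` is an illustrative choice.

```r
# Print the version of R that is currently running.
print(R.version.string)

# getRversion() returns the version as an object that can be compared.
r_version <- getRversion()
print(r_version)

# List the folders where add-on packages are installed.
print(.libPaths())
```

If these commands print a version string and at least one library folder, R is installed and RStudio has found it.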

[Screenshot: R]

[Screenshot: RStudio]

Summary: SAS versus R

• Availability / price (industry vs. research)
• Package flexibility (R adopts the latest statistical methods much more quickly than SAS)
• Customer care support and community
• Data handling (SAS excels at handling very large data sets, e.g., terabytes of data)
• Data management (SAS processes data row by row, whereas R is inherently a vector-based language)
• R is case-sensitive, but SAS is not
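The last two points can be seen from the R side in a short sketch. The first part contrasts a vectorized expression with an explicit element-by-element loop; the loop only loosely mimics the idea of row-by-row processing and is not meant to show how SAS works internally. The second part shows that names differing only in case refer to different objects in R. All values and names are illustrative assumptions.

```r
x <- c(1.5, 2.0, 3.5, 4.0)

# Vectorized: one expression operates on the whole vector at once.
y_vec <- x * 2 + 1

# Equivalent explicit loop, processing one element at a time.
y_loop <- numeric(length(x))
for (i in seq_along(x)) {
  y_loop[i] <- x[i] * 2 + 1
}
print(identical(y_vec, y_loop))

# Case sensitivity: total and Total are two different objects in R.
total <- sum(x)
Total <- "a different object"
print(total)
print(Total)
```

The vectorized form is the idiomatic style in R and is usually shorter and faster than the loop.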

Useful Study Resources

• SAS support by category tutorial videos: http://support.sas.com/training/tutorial/#s1=1

• SAS 9.3 User’s Guide: https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_intro_sect001.htm

• UCLA SAS and R online learning material:
  – https://stats.idre.ucla.edu/sas/
  – https://stats.idre.ucla.edu/r/

• PennState Intro to SAS and R:
  – https://onlinecourses.science.psu.edu/statprogram/sas
  – https://onlinecourses.science.psu.edu/statprogram/r