
Introduction to Applied Statistical Computing
Math 3210
Dr. Zeng
Department of Mathematics
California State University, Bakersfield

Outline
• Introduction to Applied Statistical Computing
• Course Introduction
• SAS Introduction
• R Introduction
• Difference between SAS and R
• SAS Installation Guide
• R Installation Guide

A Data Driven World
• A data driven world: from daily life to every scientific field (e.g. medical research, DNA microarray data, agriculture, social networks, banking, economics & finance, political polls, etc.).
• Data are more than facts and figures; they are the lifeblood of every business.
• Statisticians help us turn large amounts of data into useful knowledge.
• Statistical programming involves doing computations to aid in statistical analysis.

Statistical Computing
• Statistics is the study of the collection, organization, analysis, and interpretation of data.
• In statistical modeling, it is often necessary to use computer software to handle large data sets and to obtain useful results.
• For example, data must be summarized and displayed. Models must be fit to data, and the results displayed.
• In statistical research, computer software is also used for statistical simulation, which is a way to model random events so that simulated outcomes closely match real-world outcomes.

Statistical Software
• Point-and-click interface (menu driven): Minitab, SPSS
• Command syntax: SAS, R, S-Plus
• Both: JMP, Stata
• General: Excel, MATLAB, etc.

Why use Command Line?
• Menu-based interfaces are very convenient when applied to a limited set of commands, from a few to one or two hundred.
• Programming lets you write a new program for an analysis that no one has implemented before.
• Learning how to use one command-line interface will give you some insight into how a menu-driven interface is implemented.

Course Introduction
• This course covers the foundations of two of the most popular statistical programming languages, SAS and R.
Specifically, you will learn:
– SAS data management and exploratory data analysis
– Elementary R programming and R data management
– Using SAS for multivariate statistical analysis
• Course Project
– Explore a research question using a large dataset
– Use inferential techniques that you learned before or in this course
– Introduce and use new SAS steps and R functions

Roadmap of Statistical Courses
• After completing this course, you are encouraged to take the following courses to better understand and practice a series of applied statistical modeling methods in SAS and R:
– Math 4210: Applied Regression Analysis
– Math 4220: Statistical Design of Experiments
• The courses below also offer theoretical foundations for advanced classes:
– Math 3200: Probability
– Math 4200: Mathematical Statistics

EXPLORING SAS

SAS: History
• The roots of SAS (Statistical Analysis System) software reach back to the 1970s at North Carolina State University, where it started out as a software package for statistical analysis, but SAS didn’t stop there.
• By the mid-1980s SAS had already branched out into graphics, online data entry, and compilers for the C programming language.
• In the 1990s the SAS family tree grew to include tools for visualizing data, administering data warehouses, and building interfaces to the World Wide Web.
• In the new century, SAS has continued to grow with products designed for cleansing messy data, discovering and developing drugs, and detecting money laundering.
SAS: Overview
• Major statistical software in many industries
• Multiple add-ons and extensions available, including integration of the SQL programming language and integration with JMP
• Extensive online help manuals and forums
• Used by many statisticians and computer scientists for data mining, data analysis, and development of statistical methodology
• Language is not case-sensitive
• Offers various certifications, which many employers value highly
• Common fields:
– Statistical science
– Sociology
– Manufacturing
– Pharmaceutical science
– Agriculture
– Computer science
– Quantitative finance
– Engineering

Who uses SAS?
SAS is used at more than 75,000 sites in 135 countries, including 93 of the top 100 companies on the 2014 Fortune Global 500® list.
• Bank of America
• Google
• Twitter
• Netflix
• DIRECTV
• US Census Bureau
• USDA National Agricultural Statistics Service
• HP
• Kelley Blue Book
• Many more banks and IT companies
• SAS customer stories: https://www.sas.com/en_us/customers.html

SAS: pros and cons
Pros:
• Widely used in both industry and academia
• High-performance architecture that supports computationally-intensive algorithms
• Flexible and customizable analyses and graphics
• Great for:
– Data manipulation, editing, and coding
– Data mining
– Data warehousing
– Multivariate analysis
– Nonparametric methods
– Hypothesis testing
– Categorical analysis
– Time series analysis
– Sample size calculation/power analysis
– Design of experiments
– Optimization
– Graphical analysis
– Data summary
– Exploratory analysis
– Simulations
– Forecasting
– Survival analysis
– Linear and nonlinear modeling
– Quality assessment and improvement
Cons:
• Scripting programming language
• Expensive
• Some versions are not 100% compatible
• Not as useful for:
– Simple analysis and manipulation

SAS: usage
• Data can be read in through a command or imported through menu-driven prompts
• Variables and functions can be created and renamed
• Multiple data sets can be handled at once and are stored in various workspaces (“libraries”)
• Four types of commands: DATA step (read & edit data); procedure steps (run built-in functions); macros (create and run your own functions); ODS statements (set output settings, styles, etc.)
• Editor window is used to write and save commands
• Log window reads commands and displays any errors or comments
• Output window displays some output created by commands
• Results viewer window displays most output, including graphs
• Can save only commands, only data, or the whole project

EXPLORING R

R: History
• R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which Chambers is a member.
• In the 1980s, a newly developed statistical programming package named S-Plus was in widespread use among statisticians of all kinds.
• Ross Ihaka and Robert Gentleman chose to write a reduced version of S-Plus for teaching purposes, and what was more natural than choosing the immediately preceding letter? Their initials may also have played a role.
• The project was conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000.

R: Overview
• Free, open-source software; similar to S-Plus
• Multiple add-ons and extensions available, including integration with LaTeX (a document typesetting system) via RStudio, and with Excel via RExcel
• Extensive online help manuals and forums
• Used by many statisticians and computer scientists for data mining, data analysis, and development of statistical methodology
• Case-sensitive language
• Common fields:
– Statistical science
– Computational biology
– Computer science
– Quantitative finance
– Engineering

Who uses R?
R is used by more than 2 million data scientists and statisticians worldwide.
• Facebook uses R to understand how its users interact with the service (data visualization)
• The New York Times uses R as the basis for interactive data analysis features that forecast upcoming elections
• Oracle has its own enterprise version of R

R: usage
• Data can be read in through code or created
• Variables and functions can be created and renamed
• Multiple data sets can be handled at once
• Editor window is used to write and save commands
• Console window reads commands and displays output, which is best saved by copying and pasting into a word processing document
• Graphs are displayed in a separate window, which is overwritten by each new graph unless the commands indicate otherwise
• Workspaces can be saved, meaning data sets and variables do not need to be recreated (especially useful if data creation and manipulation take a long time to run)

R: pros and cons
Pros:
• Widely used in both industry and academia
• Flexible and customizable analyses and graphics
• Great for:
– Data manipulation, editing, and coding
– Data mining
– Simulations
– Survival analysis
– Linear and nonlinear modeling
– Design of experiments
– Data warehousing
– Multivariate analysis
– Nonparametric methods
– Hypothesis testing
– Categorical analysis
– Time series analysis
– Sample size calculation/power analysis
– Optimization
Cons:
• Scripting programming language
• Mediocre graphics
• Not as useful for:
– Graphical analysis
– Data summary
– Exploratory analysis
– Quality assessment and improvement

Installation: SAS and SAS Studio
SAS Enterprise licenses are very expensive, but there are two ways to access SAS for free:
1. SAS OnDemand for Academics: a cloud-based system. Free for academics.
2. SAS Studio: a web-based system. Create and interact with SAS anywhere, anytime for free.
SAS OnDemand for Academics is available in the following locations on campus:
1. Science III Room 239
2. WSL Lab 16
3. Science III Math Major Study Room

Instructions for free access to SAS Studio and SAS Enterprise Guide:
1. Register for SAS OnDemand for Academics. Create your account by following these instructions: http://support.sas.com/software/products/ondemand-academics/manuals/EnterpriseGuideStudent.pdf
2. Use the course enrollment link to enroll in this course (you must first log in with the account information created in step 1): https://odamid.oda.sas.com/SASODAControlCenter/enroll.ht