STATS 220 Data Technologies

Introduction to Data Technologies

Introduction

Motivations for the course:

• Statisticians do a lot more than just data analysis.
• Statisticians have to use computers
  – Computers can improve accuracy.
  – Computers can improve efficiency.

Main aims of the course:

• Learn about the use of computers in acquiring, storing, accessing, and processing statistical data.
• Learn some specific computer tools/languages.


Introduction

Some things that statisticians do:

• Collect data
  – Data entry
    ∗ Electronic forms (HTML)
  – Data storage
    ∗ Data formats (XML, Databases)
  – Data Retrieval
    ∗ Accessing Data Sets (SQL)
• Transform and Clean Data
  – Convert between formats
  – Recode values
  – Detect unusual values
  – Transform variables
  – Summarise variables
    ∗ Scripting (PHP, R, Regular Expressions)


Introduction

Some things that statisticians do:

• Analyse Data (R)
• Report results
  – Automation of tables, plots, reports
    ∗ Scripting (PHP, R)
  – Publish results
    ∗ Web (HTML)
  – Export results
    ∗ Data formats (XML, Databases)

Course Information

Lecturer     Paul Murrell
             rm 303 273
             ext 85392
             [email protected]
Lectures     Tuesday and Thursday 10:00-11:00
             BLT204 (Biology Building Room 204)
Labs         Friday 10:00-11:00 (A-L) OR 11:00-12:00 (M-Z)
             Basement Teaching Lab (rm 303 B75)
Assessment   Exam 60% + Test 20% + Labs & Assignments 20%
             or Exam 80% + Labs & Assignments 20%
             (Require 50% overall and 45% in final exam)
Textbook     none
Reading      (i) PHP3: programming browser-based applications
             (ii) An Introduction to R
Homepage     stat18.stat.auckland.ac.nz/stats220/2004/


Computer Lab Hours

Basement Computer Lab

Monday      9:00am - 8:00pm
Tuesday     9:00am - 8:00pm
Wednesday   9:00am - 8:00pm
Thursday    9:00am - 8:00pm
Friday      9:00am - 5:00pm
Saturday    CLOSED
Sunday      CLOSED

Information Commons

Early till late virtually every day!

[Map: Basement of Maths/Physics Building. Labels: Resource Centre, PLT1, PLT3, PLT4, Lifts, Tutorial Rooms, Hand-In Boxes, Mathematics Assistance Room, Statistics Assistance, Staff Room, rooms B07 to B11 and B25, Teaching Lab, Maths/Statistics Computer Lab, entrance from carpark, Carpark, Princes Street, Wellesley Street.]

[Map: Ground floor of Maths/Physics Building. Labels: PLT1, MLT1, alternative / afterhours access to lab, Faculty of Science Student Centre, Overbridge, Staff Carpark, Computer Laboratories, Princes Street, Wellesley Street.]

Assessment

Labs         After each lab, you will electronically submit work based on the lab. This work will be marked and, in total, the lab marks will count 5% toward your final mark. There will be 10 labs.

Assignments  There will be 3 assignments. Assignment 1 will be due during week 4 and will be worth 4% of your final mark, assignment 2 will be due in week 7 and will be worth 4% also, and assignment 3 will be due in week 12 and will be worth 7%.

Term Test    This will be held in class on Thursday April 8 and will be worth either 0% or 20%.

Exam         This will be worth 60% or 80% of your final mark.
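The weightings above add up as 5% (labs) + 4% + 4% + 7% (assignments) = 20%, with the remaining 80% split between the term test and the exam. The following Python sketch works through that arithmetic. It assumes that the more favourable of the two schemes is used, which the slides imply ("either 0% or 20%") but do not state outright; the function names and the idea of passing each component as a percentage are illustrative only.

```python
def final_mark(labs, assignments, test, exam):
    """Combine component marks, each given as a percentage (0-100).

    labs        -- overall lab mark (worth 5% in total)
    assignments -- (a1, a2, a3), worth 4%, 4% and 7% respectively
    test        -- term test mark (worth 20% or 0%)
    exam        -- final exam mark (worth 60% or 80%)
    """
    a1, a2, a3 = assignments
    # Labs and assignments together make up 20% of the final mark.
    coursework = 0.05 * labs + 0.04 * a1 + 0.04 * a2 + 0.07 * a3
    with_test = coursework + 0.20 * test + 0.60 * exam
    without_test = coursework + 0.80 * exam
    # Assumption: the scheme giving the higher mark is the one applied.
    return max(with_test, without_test)


def passes(labs, assignments, test, exam):
    """Apply the stated requirement: 50% overall and 45% in the final exam."""
    return final_mark(labs, assignments, test, exam) >= 50 and exam >= 45
```

For example, `final_mark(60, (60, 60, 60), 40, 60)` gives 56 with the test counted and 60 without it, so under the assumption above the result is 60.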


Course Overview

Week  Event      Topic
1     lecture    Introduction
      lecture    Writing Computer Code
      Lab        Writing Computer Code
2     lecture    Data Collection
      lecture    HTML Forms
      Lab        HTML Forms
3     lecture    Data Storage
      lecture    XML
      Lab        XML
4     lecture    Databases
      lecture    Database Design
      Lab        Database Design
5     lecture    SQL (DDL)
      lecture    SQL (DML)
      Lab        SQL
6     lecture    SQL (DML)
      TERM TEST
      No Lab


Course Overview

Week  Event       Topic
7     lecture     Scripting
      lecture     PHP
      Lab         PHP
8     lecture     PHP
      lecture     Regular Expressions
      Lab         Regular Expressions
9     lecture     HTML Forms and PHP
      lecture     PHP and SQL
      Lab         HTML Forms, PHP, and SQL
10    No Lecture
      No Lecture
      No Lab
11    lecture     R
      lecture     R
      Lab         R
12    lecture     R and Databases
      lecture     R Graphics
      Lab         R Graphics


Course Resources

Virtually all that you need for this course will be provided as handouts during lectures and/or on the STATS 220 web site: stat18.stat.auckland.ac.nz/stats220/2004/

All you will need is a web browser (e.g., Internet Explorer) and internet access.

In order to complete labs and assignments you will need to make use of the reference material on the 220 web site in addition to the material in the lecture notes.


Course Resources

We will also make use of a free text editor called Crimson Editor in Labs.

http://www.crimsoneditor.com/

This will be available in the computer labs in the Maths/Physics building; for working at home, you can either install it yourself, or just use Notepad or any other editor you like.

Be careful if using a word processor like Microsoft Word that you save files in “plain text” format.

WARNING

This course involves using a lot of computer hardware and software.

Most of this hardware and software is provided by the University of Auckland for your use during this course.

DO NOT use the hardware or software for any other purpose.

See the University of Auckland Guidelines for the Use of University Computing Facilities and Services

http://www.auckland.ac.nz/cir resources/index.cfm?action=display page&page title=resource computer guidelines


Cricinfo Example

In test matches, are batsmen more likely to get out just before and/or just after reaching a century (100 runs)?

The CricInfo website (www.cricinfo.com) has records for all cricket test matches ever played.

Cricinfo Example

Scorecards from CricInfo - www.cricinfo.com

Test # 1600
New Zealand in , 2002, 1st Test
Pakistan v New Zealand
Gaddafi Stadium,
1,2,3 May 2002 (5-day match)

Result: Pakistan won by an innings and 324 runs
Series: Pakistan leads the 2-Test series 1-0
Toss: Pakistan
Umpires: SA Bucknor (WI) and RE Koertzen (SA)
TV : Aleem Dar
Match Referee: MJ Procter (SA)
Test Debut: RG Hart (NZ).
Man of the Match: Inzamam-ul-Haq

Close of Play:
Day 1: Pakistan 355/4 (Inzamam-ul-Haq 159*; 89.5 overs)
Day 2: Pakistan 643, New Zealand 58/6 (Hart 2*, Vettori 0*; 20 overs)

Pakistan 1st innings                            R    M    B   4  6
Imran Nazir       c Richardson b McMillan     127  291  203  18  3
                  c Hart b Tuffey               0    2    1   0  0
                  c Fleming b Vettori          27   55   41   6  0
Inzamam-ul-Haq    c Tuffey b Walker           329  579  436  38  9
Yousuf Youhana    c Fleming b Martin           29   76   47   4  0
Abdur Razzaq      lbw b Tuffey                 25   44   31   6  0
+Rashid Latif     c & b Harris                  7   27   27   1  0
                  b McMillan                   30  104   87   3  1
*                 c & b McMillan               10   19   24   2  0
                  st Hart b Walker             37   63   55   4  2
                  not out                       4    8    3   1  0
Extras            (b 1, lb 8, w 1, nb 8)       18
Total             (all out, 157.5 overs, 638 mins)  643

Cricinfo Example

First, I tried entering scores by hand:

1. Use Netscape to interactively browse to a test match page.
2. Open a text editor.
3. The data are not in a convenient format to cut-and-paste so I retyped some of the information:
   • the country the batsman played for.
   • the batsman's position in the order.
   • the year the test was played.
   • the batsman's score.

Cricinfo Example

Here’s a small part of the file I created:

 1 a 1877 165
 2 a 1877 1
 3 a 1877 12
 4 a 1877 1
 5 a 1877 15
 6 a 1877 5
 7 a 1877 0
 8 a 1877 17
 9 a 1877 18
10 a 1877 3
11 a 1877 0
 1 e 1877 63
 2 e 1877 7
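A file in this layout is easy to read back into a program. The following Python sketch parses a few of the lines shown above. It assumes that the four whitespace-separated columns are, in order, batting position, a one-letter country code, year, and score; that reading is inferred from the list of retyped items rather than stated on the slide, and the names `Score` and `parse_scores` are illustrative.

```python
from collections import namedtuple

# Assumed column order: position, country code, year, runs scored.
Score = namedtuple("Score", ["position", "country", "year", "runs"])


def parse_scores(text):
    """Turn whitespace-separated lines into a list of Score records."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        position, country, year, runs = fields
        records.append(Score(int(position), country, int(year), int(runs)))
    return records


# A few lines copied from the hand-entered file shown above.
sample = """\
1 a 1877 165
2 a 1877 1
3 a 1877 12
1 e 1877 63
2 e 1877 7
"""

records = parse_scores(sample)
print(records[0])
```

Rejecting lines that do not have exactly four fields is a cheap check for the kind of typing slip that hand entry invites.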


Cricinfo Example

After mucking about for a year and a half, I managed to enter 2888 scores.

Because the task was so laborious, I did not:

• record whether the batsman was “not out”
• double-enter the data to check I had not made mistakes.
• record all of the information available (such as the batsman’s name, the country the batsman played against, the team’s total score for that innings, ...).

Cricinfo Example

Some years later, I returned to the problem armed with some computational tools.

1. I wrote a Java program to automatically download files from the CricInfo site. (I could have used PHP or R)
2. I wrote awk scripts containing regular expressions and shell scripts to automatically detect the files that contained test matches and download those files (using the Java program). (I could have used PHP or R)


Cricinfo Example

3. I wrote awk scripts containing regular expressions to extract the test match results from the HTML file and process the results to record the following information:
   • the country the batsman played for.
   • the batsman’s position in the order.
   • the year the test was played.
   • the batsman’s score.
   • the country the batsman played against.
   • the total score for the innings.
   • the batsman’s name.
   • how the batsman got out.

Cricinfo Example

After programming for a couple of days, I was able to type the following commands on a computer and enter 21829 scores.

saveDecade 1960S
saveDecade 1970S
saveDecade 1980S
processCricInfo

Because the task was automated:

• there is much less chance of error.
• I could record ALL of the information available for almost exactly the same cost as recording only some of the information.
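The extraction in step 3 can be illustrated with a small sketch. The original work used awk scripts on the HTML files; the version below is not that code. It uses Python's `re` module on plain-text batting lines copied from the scorecard shown earlier, and it assumes each line has the form name, dismissal, then five numbers (the R, M, B, 4, 6 columns, read here as runs, minutes, balls, fours, and sixes). The pattern, the field names, and the function `parse_batting_line` are illustrative, and lines with a missing name are not handled.

```python
import re

# name, then how the batsman got out, then the five numeric columns.
BATTING_LINE = re.compile(
    r"^(?P<name>.+?)\s+"
    r"(?P<how_out>(?:c|b|lbw|st|run out|not out)\b.*?)\s+"
    r"(?P<runs>\d+)\s+(?P<mins>\d+)\s+(?P<balls>\d+)\s+"
    r"(?P<fours>\d+)\s+(?P<sixes>\d+)$"
)


def parse_batting_line(line):
    """Return a dict of fields for one batting line, or None if no match."""
    match = BATTING_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Drop the leading marker characters ("+" or "*") seen on some names.
    record["name"] = record["name"].lstrip("+*").strip()
    for key in ("runs", "mins", "balls", "fours", "sixes"):
        record[key] = int(record[key])
    return record


lines = [
    "Inzamam-ul-Haq c Tuffey b Walker 329 579 436 38 9",
    "Abdur Razzaq lbw b Tuffey 25 44 31 6 0",
    "+Rashid Latif c & b Harris 7 27 27 1 0",
    "Extras (b 1, lb 8, w 1, nb 8) 18",
]
parsed = [parse_batting_line(line) for line in lines]
for record in parsed:
    print(record)
```

The "Extras" line does not fit the pattern, so it comes back as `None`; a real script would need separate patterns for extras and totals.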


Cricinfo Example

                  By Hand        Automated
Set Up Cost       0sec           Several Days
Data Entry Cost   Months         10sec to enter the commands
                                 About 30min for the programs to run
Result            Frustration    Elation
                  Boredom        Pride
                  Arthritis      Useful skills
                                 A full hard disk

[Figure: Histogram of Test Scores (1960s, 70s, and 80s). x-axis: Score (0 to 300); y-axis: Frequency (0 to 2500).]
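Once the scores are in a program, tabulating them for a histogram takes only a few lines. The Python sketch below groups scores into equal-width bins and prints a rough text histogram. It is not the code used to draw the figure; the function names and the bin width of 50 are illustrative choices, and the input is just the eleven scores from the Pakistan 1st innings scorecard shown earlier.

```python
from collections import Counter


def tabulate_scores(scores, bin_width=50):
    """Count scores per bin; keys are the lower edge of each bin."""
    counts = Counter((score // bin_width) * bin_width for score in scores)
    return dict(sorted(counts.items()))


def print_histogram(scores, bin_width=50):
    """Print one row of '#' marks per non-empty bin."""
    for start, count in tabulate_scores(scores, bin_width).items():
        end = start + bin_width - 1
        print(f"{start:3d}-{end:3d}: {'#' * count}")


# Scores from the Pakistan 1st innings scorecard shown earlier.
innings = [127, 0, 27, 329, 29, 25, 7, 30, 10, 37, 4]
table = tabulate_scores(innings)
print_histogram(innings)
```

With a narrower bin width (say 10) and the full set of scores, the same tabulation could be used to compare the counts just below and just above 100.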
