Introduction to Data Technologies
STATS 220 Data Technologies

Introduction

Motivations for the course:
• Statisticians do a lot more than just data analysis.
• Statisticians have to use computers
  – Computers can improve accuracy.
  – Computers can improve efficiency.

Main aims of the course:
• Learn about the use of computers in acquiring, storing, accessing, and processing statistical data.
• Learn some specific computer tools/languages.

Some things that statisticians do:
• Collect data
  – Data entry
    ∗ Electronic forms (HTML)
  – Data storage
    ∗ Data formats (XML, Databases)
  – Data Retrieval
    ∗ Accessing Data Sets (SQL)
• Transform and Clean Data
  – Convert between formats
  – Recode values
  – Detect unusual values
  – Transform variables
  – Summarise variables
    ∗ Scripting (PHP, R, Regular Expressions)
• Analyse Data (R)
• Report results
  – Automation of tables, plots, reports
    ∗ Scripting (PHP, R)
  – Publish results
    ∗ Web delivery (HTML)
  – Export results
    ∗ Data formats (XML, Databases)

Course Information

Lecturer    Paul Murrell, rm 303 273, ext 85392, [email protected]
Lectures    Tuesday and Thursday 10:00-11:00, BLT204 (Biology Building Room 204)
Labs        Friday 10:00-11:00 (A-L) OR 11:00-12:00 (M-Z), Basement Teaching Lab (rm 303 B75)
Assessment  Exam 60% + Test 20% + Labs & Assignments 20%
            or Exam 80% + Labs & Assignments 20%
            (Require 50% overall and 45% in final exam)
Textbook    none
Reading     (i) PHP3: programming browser-based applications
            (ii) An Introduction to R
Homepage    stat18.stat.auckland.ac.nz/stats220/2004/

Computer Lab Hours

Basement Computer Lab
---------------------------
Monday      9:00am - 8:00pm
Tuesday     9:00am - 8:00pm
Wednesday   9:00am - 8:00pm
Thursday    9:00am - 8:00pm
Friday      9:00am - 5:00pm
Saturday    CLOSED
Sunday      CLOSED
---------------------------
Information Commons
Early till late virtually every day!

[Map: basement of the Maths/Physics Building, showing the Resource Centre, PLT1, PLT3, PLT4, lifts, tutorial rooms, hand-in boxes, the Mathematics and Statistics assistance rooms, rooms B07-B11 and B25, the staff room, the Teaching Lab, the Maths/Statistics Computer Lab, the entrance from the carpark, Princes Street, and Wellesley Street.]

[Map: ground floor of the Maths/Physics Building, showing PLT1, MLT1, the alternative/after-hours access to the lab, the Faculty of Science Student Centre, the staff carpark, the Computer Laboratories, the overbridge, Princes Street, and Wellesley Street.]

Assessment

Labs         After each lab, you will electronically submit work based on the lab. This work will be marked and, in total, the lab marks will count 5% toward your final mark. There will be 10 labs.
Assignments  There will be 3 assignments. Assignment 1 will be due during week 4 and will be worth 4% of your final mark, assignment 2 will be due in week 7 and will be worth 4% also, and assignment 3 will be due in week 12 and will be worth 7%.
Term Test    This will be held in class on Thursday April 8 and will be worth either 0% or 20%.
Exam         This will be worth 60% or 80% of your final mark.
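The assessment weights above can be checked with a little arithmetic. The following is a minimal sketch in Python, not an official marking tool: the function names are hypothetical, all marks are assumed to be percentages from 0 to 100, and the slides do not say how the choice between the two schemes is made, so the sketch simply reports both.

```python
# Weights (percent of final mark) as stated on the Assessment slide.
LAB_WEIGHT = 5
ASSIGNMENT_WEIGHTS = (4, 4, 7)
# Labs plus assignments should make up the stated 20%.
COURSEWORK_WEIGHT = LAB_WEIGHT + sum(ASSIGNMENT_WEIGHTS)


def coursework_mark(labs, assignments):
    """Coursework contribution (0-20) from percentage marks (0-100).

    `labs` is the overall lab percentage; `assignments` holds the three
    assignment percentages in order. (Hypothetical helper.)
    """
    total = LAB_WEIGHT * labs / 100
    for weight, mark in zip(ASSIGNMENT_WEIGHTS, assignments):
        total += weight * mark / 100
    return total


def final_marks(labs, assignments, test, exam):
    """Final mark under each of the two stated schemes.

    Returns (with_test, exam_only):
      with_test = coursework + 20% test + 60% exam
      exam_only = coursework + 80% exam
    """
    cw = coursework_mark(labs, assignments)
    with_test = cw + 0.20 * test + 0.60 * exam
    exam_only = cw + 0.80 * exam
    return with_test, exam_only


def meets_requirements(final, exam):
    """Stated requirement: 50% overall and 45% in the final exam."""
    return final >= 50 and exam >= 45
```

For example, `final_marks(80, (70, 60, 50), 40, 60)` gives the final mark under each scheme for a student with 80% in labs, 70%, 60% and 50% in the assignments, 40% in the test and 60% in the exam. Taking the higher of the two results would be an assumption, since the slides only say the test is worth "either 0% or 20%".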
Course Overview

Week  Event    Topic
1     lecture  Introduction
      lecture  Writing Computer Code
      Lab      Writing Computer Code
2     lecture  Data Collection
      lecture  HTML Forms
      Lab      HTML Forms
3     lecture  Data Storage
      lecture  XML
      Lab      XML
4     lecture  Databases
      lecture  Database Design
      Lab      Database Design
5     lecture  SQL (DDL)
      lecture  SQL (DML)
      Lab      SQL
6     lecture  SQL (DML)
      TERM TEST
      No Lab
7     lecture  Scripting
      lecture  PHP
      Lab      PHP
8     lecture  PHP
      lecture  Regular Expressions
      Lab      Regular Expressions
9     lecture  HTML Forms and PHP
      lecture  PHP and SQL
      Lab      HTML Forms, PHP, and SQL
10    No Lecture
      No Lecture
      No Lab
11    lecture  R
      lecture  R
      Lab      R
12    lecture  R and Databases
      lecture  R Graphics
      Lab      R Graphics

Course Resources

Virtually all that you need for this course will be provided as handouts during lectures and/or on the STATS 220 web site:

stat18.stat.auckland.ac.nz/stats220/2004/

All you will need is a web browser (e.g., Internet Explorer) and internet access.

In order to complete labs and assignments you will need to make use of the reference material on the 220 web site in addition to the material in the lecture notes.
We will also make use of a free text editor called Crimson Editor in Labs.

http://www.crimsoneditor.com/

This will be available in the computer labs in the Maths/Physics building; for working at home, you can either install it yourself, or just use Notepad or any other editor you like.

Be careful if using a word processor like Microsoft Word that you save files in “plain text” format.

WARNING

This course involves using a lot of computer hardware and software.

Most of this hardware and software is provided by the University of Auckland for your use during this course.

DO NOT use the hardware or software for any other purpose.

See the University of Auckland Guidelines for the Use of University Computing Facilities and Services

http://www.auckland.ac.nz/cir resources/index.cfm?action=display page&page title=resource computer guidelines

Cricinfo Example

In cricket test matches, are batsmen more likely to get out just before and/or just after reaching a century (100 runs)?

The CricInfo website (www.cricinfo.com) has records for all cricket test matches ever played.

Scorecards from CricInfo - www.cricinfo.com

Test # 1600
New Zealand in Pakistan, 2002, 1st Test
Pakistan v New Zealand
Gaddafi Stadium, Lahore
1,2,3 May 2002 (5-day match)

Result: Pakistan won by an innings and 324 runs
Series: Pakistan leads the 2-Test series 1-0

Toss: Pakistan
Umpires: SA Bucknor (WI) and RE Koertzen (SA)
TV Umpire: Aleem Dar
Match Referee: MJ Procter (SA)
Test Debut: RG Hart (NZ).
Man of the Match: Inzamam-ul-Haq

Close of Play:
Day 1: Pakistan 355/4 (Inzamam-ul-Haq 159*; 89.5 overs)
Day 2: Pakistan 643, New Zealand 58/6 (Hart 2*, Vettori 0*; 20 overs)

Pakistan 1st innings                                  R    M    B   4   6
Imran Nazir      c Richardson b McMillan            127  291  203  18   3
Shahid Afridi    c Hart b Tuffey                      0    2    1   0   0
Younis Khan      c Fleming b Vettori                 27   55   41   6   0
Inzamam-ul-Haq   c Tuffey b Walker                  329  579  436  38   9
Yousuf Youhana   c Fleming b Martin                  29   76   47   4   0
Abdur Razzaq     lbw b Tuffey                        25   44   31   6   0
+Rashid Latif    c & b Harris                         7   27   27   1   0
Saqlain Mushtaq  b McMillan                          30  104   87   3   1
*Waqar Younis    c & b McMillan                      10   19   24   2   0
Shoaib Akhtar    st Hart b Walker                    37   63   55   4   2
Danish Kaneria   not out                              4    8    3   1   0
Extras           (b 1, lb 8, w 1, nb 8)              18
Total            (all out, 157.5 overs, 638 mins)   643

First, I tried entering scores by hand:

1. Use Netscape to interactively browse to a test match page.
2. Open a text editor.
3. The data are not in a convenient format to cut-and-paste so I retyped some of the information:
   • the country the batsman played for.
   • the batsman's position in the batting order.
   • the year the test was played.
   • the batsman's score.

Here’s a small part of the file I created:

1 a 1877 165
2 a 1877 1
3 a 1877 12
4 a 1877 1
5 a 1877 15
6 a 1877 5
7 a 1877 0
8 a 1877 17
9 a 1877 18
10 a 1877 3
11 a 1877 0
1 e 1877 63
2 e 1877 7

After mucking about for a year and a half, I managed to enter 2888 scores.

Because the task was so laborious, I did not:
• record whether the batsman was “not out”

Some years later, I returned to the problem armed with some computational tools.

1. I wrote a Java program to automatically download files from the CricInfo site.