Wrangling F1 Data With R: A Data Junkie's Guide

Tony Hirst

This book is for sale at http://leanpub.com/wranglingf1datawithr

This version was published on 2016-04-02

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and many iterations to get reader feedback, pivot until you have the right book, and build traction once you do.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Tweet This Book!

Please help Tony Hirst by spreading the word about this book on Twitter! The suggested hashtag for this book is #f1datajunkie. Find out what other people are saying about the book by clicking on this link to search for this hashtag on Twitter: https://twitter.com/search?q=#f1datajunkie

Thanks… Just because… (sic)

Acknowledgements

FORMULA 1, FORMULA ONE, F1, FIA FORMULA ONE WORLD CHAMPIONSHIP, GRAND PRIX, F1 GRAND PRIX, FORMULA 1 GRAND PRIX and related marks are trademarks of Formula One Licensing BV, a Formula One group company. All rights reserved.

The Ergast Developer API is an experimental web service which provides a historical record of motor racing data for non-commercial purposes. http://ergast.com/mrd/

R: A Language and Environment for Statistical Computing is produced by the R Core Team/R Foundation for Statistical Computing, http://www.R-project.org

RStudio™ is a trademark of RStudio, Inc. http://www.rstudio.com/

ggplot2: elegant graphics for data analysis, Hadley Wickham, Springer New York, 2009. http://ggplot2.org/

knitr: A general-purpose package for dynamic report generation in R, Yihui Xie. http://yihui.name/knitr/

Leanpub: flexible ebook posting, and host of this book at http://leanpub.com/wranglingf1datawithr/

Errata

An update of the RSQLite package to version 1.0.0 (25/10/14) required the following changes to be made to earlier editions of this book:

### COMMENT OUT THE ORIGINAL SET UP
#require(RSQLite)
#ergastdb = dbConnect(drv='SQLite', dbname='./ergastdb13.sqlite')

### REPLACE WITH:
require(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')
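As a quick check that the revised setup works, you can list the tables in the database and run a simple query against the connection. This is a minimal sketch only: it assumes the connection above succeeded, and that the database contains the races table (with name and date columns) from the ergast database schema.

# List the tables available via the new-style connection
dbListTables(ergastdb)

# Run a simple test query - the races table and its name/date columns
# are assumptions based on the ergast database schema
dbGetQuery(ergastdb, 'SELECT name, date FROM races LIMIT 5')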
Contents

Acknowledgements
Errata
Foreword
  A Note on the Data Sources
  The Lean and Live Nature of This Book
Introduction
  Preamble
  What are we trying to do with the data?
  Choosing the tools
  The Data Sources
  Additional Data Sources
  Getting the Data into RStudio
  Example F1 Stats Sites
  How to Use This Book
  The Rest of This Book…
An Introduction to RStudio and R dataframes
  Getting Started with RStudio
  Getting Started with R
  Summary
Accessing Data from the ergast API
  Summary
Practice Session Utilisation
  Session Utilisation Charts
  Finding Purple and Green Times
  Stint Detection
  Revisiting the Session Utilisation Chart - Annotations
  Session Summary Annotations
  Session Utilisation Lap Delta Charts
  Summary
  Useful Functions Derived From This Chapter
A Quick Look at Qualifying
  Qualifying Progression Charts
  Improving the Qualifying Session Progression Tables
  Qualifying Session Rank Position Summary Chart - Towards the Slopegraph
  Rank-Real Plots
  Ultimate Laps
  Summary
From Battlemaps to Track Position Maps
  Identifying Track Position From Accumulated Laptimes
  Calculating DIFF and GAP times
  Battles for a particular position
  Generating Track Position Maps
  Summary
Keeping an Eye on Competitiveness - Tracking Churn
  Calculating Adjusted Churn - Event Level
  Calculating Adjusted Churn - Across Seasons
  Taking it Further
  Summary
End of Season Showdown
  Modeling the Points Effects of the Final Championship Race
  Visualising the Outcome
  Summary
Comparing Intra-Team Driver Performances
  Intra-Team League Tables
  Race Performance
  Summary
Points Performance Charts
  Maximising Team Points Hauls
  Intra-Team Support
  Points Performance Charts - One-Way
  Points Performance Charts - Two-Way
  Summary

Foreword

For several years I've spent Formula One race weekends dabbling with F1 data, posting the occasional review on the F1DataJunkie website (f1datajunkie.com). If the teams can produce updates to their cars on a fortnightly or even weekly basis, I thought I should be able to push my own understanding of data analysis and visualisation at least a little way over the one hour TV prequel to each qualifying session and in the run up to each race.

This book represents a review of some of those race weekend experiments. Using a range of data sources, I hope to show how we can use powerful, and freely available, data analysis and visualisation tools to pull out some of the stories hidden in the data that may not always be reported. Along the way, I hope to inspire you to try out some of the techniques for yourself, as well as developing new ones. And in much the same way that Formula One teams pride themselves on developing technologies and techniques that can be used outside of F1, you may find that some of the tools and techniques I'll introduce in these pages are useful to you in your own activities away from F1 fandom. If they are, let me know, and maybe we can pull the ideas together in an #F1datajunkie spinoffs book!

Indeed, the desire to learn plays a significant part in my own #f1datajunkie activities. Formula One provides a context, and authentic, real world data sets, for exploring new-to-me data analysis and visualisation techniques that I may be able to apply elsewhere. The pace of change of F1 drives me to try new things out each weekend, building on what I have learned already. But at the end of the day, if any of my dabblings don't work out, or I get an analysis wrong, it doesn't matter so much: after all, this is just recreational data play, a conversation with the data where I can pose questions and get straightforward answers back, and hopefully learn something along the way.

A Note on the Data Sources

Throughout this book, I'll be drawing on a range of data sources. Where the data is openly licensed, such as the ergast motor racing results API (http://ergast.com) maintained by Chris Newell, I will point you to places where you can access it directly. For the copyrighted data, an element of subterfuge may be required: I can tell you how to grab the data, but I can't share a copy of it with you. On occasion, I may take exception to this rule and point you to an archival copy of the data I have made available for the purpose of reporting, personal research and/or criticism.

To make it easier to get started working with the datasets, I have put together a set of f1datajunkie Docker containers that together make up a lightweight virtual machine that contains all you need to get started analysing and visualising the data. In addition, source code for many of the analyses contained within this book can be found in the f1datajunkie Github repository. Source code - and data sets (as .sqlite files) - are available from: https://github.com/psychemedia/wranglingf1datawithr/tree/master/src

Note that the contents of this repository may lag the contents of this book quite significantly.
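By way of a quick taster, here is a minimal sketch of one way of pulling data from the ergast API directly into R. It assumes the jsonlite package is installed; the URL is the ergast API's JSON results endpoint for the 2013 season, and the path into the returned list reflects the general structure of ergast JSON responses rather than anything specific to this book's own code. The chapter on Accessing Data from the ergast API covers this in much more detail.

# Grab the 2013 race results from the ergast API as JSON
# (assumes the jsonlite package is installed)
require(jsonlite)
results.json = fromJSON('http://ergast.com/api/f1/2013/results.json?limit=1000')

# Drill down into the response - the MRData/RaceTable/Races path is
# based on the structure of ergast JSON responses
races = results.json$MRData$RaceTable$Races
str(races)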
The Lean and Live Nature of This Book

This book was originally published using Leanpub (leanpub.com), a "lean book production" website that allows books to be published as they are being written. This reduced the production time, allowed the book to be published in an incremental way, and allows any errors that are identified to be corrected as soon as they are spotted; purchasers of the book on the Leanpub site get access to updates of the book, as they are posted, for no additional cost.

This book has also been generated from a "live document" in the sense that much of the content was generated automatically from code. All the