The Essentials of Data Analytics and Machine Learning
[A guide for anyone who wants to learn practical machine learning using R]

Author: Dr. Mike Ashcroft
Editor: Ali Syed

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

© 2016 Dr. Michael Ashcroft and Persontyle Limited

Module 1: Introduction

A practical introduction to machine learning and predictive models, intended to serve as a fundamental resource for advanced data scientists. You will develop an applied understanding of the principles of machine learning and be able to develop practical solutions using predictive models.

Machine Learning for Data Science and Analytics

This is a guide on practical machine learning, and it is intended to serve as a fundamental resource for advanced data scientists. But what does that mean? One problem in explaining such a sentence is that the domain of modern data analytics is plagued by a large number of near-synonyms, most colored by assumptions, associations and prejudices: data science, data analytics, statistics, machine learning, data mining, information mining, pattern recognition, artificial intelligence, and so on. Let us define our terms!

By data science we understand the general modern data analytics area. It is dominated by the application of advanced statistical techniques to copious data using powerful computers. The datasets worked with may be large or small, but the ability to get data on any and all topics is a defining feature of the age, one that would have made the statisticians of previous eras green with envy. The ability to apply advanced statistical techniques to such data in a tractable manner using modern computers would have left them speechless.
The ability to enable others without dedicated mathematical and scientific backgrounds to make use of the resulting tools to analyze incoming data, often in real time, would seem like science fiction.

Such a field intersects mathematics (primarily statistics) and computer science (programming, but sometimes also other aspects such as managing remote computing resources or distributed systems). Individual projects always benefit from, and normally require, domain expertise regarding the problem they seek to solve. Such domain expertise has uses that range from helping to contextualize data and providing sanity checks on results, to being formally encoded within statistical models to overcome data scarcity or bias. So data science, as a practice, requires mathematics, computer science and domain knowledge.

What skills in this mix might a data scientist be expected to have? An unfortunately popular diagram (reproduced here) is often used to suggest that the data scientist ought to combine all of these skill sets. Certainly, job descriptions often request applicants to have computing, statistical and domain-specific skills. The reality, though, is that it is the team working on a project that requires all these skills. Data scientists will normally interact with many different experts on many different problems. They will utilize the domain knowledge of these experts, and combine it with the skills they possess which the domain experts do not, in order to produce the best possible outcomes for the data analysis projects they work on. Indeed, criticism of the ‘one individual, all the skills’ idea has been vigorous, and we now see circulating the web various updated Venn diagrams explicitly rejecting or mocking this idea, such as that given to the left, which describes the central intersection reserved in our first diagram for data science as the location of unicorns.
Data science, this new diagram states, covers all these areas, and data scientists can have any of these skills, but will seldom have all of them. The author, Steve Geringer, warns companies that expecting otherwise will lead to unfilled positions, not to mention unfulfilled expectations.

But this disjunctive view of data science is too lenient. Someone with subject matter expertise alone is not a data scientist. Inevitably we begin to focus on the top of the Venn diagram, where computer science and math and statistics intersect. It is here that the unique skills of the data scientist are to be found – those capabilities that he or she brings to a team already likely bursting with domain expertise; those capabilities that build on the technical competencies of computer science and mathematics without being reducible to either individually. It is no surprise that we find this area labelled machine learning.

This is not to deny that there is a third, very important, component to the data scientist's skill set, but it is not domain knowledge. Rather, it is the ability to communicate with others – the ability to understand what relevant domain experts are saying, and to explain the algorithms and methods the data scientist is working with in such a way that non-experts can understand the pros and cons, possibilities and limitations, of different approaches. This is difficult. It is difficult because many fields have their own jargon, assumptions and outlooks. But it is also difficult because being able to communicate such matters requires understanding them well yourself – for a data scientist to be a skilled communicator of the potential of the field, they must have a deep knowledge of the techniques and practices of machine learning.
Aim of the Guide

The aim of this guide is both to provide a deep understanding of the techniques and practices of machine learning and to expose a wide set of resources capable of being wielded by the data scientist in their work. Readers will encounter explanations of the theory behind the algorithms and models they are exposed to, giving them an understanding of the strengths and weaknesses of each which they should be able to use to reason about suitable approaches to real-life problems – and to communicate such reasoning to other stakeholders in such problems.

In addition to, or as part of, this theory, we will see how the algorithms can be implemented. Understanding comes in different stages, and the basic understanding of being comfortable with a mathematical formula is a far shallower level than that achieved by being able to transform the dense abstract math into the closed-form instructions required to implement that math with a programming language. The specific programming language that we will work with is R, for reasons that are discussed in detail in Module 2. But of course we do not expect data scientists to build algorithms from scratch, and so we also provide the reader with information about library resources available to use when working with the algorithms and models examined, and walk through demonstrations using these libraries.

There is an unfortunate air of mystique surrounding advanced machine learning techniques. Too many data scientists and data science stakeholders fear that actual deep understanding of the algorithms is beyond them. At best, they feel, they can hope for a hand-waving gesture towards the mechanics, requirements, assumptions, and expected consequences of the mathematical techniques they work with. This is a serious problem, for without this understanding it is impossible to consistently do good work within data science.
Nor is it possible to explain why what succeeded did succeed – or, sometimes more pressingly, why what will succeed will succeed. This guide aims to show readers that they can understand these techniques. For it is a wonderful truth that the advanced machine learning techniques that are changing the world are no more than the layering of a number of simple techniques. Once you see what these simple techniques are and how they fit together, their patterns are clear and the mirage of complexity dissolves. It is a lovely experience, and one I hope all our readers experience and enjoy.

Audience

It is expected that readers are reasonably comfortable with statistics and linear algebra, and they should certainly be comfortable with the notation of both. It is also expected that readers have a solid background in programming, so that while they may be new to R they are not new to programming in general. It is certainly possible to understand this guide while lacking one or more of these assumed background competencies, but it will make the process more difficult.

This guide is aimed at a number of audiences. These include professional data scientists, and those aiming to become such; advanced undergraduate and postgraduate students; and researchers from areas outside data science looking for a guide to the utilization of these techniques in their work.

Data Science Workflow and Guide Structure

Data analytic projects follow a reasonably uniform workflow: data is acquired, it is prepared for statistical analysis, various approaches to analysis are evaluated, and eventually one is selected. This selected approach is tested and incorporated into some sort of application or library such that end users will be able to apply it to new data, or it is immediately applied to the specific new data it was designed for and then discarded.
I like to introduce this process with the following linear workflow:

STEP                           STAGE                     COMMENTS
1. Data Collection             Data Collection           Data is collected.
2. Feature Initialization      Preprocessing             Data is prepared for use within
3. Feature Selection                                     statistical learning algorithms.
4. Feature Transformation
5. Deal with Missing Data
6. Model Generation            Statistical Modeling      Statistical models are developed,
7. Model Selection                                       evaluated and selected for use.
8. Model Evaluation
9. Application Development     Application Development   An application is developed around
10. Application Roll-out                                 the chosen model, enabling end users
                                                         to use it to analyze new data.
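The stages of this workflow can be sketched end to end in R, the language used throughout this guide. This is a minimal illustration only, not a method prescribed by the guide: the simulated data, the choice of linear models, and names such as score_new_data are all invented for the example.

```r
# 1. Data collection - simulated here; in practice this might be
#    read.csv(), a database query, an API call, etc.
set.seed(1)
raw <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
raw$y <- 2 * raw$x1 - raw$x2 + rnorm(100, sd = 0.5)
raw$x2[sample(100, 5)] <- NA                   # introduce some missing values

# 2-5. Preprocessing: feature selection/transformation, missing data
dat <- na.omit(raw)                            # simplest missing-data strategy
dat[, c("x1", "x2")] <- scale(dat[, c("x1", "x2")])

# 6-8. Statistical modeling: generate candidate models, evaluate them
#      on held-out data, and select one
idx   <- sample(nrow(dat), round(0.8 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]
m1 <- lm(y ~ x1, data = train)                 # candidate models
m2 <- lm(y ~ x1 + x2, data = train)
rmse <- function(m) sqrt(mean((predict(m, test) - test$y)^2))
best <- if (rmse(m1) <= rmse(m2)) m1 else m2   # model selection

# 9-10. Application: wrap the chosen model so that end users can
#       apply it to new data
score_new_data <- function(new_df) predict(best, new_df)
```

Real projects are rarely this tidy (steps loop back on each other, and evaluation is more careful than a single hold-out split), but the skeleton above maps one statement or two onto each stage of the table.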