
Introduction

The Essentials of Data Analytics and Machine Learning

[A guide for anyone who wants to learn practical machine learning using R]

Author: Dr. Mike Ashcroft Editor: Ali Syed

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

© 2016 Dr. Michael Ashcroft and Persontyle Limited

Essentials of Data Analytics and Machine Learning 1

INTRODUCTION Module 1

A practical introduction to machine learning and predictive models, intended to serve as a fundamental resource for advanced data scientists. You will develop an applied understanding of the principles of machine learning and be able to develop practical solutions using predictive models.

Machine Learning for Data Science and Analytics

This is a guide on practical machine learning, and it is intended to serve as a fundamental resource for advanced data scientists.

But what does that mean? One problem in explaining such a sentence is that the domain of modern data analytics is plagued by a large number of near-synonyms, most colored by assumptions, associations and prejudices: data science, data analytics, machine learning, data mining, pattern recognition, artificial intelligence, etc. Let us define our terms!

By data science we understand the general area of modern data analytics. It is dominated by the application of advanced statistical techniques to copious data using powerful computers. The datasets worked with may be large or small, but the ability to get data on any and all topics is a defining feature of the age, and one that would have made the practitioners of previous eras green with jealousy. The ability to apply advanced statistical techniques to such data in a tractable manner using modern computers would have left them speechless. The ability to enable others without dedicated mathematical and scientific backgrounds to make use of the resulting tools to analyze incoming data, often in real time, would have seemed like science fiction.

Such a field intersects mathematics (primarily statistics) and computer science (programming, but sometimes also other aspects such as managing remote resources or distributed systems). Individual projects always benefit from, and normally require, domain expertise regarding the problem they seek to solve. Such domain expertise has uses that range from helping to contextualize data and provide sanity checks on results, to being formally encoded within statistical models to overcome data scarcity or bias.

So data science, as a practice, requires mathematics, computer science and domain knowledge. What skills in this mix might a data scientist be expected to have? An unfortunately popular diagram (reproduced here) is often used to suggest that the data scientist ought to combine all of these skill sets. Certainly, job descriptions often ask applicants to have computing, statistical and domain-specific skills.

The reality, though, is that it is the team working on a project that requires all these skills. Data scientists will normally interact with many different experts on many different problems. They will draw on the domain knowledge of these experts and combine it with skills that those experts lack, in order to produce the best possible outcomes for the projects they work on.

Indeed, criticism of the ‘one individual, all the skills’ idea has been vigorous, and we now see circulating on the web various updated Venn diagrams explicitly rejecting or mocking this idea, such as that given to the left, which describes the central intersection reserved in our first diagram for data science as the location of unicorns. Data science, this new diagram states, covers all these areas, and data scientists can have any of these skills, but will seldom have all of them. The author, Steve Geringer, warns companies that expecting otherwise will lead to unfilled positions, not to mention unfulfilled expectations.

But this disjunctive view of data science is too lenient. Someone with subject matter expertise alone is not a data scientist. Inevitably we begin to focus on the top of the Venn diagram, where computer science and mathematics and statistics intersect. It is here that the unique skills of the data scientist are to be found: those capabilities that he or she brings to a team already likely bursting with domain expertise, and that build on the technical competencies of computer science and mathematics without being reducible to, or reproducible by, either individually. It is no surprise that we find this area labelled machine learning.

This is not to deny that there is a third, very important, component to the data scientist's skill set, but it is not domain knowledge. Rather, it is the ability to communicate with others: the ability to understand what relevant domain experts are saying, and to explain the algorithms and methods the data scientist is working with in such a way that non-experts can understand the pros and cons, possibilities and limitations, of different approaches. This is difficult. It is difficult because many fields have their own jargon, assumptions and outlooks. But it is also difficult because being able to communicate such matters requires understanding them well yourself. For a data scientist to be a skilled communicator of the potential of the field, they must have a deep knowledge of the techniques and practices of machine learning.

Aim of the Guide

The aim of this guide is both to provide a deep understanding of the techniques and practices of machine learning and to expose a wide set of resources that the data scientist can wield in their work. Readers will encounter explanations of the theory behind the algorithms and models they are exposed to, giving them an understanding of the strengths and weaknesses of each, which they should be able to use to reason about suitable approaches to real-life problems, and to communicate such reasoning to other stakeholders in such problems. In addition to, or as part of, this theory, we will see how the algorithms can be implemented. Understanding comes in different stages, and the basic understanding of being comfortable with a mathematical formula is far shallower than that achieved by being able to transform the dense abstract math into the closed-form instructions required to implement that math with a programming language. The specific programming language that we will work with is R, for reasons that are discussed in detail in module 2. But of course we do not expect data scientists to build algorithms from scratch, so we also provide the reader with information about library resources available to use when working with the algorithms and models examined, and walk through demonstrations using these libraries.

There is an unfortunate air of mystique surrounding advanced machine learning techniques. Too many data scientists and data science stakeholders fear that an actual deep understanding of the algorithms is beyond them. At best they can hope for a hand-waving gesture towards the mechanics, requirements, assumptions, and expected consequences of the mathematical techniques they work with. This is a serious problem, for without this understanding it is impossible to consistently do good work within data science. Nor is it possible to explain why what succeeded did succeed or, sometimes more pressingly, why what will succeed will succeed.

This guide aims to show readers that they can understand these techniques. For it is a wonderful truth that the advanced techniques of machine learning that are changing the world are no more than the layering of a number of simple techniques. Once you see what these simple techniques are and how they fit together, their patterns are clear and the mirage of complexity dissolves. It is a lovely experience, and one I hope all our readers will have and enjoy.

Audience

It is expected that readers are reasonably comfortable with statistics and linear algebra, and they should certainly be comfortable with the notations of both. It is also expected that readers have a solid background in programming, so that while they may be new to R they are not new to programming in general. It is certainly possible to understand this guide while lacking one or more of these assumed background competencies, but it will make the process more difficult.

This guide is aimed at a number of audiences. These include professional data scientists and those aiming to become such; advanced undergraduate and postgraduate students; and researchers from areas outside data science looking for a guide to the utilization of these techniques in their work.


Data Science Workflow and Guide Structure

Data analytic projects follow a reasonably uniform workflow: data is acquired, it is prepared for statistical analysis, the various approaches to analysis are evaluated and eventually one is selected. This selected approach is tested and incorporated into some sort of application or library such that end users will be able to apply it to new data, or it is immediately applied to the specific new data it was designed for and then discarded.

I like to introduce this process with the following linear workflow:

STEP NAME                     STAGE                     COMMENTS
1.  Data Collection                                     Data is collected.
2.  Feature Initialization    Preprocessing             Data is prepared for use within statistical learning algorithms.
3.  Feature Selection         Preprocessing
4.  Feature Transformation    Preprocessing
5.  Deal with Missing Data    Preprocessing
6.  Model Generation          Statistical Modeling      Statistical models are developed, evaluated and selected for use.
7.  Model Selection           Statistical Modeling
8.  Model Evaluation          Statistical Modeling
9.  Application Development   Application Development   Application is developed around the chosen model, enabling end users to use it to analyze new data.
10. Application Roll-out      Application Development
11. Application Use           Application Use           Application must be supported and maintained.

Of course, this linearity is idealized. Features can be selected on the basis of the performance of models created from previous sets of features. Dealing with missing data often requires the use of statistical models. And so on. But there is much truth in the idealization, and even more pedagogical value for those whose experience with data science has not yet been of an applied type.
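As a rough illustration of how steps 2 to 8 of the workflow above might look in R, consider the following minimal sketch. It is not taken from this guide's modules: the dataset (R's built-in mtcars), the chosen features, the log transform, the two candidate formulas, the 16/8/8 split and the helper name rmse are all assumptions made purely for illustration.

```r
# Illustrative sketch only: every modeling choice below is an assumption.
set.seed(1)  # make the random split repeatable

# Steps 2-3: feature initialization and selection (columns chosen arbitrarily)
dat <- mtcars[, c("mpg", "wt", "hp", "disp")]

# Step 4: feature transformation (a log transform, as an example)
dat$log_hp <- log(dat$hp)

# Step 5: deal with missing data (mtcars has none; this is a placeholder)
dat <- dat[complete.cases(dat), ]

# Split the 32 rows into training, validation and test sets
idx   <- sample(nrow(dat))
train <- dat[idx[1:16], ]
valid <- dat[idx[17:24], ]
test  <- dat[idx[25:32], ]

# Step 6: model generation (two hypothetical candidate linear models)
candidates <- list(
  small = mpg ~ wt,
  large = mpg ~ wt + log_hp + disp
)
fits <- lapply(candidates, function(f) lm(f, data = train))

# Root mean squared prediction error of a fitted model on new data
rmse <- function(fit, newdata) {
  sqrt(mean((newdata$mpg - predict(fit, newdata))^2))
}

# Step 7: model selection using the validation set
valid_rmse <- sapply(fits, rmse, newdata = valid)
best <- names(which.min(valid_rmse))

# Step 8: model evaluation of the selected model on the untouched test set
test_rmse <- rmse(fits[[best]], test)
print(valid_rmse)
print(c(selected = best))
print(test_rmse)
```

Keeping the test set out of the selection step is a design choice in this sketch: it means the final error estimate is not biased by the choice among candidates.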

This workflow also structures the guide. In delimiting the area of machine learning within data science, we have trimmed the ends of this process, ignoring the preliminary task of acquiring data and the final tasks of developing, rolling out and supporting applications built on top of the statistical models whose generation we will focus on. This is not because these tasks are not part of the proper domain of the data scientist: it is quite likely that you will be directly involved in them on at least some of the projects you work on as a data scientist. Rather, it is because they are not part of the proper domain of machine learning, which is what this guide will focus on.

Of the remaining, middle sections of the workflow, we distinguish between those that involve preparing the collected data for use in statistical modeling techniques, and those that involve applying these techniques and selecting and evaluating the resulting models. The former, shaded in green, can be grouped as the pre-processing stage. The latter, shaded in blue, are grouped in the statistical modeling stage. These two stages are reflected in this guide, with preprocessing steps and techniques covered in modules 3-6, and the various statistical modeling techniques and methods for evaluating the models generated by these techniques taking up modules 7-18. As we proceed it will be more and more common for the increasingly sophisticated algorithms discussed to make use of, and assume understanding of, algorithms covered in earlier modules.

Prior to both of these sections is module 2, which provides an overview of the basic functionality of the R programming language. We will be using R throughout this guide to demonstrate the implementation of various algorithms, and to explore the available libraries that can be used to work with the algorithms examined without manually implementing them.

It should be noted that module 7, on missing data, is not entirely well suited to the framework adopted. Although part of pre-processing, it makes considerable use of algorithms introduced only later in the statistical modeling modules. Accordingly, it may make sense to delay reading module 7 until after the remaining modules.

Style of the Guide

Within each module we will examine a number of techniques. This examination will typically have three parts:

i. A theoretical overview: Here the theoretical background for the algorithms involved will be explained.
ii. An example using manual implementation: Here we work through an example applying the algorithms to an actual problem. The algorithms are manually implemented so as to show how the abstract mathematics can be simply and elegantly implemented using R.
iii. An example using a library implementation: Here we propose R packages that already implement the algorithms and use these to work through either the same example or a new, second example.

It is hoped that this tripartite structure will allow for a deeper theoretical understanding and provide a means of connecting that understanding to implementation and algorithmic choices. Further, we hope it will allow those who come to machine learning and data science from either a mathematical or a computer science background to engage with the material primarily through their preferred medium. Of course, the introduction to library implementations is designed to give the reader a ready-made tool box to take with them to their work, study or research.
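To give a flavour of the second and third parts, here is a small sketch contrasting a manual implementation with a library one. Ordinary least squares is used only as a familiar stand-in, and the mtcars data and the choice of predictors are assumptions for illustration; none of this is drawn from a specific module of the guide. The manual version solves the normal equations (X'X) b = X'y directly, while the library version calls base R's lm().

```r
# Illustrative sketch only: OLS on mtcars is an assumed example.

# Manual implementation: solve the normal equations (X'X) b = X'y
X <- cbind(Intercept = 1, wt = mtcars$wt, hp = mtcars$hp)  # design matrix
y <- mtcars$mpg                                            # response vector
beta_manual <- solve(crossprod(X), crossprod(X, y))        # 3 x 1 matrix

# Library implementation: base R's lm() fits the same linear model
fit <- lm(mpg ~ wt + hp, data = mtcars)
beta_library <- coef(fit)

# The two sets of coefficients should agree up to numerical error
print(cbind(manual = as.numeric(beta_manual),
            library = as.numeric(beta_library)))
```

Note that lm() uses a QR decomposition internally instead of forming X'X, which is numerically more stable; the manual version is written for transparency, not robustness.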
