AutoML: Automating the Machine Learning Pipeline

Whitepaper by Ojasvi Jain, Data Science Intern at Mphasis | Kaushlesh Kumar, AVP - Applied AI, Mphasis NEXT Labs

Contents
Introduction
ML Pipeline and the Need for AutoML
Prominent Open-source AutoML Frameworks
Case Study
Benefits of AutoML
Challenges
Conclusion
References

1. Introduction
As an applied field, Machine Learning (ML) is rapidly maturing from a research- and experimentation-heavy discipline (which required significant investments) to a 'productionalized' enabler of solutions for mainstream industries. Along with 'productionalization' comes the business requirement to make solutions faster, better and less expensive. Conventional software development and deployment have adopted DevOps practices to shorten the systems development lifecycle and provide continuous delivery with high software quality. The data science practice, in turn, has adapted best practices from DevOps into its own development and deployment framework, calling it MLOps. The focus here is on collaboration, tracking and monitoring, deployment, quality of production ML and enhanced automation.

One of the areas that is both core to data science and has a significantly high cycle time is model development. Traditional ML processes are inherently time-consuming and dependent on human intervention and expertise. The "no free lunch" theorem also implies that the only scenario in which one approach outperforms another is when it is customized to the specific problem at hand. This means that model development invariably has to go through the pains of data preparation, feature engineering, model training, evaluation and selection every time a new dataset is encountered or a new problem arises.

Automated Machine Learning (AutoML) aims to automate and accelerate the process of building ML and deep learning models. It allows us to provide labelled training data as input and receive an optimized model as output. This functionality enables data scientists to generate accurate insights by leveraging the capabilities of Machine Learning. AutoML allows professionals from various domains to leverage the benefits of data science and ML. It accelerates processes, reduces errors and costs, and provides greater accuracy of results by training multiple high-performing models.

Most of a data scientist's time is invested in mundane tasks such as data pre-processing and feature engineering. Introduced to cut down the time spent on such iterative tasks, AutoML channels more time and resources into model selection and model tracking so that data scientists can deal with more complex problems. These tools have helped developers build scalable models while accelerating pipeline output.

2. ML Pipeline and the Need for AutoML
An ML pipeline is used to automate ML workflows and implement AutoML solutions.

Fig. 1: Machine Learning Pipeline (raw data → data cleaning → feature engineering → model training → hyperparameter tuning and model validation → inferencing and deployment)

The pipeline starts with ingesting raw data, which is then engineered to suit algorithm and domain requirements by leveraging data cleaning and feature engineering techniques. An ML model is then trained on this data and validated through hyperparameter tuning. The best performing model is then deployed and used in production.
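The stages in Fig. 1 map naturally onto a programmatic pipeline. As a minimal sketch of how such a pipeline might be assembled with scikit-learn, the steps of imputation, encoding, scaling, model training and hyperparameter tuning can be chained and tuned together. The dataset file, column names, estimator choice and parameter grid below are illustrative assumptions, not part of the whitepaper.

```python
# Minimal sketch of an ML pipeline: preprocessing, training and hyperparameter tuning.
# The dataset, target column and parameter grid are illustrative assumptions.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("training_data.csv")                 # hypothetical labelled dataset
X, y = df.drop(columns=["target"]), df["target"]      # "target" is a hypothetical label column

numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns

# Data cleaning and feature engineering: impute missing values, scale numeric
# features and one-hot encode categorical features.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

# Model training with hyperparameter tuning via cross-validated grid search.
pipeline = Pipeline([("preprocess", preprocess),
                     ("model", RandomForestClassifier(random_state=42))])
param_grid = {"model__n_estimators": [100, 300], "model__max_depth": [None, 10]}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Model validation on held-out data before deployment.
print("Best parameters:", search.best_params_)
print("Hold-out accuracy:", search.best_estimator_.score(X_test, y_test))
```

AutoML frameworks essentially automate the construction of such pipelines, searching over preprocessing choices, model families and hyperparameters instead of relying on a hand-written grid.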
Fig. 2: What data scientists spend the most time doing (Forbes Insights): Cleaning and organizing data 60%, Collecting data sets 19%, Mining data for patterns 9%, Other 5%, Refining algorithms 4%, Building training sets 3%

As can be seen from Fig. 2, data cleaning and preparation occupy 60 percent of a data scientist's workload. Tasks such as cleaning the data and exploring multiple models to select the most accurate one can be easily automated using AutoML frameworks. AutoML automates ML pipeline components in the following ways:

Data Preprocessing
A crucial step in the ML pipeline, data preprocessing enhances data quality and enables the user to draw meaningful insights. It also makes the data suitable for ML algorithms and enables a smooth ML pipeline. As real-world data can be inaccurate and inconsistent, it is imperative to clean, format and structure it to make it model-ready and gain relevant insights. Most ML practitioners face the challenge of processing data accurately and efficiently; it is believed that around 80 percent of a data scientist's time is spent on preprocessing the data and making it usable. Data preprocessing involves the following steps:

Handling missing values
Often, while data is being collected, some fields and values go missing. AutoML frameworks can detect missing values and apply data imputation techniques to ensure that crucial data points are not ignored by the algorithm. Depending on the data type, AutoML can impute the mean, median or mode of the column. Other techniques include imputing a constant value specified by the user, imputing the nearest-neighbour (KNN) value, or predicting the missing values of the affected column using regression algorithms.

Handling outliers
Outliers are anomalous data points that may negatively affect model results. A few methods for outlier detection and removal used in AutoML software are:
• Isolation Forest identifies and isolates anomalies instead of profiling normal data points. The process can be automated by passing a data frame through an Isolation Forest algorithm.
• Local Outlier Factor is extremely useful when the dataset has a small number of features. It harnesses the idea of nearest neighbours for outlier detection and can be leveraged for automatic detection of outliers.

Encoding
Most ML algorithms only accept numerical data, so categorical variable encoding is a fundamental step in the smooth functioning of an ML pipeline. AutoML encodes categorical variables in two ways:
• Label Encoding - for dependent categorical variables and ordinal categorical variables
• One Hot Encoding - usually for nominal independent categorical variables
Though these processes are not completely automated, they are increasingly coming under the scope of AutoML pipelines.
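As an illustration of how these preprocessing steps can be chained programmatically, the sketch below imputes missing values, removes outliers with an Isolation Forest and encodes categorical columns using pandas and scikit-learn. The file name, column names and contamination rate are hypothetical assumptions, not values taken from the whitepaper.

```python
# Minimal preprocessing sketch: imputation, outlier removal and encoding.
# File name, column names and contamination rate are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("raw_data.csv")               # hypothetical raw dataset

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Handling missing values: median for numeric columns, mode for categorical ones.
df[numeric_cols] = SimpleImputer(strategy="median").fit_transform(df[numeric_cols])
df[categorical_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[categorical_cols])

# Handling outliers: Isolation Forest flags anomalous rows as -1 so they can be dropped.
iso = IsolationForest(contamination=0.05, random_state=42)
df = df[iso.fit_predict(df[numeric_cols]) == 1]

# Encoding: label-encode a hypothetical ordinal column, one-hot encode the nominal ones.
df["education_level"] = LabelEncoder().fit_transform(df["education_level"])
df = pd.get_dummies(df, columns=[c for c in categorical_cols if c != "education_level"])
```

In a full AutoML framework, the imputation strategy, contamination threshold and encoding scheme would themselves be selected automatically based on the detected data types and distributions.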
Feature Engineering
This is the next step after data preprocessing. An essential function of any ML algorithm is to identify patterns accurately and efficiently, and for this to happen the algorithm must receive well-structured features that enhance its predictive power. Feature engineering often requires domain knowledge to define and manipulate features so that they serve a business purpose.

Feature transformation
Feature transformation refers to creating new features from existing ones. Although a few algorithms transform features internally (such as support vector machines), explicit ways of transforming features include:
• Feature Scaling and Normalizing - Scaling and normalizing techniques are essential not only for efficient computation but also for distance-based algorithms. Based on the data type, AutoML can automate scaling and normalizing.
• Aggregation - This combines values in a feature to reduce variability in the data.
• Filtering - This is mostly used in time series to regulate the frequency of the dataset by determining threshold values.

Feature selection
Feature selection techniques choose a specific set of variables that are appropriate for subsequent analysis. The goal is to come up with the smallest set of features that best represents the characteristics of the data. A few methods used for feature selection are:
• Filter Method - The fastest method, which eliminates highly correlated variables. It uses metrics such as Pearson's correlation to identify variables that are highly correlated and automatically eliminates one of them.
• Wrapper Method - The most accurate method, but very computationally intensive. It consists of algorithms such as Forward Selection, Backward Elimination, Recursive Feature Elimination and Bi-directional Elimination.
• Embedded Method - Embedded methods include regularization techniques such as Ridge Regression, Lasso Regression and ElasticNet Regression. These techniques are the most commonly used in AutoML frameworks and aim to reduce variance during model training.

Dimensionality reduction
This reduces high-dimensional data to a more manageable representation, improving not only the performance but also the accuracy of ML algorithms. One of the most common ways to perform dimensionality reduction is Principal Component Analysis (PCA), which aims to capture the maximum amount of variability in the dataset and map it to a smaller number of dimensions.

However, feature engineering can be tricky since it largely depends on the data type of the variable and the application. For instance, ordinal variables can be considered categorical as well as discrete. In the case of a categorical data type, scaling and normalizing reduces the interpretability of the data and is not ideal. Thus, even though scaling and normalizing can be automated for incorporation into AutoML frameworks, some kind of human intervention is still needed.

Model Training and Hyperparameter Tuning
Once the features are ready, the next step is to train a model. Model training encompasses a huge variety of ML algorithms that can be leveraged to obtain the desired predicted value. Each model has its benefits and is fit to be used in specific use cases. An important aspect of model training is model validation and hyperparameter tuning. It is essential that the ML model works