Business white paper

Unlock machine learning for the new speed and scale of business: In-database machine learning with the HPE Vertica Analytics Platform

Table of contents

Machine learning: A competitive advantage
  What is machine learning, and why is it important?
  Purpose of this white paper
Types of machine learning
Barriers to applying machine learning at scale
HPE Vertica Analytics Platform delivers predictive analytics at speed and scale
Implementing machine learning with the HPE Vertica Analytics Platform
  In-database machine learning
  User-defined extensions (UDxs)
HPE Vertica Analytics Platform in-database machine learning functions
  K-means clustering
  Logistic regression
  Linear regression
  Naïve Bayes
  Data preparation and normalization
Real-world implementation of Vertica's in-database machine learning features
Use cases for machine learning across different industries
Summary

Machine learning: A competitive advantage

In today's data-driven world, creating a competitive advantage depends on your ability to transform massive volumes of data into meaningful insights. Companies that use advanced analytics and machine learning are twice as likely to be top-quartile financial performers, and three times more likely to execute effective decisions [1]. However, nearly 50% of business leaders admit to not knowing how to get value from their data, and only 30% [2] are able to act upon their information in real time.

Fully leveraging Big Data can provide an in-depth understanding of customer behavior, enabling you to personalize the consumer experience, prevent churn, detect fraud, and increase your bottom line. Unfortunately, the growing velocity, volume, and variety of data have increased the complexity of building predictive models, since few tools are capable of processing these massive data sets at the speed of business. Vertica's in-database machine learning, on the other hand, allows you to embrace the power of Big Data and accelerate business outcomes with no limits and no compromises.

What is machine learning, and why is it important?

Machine learning is gaining popularity as an essential way of not only identifying patterns and relationships, but also predicting outcomes. This is creating a fundamental shift in the way businesses are operating—from being reactive to being proactive. Machine learning enables processes and analyses that were previously too complex to handle manually. Comprising a variety of algorithms and statistical models to process and correlate data, machine learning helps data scientists and analysts unlock valuable insights from seemingly random or unrelated data sets. These insights—about customers, value chain processes, manufacturing inputs, and more—can be used to drive business decisions and improve your competitive standing. Simply put, machine learning algorithms facilitate more informed organizational decision making, and should be an essential input to your daily business functions, products, and operations.

To capture the full value of Big Data, however, we need to change the scale and speed at which these machine learning algorithms are modeled, trained, and deployed. Traditional analytics tools no longer scale in the world we live in. Today's data volumes are far too large for many traditional machine learning tools to handle, and the relationships between large, disparate data sources—from back-end customer data to clickstream behavior—are far too complex for traditional tools to derive all of the hidden value available.

1. "Creating Value through Advanced Analytics." Bain & Company, February 2015.
2. emc.com/about/news/press/2015/20150416-01.htm

"In the same way that a bank without databases can't compete with a bank that has them, a company without machine learning can't keep up with the ones that use it." – "The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World" by Pedro Domingos, September 2015.

New in-database machine learning methods free data scientists from the volume constraints of traditional tools and enable them to discover and display patterns buried in ever-larger data sets. And as the volume of data ingested increases, the level and sophistication of learning also increases, resulting in more accurate predictions that can be turned into better customer service, superior products, and a competitive advantage.

In 2016, Google™ AlphaGo made history by becoming the first program to beat a professional player of Go—an ancient Chinese game with more possible board configurations than there are atoms in the universe [3]. How did it do it? By using machine learning algorithms to study a database of online Go matches—equivalent to the experience one would gain by playing Go for 80 years without a break.

The same is true of Google's self-driving car. According to Pedro Domingos, one of the world's leading experts on machine learning, "a self-driving car is not programmed to drive itself. Nobody actually knows how to program a car to drive. We know how to drive, but we can't even explain it to ourselves. The Google car learned by driving millions of miles, and observing people driving" [4].

Using the same machine learning algorithms, Netflix offers movie recommendations based on your viewing history. Amazon makes product recommendations based on your purchasing trends. Facebook identifies and tags your friends’ faces in the photos you upload, and the list goes on. The benefit? A personalized consumer experience, increased loyalty, reduced churn, and a greater share of wallet. All by leveraging the power of machine learning.

Purpose of this white paper

With the ever-increasing volume, velocity, and variety of data, it's critical that organizations discover the hidden insights contained in Big Data to create competitive differentiation and increase profitability. This white paper explains how in-database machine learning eliminates the constraints of small-scale analytics when applied to extremely large data sets for increased accuracy at lightning speeds. It also explains the unique benefits of HPE Vertica's in-database machine learning algorithms and how these complement traditional tools (C++, Java, Python, and R).

3. "Google just made artificial-intelligence history." Business Insider, March 2016.
4. "The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World." Pedro Domingos, September 2015.

Types of machine learning

Machine learning algorithms can be divided into two broad categories: supervised learning and unsupervised learning. These are the two most widely adopted methods of machine learning, and HPE Vertica includes built-in algorithms for both.

• Supervised learning is used in cases where all data is labeled and the algorithms learn to predict the output from the input data. The learning algorithm ingests a set of training inputs together with a corresponding set of correct outputs. It then learns by comparing the actual output with the correct outputs, finding errors, and modifying the model accordingly. The process is repeated until the model provides the desired level of accuracy based on the training data. Supervised learning often uses historic data to predict future events—for example, the best customer demographic to target for a promotion based on past buying behavior, or predicting credit scores based on past financial behavior. Popular algorithms include decision trees, Naïve Bayes classification, random forest, linear regression, and logistic regression.
• Unsupervised learning is used in cases where all data is unlabeled and the algorithms learn the inherent structure from the input data. Unlike supervised learning, there are no correct answers and no "teacher." The algorithm is responsible for analyzing the data and identifying a pattern—for example, customers with similar characteristics that can be collectively targeted in a marketing campaign. Popular classifications of techniques for unsupervised learning include association rule learning and clustering techniques such as hierarchical clustering and k-means.

Barriers to applying machine learning at scale

There are several challenges when it comes to applying machine learning to the massive volumes of data that organizations collect and store. Predictive analytics can be complex, especially when Big Data is added to the mix. Since larger data sets yield more accurate results, high-performance, distributed, and parallel processing is required to obtain insights at the speed of business. In addition, machine learning algorithms must be rewritten to take advantage of modern distributed and parallel engines.

Traditional machine learning applications and tools require data scientists to build and tune models using only small subsets of data (called down-sampling), often resulting in inaccuracies, delays, increased costs, and slower access to critical insights:
• Slower development: Delays in moving large volumes of data between systems increase the amount of time data scientists spend creating predictive analytics models, which delays time-to-value.
• Inaccurate predictions: Since large data sets cannot be processed due to the memory and computational limitations of traditional methods, only a subset of the data is analyzed, reducing the accuracy of subsequent insights and putting at risk any business decisions based on those insights.
• Delayed deployment: Owing to complex processes, deploying predictive models into production is often slow and tedious, jeopardizing the success of Big Data initiatives.
• Increased costs: Additional hardware, software tools, and administrator and developer resources are required for moving data, building duplicate predictive models, and running them on multiple platforms to obtain the desired results.

HPE Vertica Analytics Platform delivers predictive analytics at speed and scale

Capable of storing large amounts of diverse data and offering key built-in machine learning algorithms, the HPE Vertica Analytics Platform eliminates or minimizes many of these barriers. Built from the ground up to handle massive volumes of data, HPE Vertica is designed specifically to address the challenges of Big Data analytics using a balanced, distributed, compressed columnar paradigm.

Massively parallel processing enables data to be handled at petabyte scale for your most demanding use cases. Its columnar store capabilities provide data compression, reducing Big Data analytics query times from hours to minutes or minutes to seconds, compared to legacy technologies. In addition, as a full-featured analytics system, HPE Vertica provides advanced SQL-based analytics including pattern matching, geospatial analytics, Monte Carlo simulations, and many more capabilities.

As an optimized platform enabling advanced predictive modeling to be run from within the database and across large data sets, HPE Vertica eliminates the need for data duplication and processing on alternative platforms—typically requiring multi-vendor offerings—that add complexity and cost. Now that same speed, scale, and performance used for SQL-based analytics can be applied to machine learning algorithms, with both running on a single system for additional simplification and cost savings.

Implementing machine learning with the HPE Vertica Analytics Platform

Machine learning is most effective when applied to very large data sets and is thus a natural fit for Vertica, which is designed for fast processing of Big Data. There are two primary ways of deploying machine learning capabilities in HPE Vertica: Vertica's in-database machine learning, and user-defined extensions (UDxs).

In-database machine learning

With Vertica's in-database machine learning algorithms, some of the most commonly used machine learning models can be created and deployed natively for analyzing large data sets, accelerating decision making with pinpoint accuracy. Built into Vertica's core—with no need to download and install separate packages—the in-database machine learning algorithms deliver:
• Scalability: While most external tools like R and Python have limitations on the size of the data set they can handle—forcing users to down-sample for analysis and reducing the benefits of analyzing large volumes of data—Vertica's in-database machine learning takes advantage of the larger supported data sets to provide more accurate insights.
• Simplicity: Vertica's native ingest, data preparation, and model management features cover the entire lifecycle, eliminating the need to export and load data into another tool for analysis and then export the results back into Vertica. In addition, users can train, test, and deploy machine learning models using a familiar, SQL-like interface, without having to learn new techniques or hire expensive resources with niche skills.
• Speed: Vertica's in-database machine learning accelerates time-to-insight by leveraging Vertica's massively parallel processing (MPP) architecture, including the use of multiple nodes in the cluster for faster computation when required.

User-defined extensions (UDxs)

HPE Vertica connects to hundreds of applications, data sources, ETL tools, and visualization modules, and what it doesn't connect to out of the box can be easily integrated using a UDx. A UDx allows you to develop your own analytic or data-loading tools for the HPE Vertica Analytics Platform, including new types of data analysis and the ability to parse and load new types of data. UDxs are developed in the C++, Java, Python, or R programming languages using the Vertica SDK, and they're best suited for analytic operations that are typically difficult or slow to perform using standard SQL.
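To make this concrete, here is a minimal sketch of how a compiled UDx might be deployed and invoked. The library file (mylib.so) and factory class (Add2IntsFactory) are hypothetical stand-ins for whatever you build with the Vertica SDK:

-- Load the compiled UDx library into Vertica (path and names are illustrative).
=> CREATE LIBRARY mylib AS '/home/dbadmin/mylib.so';

-- Register a scalar function exported by the library's factory class.
=> CREATE FUNCTION add2ints AS LANGUAGE 'C++' NAME 'Add2IntsFactory' LIBRARY mylib;

-- Once registered, the extension is called like any built-in SQL function.
=> SELECT add2ints(2, 3);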

The broad array of user-defined capabilities (functions, transforms, aggregates, analytics, and loading) leverages the MPP capabilities of Vertica, increasing the power and flexibility of procedural code by bringing it closer to the data (structured or semi-structured). HPE Vertica's user interface makes it easy to deploy and use procedural extensions, simplifying operational practices and promoting code reuse. However, although UDxs can expand data analytics functionality within Vertica, they don't match the speed or scale of Vertica's in-database machine learning capabilities.

HPE Vertica Analytics Platform in-database machine learning functions

Vertica provides several machine learning functions for performing in-database predictive analytics on large data sets to increase accuracy of predictions and accelerate access to hidden insights. Some of the major use cases for machine learning applications revolve around classification, clustering, and prediction. Vertica's built-in machine learning algorithms cover all of these areas with k-means, linear regression, logistic regression, and Naïve Bayes.

K-means clustering

Clustering detects natural groupings within the data; items within a cluster have more in common with each other than with items outside of the cluster.

The k-means algorithm is a type of unsupervised learning algorithm, meaning the input data is unlabeled. The purpose of k-means is to partition n observations into k clusters.

The algorithm takes the unlabeled data and clusters the data points into k different clusters based on similarities between the data points. Through this partitioning, k-means assigns each observation to the cluster with the nearest mean. That nearest mean is also known as the cluster center. The resulting model can be used to add new data to existing clusters later by assigning it to the cluster with the nearest mean or cluster center.

K-means has a wide range of applications, including:
• Customer segmentation: Segment customers and buyers into distinct groups (clusters) based on similar attributes such as age, income, or product preferences, to target promotions, provide support, and explore cross-sell opportunities.
• Fraud detection: Identify individual observations that don't align to a distinct group (cluster) and identify types of clusters that are more likely to be at risk of fraudulent behavior.

You can apply the k-means algorithm using basic SQL syntax:

kmeans ('model_name', 'input_table', 'input_columns', num_clusters)

The optional parameters for this function are columns to be excluded, epsilon value, maximum number of iterations, initial cluster center determination method, table with initial cluster centers, distance method, output view, and key columns.
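As an illustration, here is a minimal sketch of training, summarizing, and applying a k-means model for customer segmentation. The table and column names (public.customers, age, income) are hypothetical, and the parameter style mirrors the other examples in this paper:

-- Train a k-means model with 4 clusters on two numeric columns.
=> SELECT KMEANS ('customers_km', 'public.customers', 'age, income', 4
                  USING PARAMETERS max_iterations=20);

-- Review the cluster centers and evaluation metrics.
=> SELECT SUMMARIZE_MODEL ('customers_km');

-- Assign each row to its nearest cluster center.
=> SELECT APPLY_KMEANS (age, income
                        USING PARAMETERS OWNER='dbadmin',
                                         MODEL_NAME='customers_km')
   FROM public.customers;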

Function name    Description
kmeans           Find k cluster centers for an input table/view, optionally outputting each row with its assigned cluster
summarize_model  Display the cluster centers along with some evaluation metrics
apply_kmeans     Given an existing k-means model, assign rows from an input table to the correct cluster

Logistic regression

The logistic regression algorithm is used to classify data into groups based on the logical relationship between independent variables, or features, and some dependent variable, or outcome. The outcome of logistic regression is a binary value that represents an outcome such as true/false, pass/fail, yes/no, or 1/0. You can build logistic regression models to:
• Use a loan applicant's credit history, income, and loan conditions (predictors) to determine the probability that the applicant will default on a loan (response). The result can be used for approving, denying, or changing loan terms.
• Predict the likelihood that a particular mechanical part of a system will malfunction or require maintenance (response) based on operating conditions and diagnostic measurements (predictors).
• Determine the likelihood of a patient's successful response to a particular medicine or treatment (response) based on factors like age, blood pressure, and smoking and drinking habits (predictors).

You can apply the logistic regression algorithm using basic SQL syntax:

logistic_reg ('model_name', 'input_table', 'response_column', 'predictor_columns')

The optional parameters for this function are columns to be excluded, optimizer type, epsilon value, maximum number of iterations, and type of output desired (binary, probability).
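For instance, a loan-default model along the lines of the first bullet above might be trained and scored as follows. The table and column names are hypothetical, and the parameter style mirrors the worked example later in this paper:

-- Train a logistic regression model to estimate the probability of loan
-- default (defaulted is a 0/1 response column).
=> SELECT LOGISTIC_REG ('loan_default_model', 'public.loan_applications',
                        'defaulted', 'credit_score, income, loan_amount');

-- Score new applications with the trained model.
=> SELECT PREDICT_LOGISTIC_REG (credit_score, income, loan_amount
                                USING PARAMETERS OWNER='dbadmin',
                                                 MODEL_NAME='loan_default_model')
   FROM public.new_applications;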

Function name         Description
logistic_reg          Training (two optimization techniques: BFGS and Newton)
summarize_model       Reading the model summary
predict_logistic_reg  Scoring

Linear regression

Linear regression is primarily used to predict continuous numerical outcomes that have a linear relationship with their predictors.

Using linear regression, you can model the linear relationship between independent variables, or features, and a dependent variable, or outcome. For example, linear regression models allow you to:
• Model residential home prices (response) as a function of the homes' features (predictors), such as living area, number of bedrooms, or number of bathrooms, etc.
• Model the demand for a service or good (response) based on its features (predictors)—for example, demand for different models of laptops based on monitor size, weight, price, operating system, etc.
• Determine the linear relationship between the compressive strength of concrete (response) and varying amounts of its components (predictors) like cement, slag, fly ash, water, superplasticizer, coarse aggregate, etc.

You can apply the linear regression algorithm using basic SQL syntax:

linear_reg ('model_name', 'input_table', 'response_column', 'predictor_columns')

The optional parameters for this function are columns to be excluded, optimizer type, epsilon value, and maximum number of iterations.
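As a brief sketch (a fuller, step-by-step walkthrough follows in the real-world implementation section), a home-price model along the lines of the first bullet above might look like this, with hypothetical table and column names:

-- Train a linear regression model of price on home features.
=> SELECT LINEAR_REG ('home_price_model', 'public.home_sales',
                      'price', 'living_area, bedrooms, bathrooms');

-- Predict prices for new listings with the trained model.
=> SELECT PREDICT_LINEAR_REG (living_area, bedrooms, bathrooms
                              USING PARAMETERS OWNER='dbadmin',
                                               MODEL_NAME='home_price_model')
   FROM public.new_listings;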

Function name       Description
linear_reg          Training (two optimization techniques: BFGS and Newton)
summarize_model     Reading the model summary
predict_linear_reg  Scoring
mse                 Evaluation (mean squared error)
rsquared            Evaluation (R-squared value)

Naïve Bayes

The Naïve Bayes algorithm is used to classify data. The algorithm uses different features as predictors to calculate the probability of a specific class or classes. For instance, to predict the probability that an email is spam, you would use words normally associated with spam. The same is true if you want to classify a document based on content—for example, news, finance, sport, etc. This supervised machine learning algorithm has several applications, including:
• Spam filtering
• Classifying documents
• Classifying images
• Predicting consumer habits

You can apply the Naïve Bayes algorithm using basic SQL syntax:

naive_bayes ('model_name', 'input_table', 'response_column', 'predictor_columns')

The optional parameters for this function are columns to be excluded and alpha value.
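To illustrate the spam-filtering case described above, here is a minimal sketch. The table and feature columns are hypothetical stand-ins for whatever word counts or flags you extract from each email:

-- Train a Naïve Bayes classifier on labeled emails (is_spam is 0/1).
=> SELECT NAIVE_BAYES ('spam_model', 'public.emails',
                       'is_spam', 'count_offer, count_free, has_attachment');

-- Classify incoming emails with the trained model.
=> SELECT PREDICT_NAIVE_BAYES (count_offer, count_free, has_attachment
                               USING PARAMETERS OWNER='dbadmin',
                                                MODEL_NAME='spam_model')
   FROM public.incoming_emails;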

Function name                Description
naive_bayes                  Training
summarize_model              Show the models for individual columns
predict_naive_bayes          Apply the model to new data
predict_naive_bayes_classes  Apply the model to new data, showing details for each class

Data preparation and normalization

Data preparation is an important pre-processing step for data analysis, often occupying most of the time it takes data scientists and analysts to complete a data analysis project. HPE Vertica has a rich set of built-in analytic functions, including interpolation, pattern matching, event series joins, advanced aggregation, outlier detection, and sequences. These functions support the data preparation process, increase the efficiency of data analysis work, and ultimately speed up time-to-value.

Data scientists can also use additional functions—such as data normalization—to prepare the data before training a machine learning model. The primary purpose of normalization is to bring numeric data on different scales onto an equivalent scale. Without such normalization, machine learning algorithms may perform poorly when the values of different features vary widely in magnitude. Vertica's machine learning for predictive analytics offers three normalization methods: MinMax, Robust Z-score, and Z-score.
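As an illustration, and assuming the NORMALIZE function signature described in the Vertica documentation (output view, input table, columns, method), MinMax normalization of two differently scaled columns of a hypothetical table might look like this:

-- Rescale two numeric columns to the [0, 1] range using MinMax.
=> SELECT NORMALIZE ('readings_normalized', 'public.sensor_readings',
                     'temperature, pressure', 'minmax');

-- The rescaled values are available through the output view.
=> SELECT temperature, pressure FROM readings_normalized;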

Real-world implementation of Vertica’s in-database machine learning features

The following is an example of a linear regression model that uses the Prestige Data Set [5]. This data set is used to determine the relationship between income and other variables. In this example, we will demonstrate how to load the data set and use Vertica's linear_reg function to better understand how the predictors impact the response.

The data set contains the following information:
• Name of occupation
• Education (in years)
• Income—Average income of incumbents in dollars in 1971
• Women—Percentage of incumbents who are women
• Prestige—The Pineo-Porter prestige score for the occupation, from a social survey conducted in the mid-1960s
• Census—Canadian Census occupational code
• Type—Occupational type, where bc indicates blue collar, wc indicates white collar, and prof indicates professional, managerial, or technical

The goal is to build a linear regression model using this data set that predicts income based on the other values in the data set and then evaluate the model’s goodness of fit. To begin, how do we choose which variables to use for this model? The type column can be eliminated because Vertica does not currently support categorical predictors. Since the occupation and census columns contain a lot of unique values, they are unlikely to predict income under this use case. That leaves the education, prestige, and women columns.

Note: In practice, if you need to use a categorical predictor when training a linear regression model in Vertica, convert it to a numerical value in advance. There are several techniques for converting a categorical variable to a numeric one; for example, you can use one-hot encoding.
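As a sketch of the one-hot approach in plain SQL (the view name and encoding are illustrative, not part of the original example), the type column could be expanded into numeric indicator columns like this:

-- Expand the categorical type column into 0/1 indicator columns.
=> CREATE VIEW prestige_encoded AS
   SELECT education, income, women, prestige,
          CASE WHEN type = 'prof' THEN 1 ELSE 0 END AS type_prof,
          CASE WHEN type = 'wc'   THEN 1 ELSE 0 END AS type_wc,
          CASE WHEN type = 'bc'   THEN 1 ELSE 0 END AS type_bc
   FROM public.prestige;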

Step 1. Load the data

The following shows the table definition to store the Prestige Data Set:

=> DROP TABLE IF EXISTS public.prestige CASCADE;
=> CREATE TABLE public.prestige (
       occupation VARCHAR(25),
       education  NUMERIC(5,2), -- avg years of education
       income     INTEGER,      -- avg income
       women      NUMERIC(5,2), -- % of women
       prestige   NUMERIC(5,2), -- avg prestige rating
       census     INTEGER,      -- occupation code
       type       CHAR(4)       -- professional & managerial (prof),
                                -- white collar (wc), blue collar (bc),
                                -- not available (na)
   );

To load the data from the Prestige Data Set into the Vertica table, use the following SQL statement:

=> COPY public.prestige
   FROM stdin
   DELIMITER ','
   SKIP 1
   ABORT ON ERROR
   DIRECT
   ;

Step 2. Create a linear regression model

Now let's use the Vertica machine learning function LINEAR_REG to create the linear regression model. To create the model, run the LINEAR_REG function against the public.prestige table as shown. In this statement, income is the response, and the predictors are education, women, and prestige:

=> SELECT LINEAR_REG ('prestige', 'public.prestige', 'income',
                      'education, women, prestige');

This statement identifies the coefficients of the following equation:

income = β0 + β1 × education + β2 × women + β3 × prestige

After you create the model, use the SUMMARIZE_MODEL function to observe the model's properties:

=> SELECT SUMMARIZE_MODEL ('prestige');

SUMMARIZE_MODEL returns the following information:

coeff names:  {Intercept, education, women, prestige}
coefficients: {-253.8390442, 177.1907572, -50.95063456, 141.463157}
p_value:      {0.83275, 0.37062, 4.1569e-08, 8.84315e-06}

Using these coefficients, you can rewrite the equation above to read:

income = -253.84 + 177.19 × education - 50.95 × women + 141.46 × prestige

Step 3. Evaluate goodness of fit

A common method used to test how well your linear regression model fits the observed data is the coefficient of determination, R-squared, defined as one minus the ratio of the residual sum of squares to the total sum of squares:

R² = 1 - (SS_res / SS_tot)

The coefficient of determination ranges between 0 (no fit) and 1 (perfect fit). To calculate it, use the Vertica RSQUARED function:

=> SELECT RSQUARED (income, predicted) OVER()
   FROM (SELECT income,
                PREDICT_LINEAR_REG (education, women, prestige
                                    USING PARAMETERS OWNER='dbadmin',
                                                     MODEL_NAME='prestige')
                AS predicted
         FROM public.prestige
        ) x;

In this query, the PREDICT_LINEAR_REG function applies the trained linear regression model to the input table; you can read more about this function in the Vertica documentation. The query returns the following result:

       Rsq        |          Comment
------------------+----------------------------
 0.63995924449805 | Of 102 rows, 102 were used

The evaluation of the coefficient of determination often depends on what area you are investigating. In the social sciences, a coefficient of 0.6 is considered quite good [6].

When evaluating a model, it is important to consider multiple metrics. A single metric could give you a good value, but the model itself may not be as useful as you need. It is important to understand the R-square value, as well as the other metrics, to evaluate the goodness of fit.
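For example, alongside R-squared you can compute the mean squared error with the MSE function listed earlier; here is a sketch mirroring the RSQUARED query above:

-- Compute the mean squared error of the model's predictions.
=> SELECT MSE (income, predicted) OVER()
   FROM (SELECT income,
                PREDICT_LINEAR_REG (education, women, prestige
                                    USING PARAMETERS OWNER='dbadmin',
                                                     MODEL_NAME='prestige')
                AS predicted
         FROM public.prestige
        ) x;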

5. "Prestige Data Set." Statistics Canada, Canada (1971) Census of Canada, Vol. 3, Part 6.
6. "Linear Models with R," second edition. Julian J. Faraway. CRC Press, 2014.

Resources
• Machine Learning for Predictive Analytics, Vertica documentation
• Machine Learning Functions, Vertica documentation
• Machine Learning for Predictive Analytics, Hands On Vertica

Use cases for machine learning across different industries

The in-database machine learning capabilities of the HPE Vertica Analytics Platform can be used to drive tangible business benefits across a broad range of industries.
• Financial services organizations can discover fraud, detect investment opportunities, identify clients with high-risk profiles, or determine the probability of an applicant defaulting on a loan.
• Government agencies, such as public safety and utilities, can use machine learning to help detect fraud, minimize identity theft, and analyze data from smart meters to identify ways to increase efficiency and save money.
• Communication service providers can leverage a variety of network probe and sensor data to analyze network performance, predict capacity constraints, and ensure quality service delivery to end customers.
• Marketing and sales organizations can use machine learning to analyze buying patterns, segment customers, personalize the shopping experience, and implement targeted marketing campaigns.
• Oil and gas organizations can leverage machine learning to analyze minerals to find new energy sources, streamline oil distribution for increased efficiency and cost-effectiveness, or predict mechanical or sensor failures for proactive maintenance.
• Transportation organizations can analyze trends and identify patterns that can be used to enhance customer service, optimize routes, and increase profitability.

Summary

The in-database machine learning capabilities of the HPE Vertica Analytics Platform allow users to take advantage of Big Data while simplifying and speeding up their predictive analytics processes to make better-informed decisions, compete more effectively, and accelerate time-to-insight. At the same time, the platform offers freedom of choice with support for user-defined extensions programmed in C++, Java, Python, or R.

To experience the benefits of HPE Vertica in your own environment:
• Watch the webcast Building, training & operationalizing predictive analytics models with Vertica.
• Read the blog posts What's new in Vertica Machine Learning 8.0? and Machine Learning with Vertica.
• Download a free version of the HPE Vertica Analytics Platform.

Learn more at vertica.com/


© Copyright 2017 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice. The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein. Google is a registered trademark of Google Inc. Java is a registered trademark of Oracle and/or its affiliates. All other third-party trademark(s) is/are property of their respective owner(s). a00000557ENW, March 2017