Google Cloud Dataprep by Trifacta
The Answer to Data Preparation on the Google Cloud Platform

The Problem: Challenges with Legacy Analytics

Data Management Inertia
Data lakes, data warehouses, and Machine Learning/Artificial Intelligence (ML/AI) applications have been historically expensive, slow to implement, and difficult to manage in on-premise architectures.

Cloud Computing Maturity
The rise of cost-effective, scalable data storage and elastic processing in the cloud has completely flipped the analytics paradigm. Modern cloud platforms with serverless automated data services now offer organizations a more efficient approach to analytics.

New Data-Led Challenges & Opportunities
With cloud adoption and mass digitalization, the volume and types of data are drastically evolving and becoming more complex, which in turn has made data engineering more challenging than ever. This influx of data has overburdened those capable of using traditional data management tools and has prevented them from responding to business demands in a timely manner. Yet at the same time, an exciting opportunity awaits: the profusion of data presents organizations the chance to outcompete their competitors by successfully curating differentiated, insight-rich data. There must be a way.

The Solution: The Google Cloud Alternative
Google Cloud offers an end-to-end, fully managed smart analytics suite that includes batch and real-time data ingestion, data storage and processing at scale, data preparation, reporting and dashboarding, and ML/AI applications.

Self-Service Data Preparation in the Cloud
BigQuery, Cloud Storage, Dataprep, Cloud Dataflow, AI Platform, Cloud AutoML, Looker, Data Studio, Cloud Functions, Cloud IAM, Cloud Data Catalog, Cloud Composer

Solving Data Preparation
There is one common challenge that every analytics initiative has to tackle. Clean, structured, and normalized data is needed to fuel trustworthy reports or accurate predictions.
In other words, self-service data preparation has become critical to solving this now well-known hurdle, which easily consumes up to 80% of any data project.

“It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data.”
- DJ Patil, Former Chief Data Scientist of the United States

What is Data Preparation?

DISCOVERING. What’s in the data? What kind of data is available? What are the most productive conclusions that we can deduce from this data? Discovering the content and structure of the data can have a huge impact on business outcomes. Discover features of the data and quickly determine the value of the datasets.

STRUCTURING. How can we change the form or schema of the data so that it can be incorporated into the analysis? How can we split columns, pivot rows, and delete fields without having to write code or build complex formulas?

CLEANING. How can we identify and remove invalid elements in the data that might compromise the trustworthiness of the analysis? Analysts devote a huge amount of time to remediating data inaccuracies, but bad data can still slip into models or analytics undetected. Identify data quality issues, such as mismatched values, null values, invalid formats, and inconsistent, incomplete, or inaccurate values, and apply the appropriate transformation to correct them or delete them from the dataset.

ENRICHING. Which additional data attributes would prove helpful in the analysis? Can new features be derived from existing data? The data needed to make business decisions is often spread across multiple files or databases. To gather the required context, one needs to enrich existing datasets by combining and aggregating them with numerous other data sources.

VALIDATING. Before the final analysis, are there any remaining data quality or consistency issues? How can we be sure that the transformations have correctly addressed all the data problems?
Every dataset deserves validation that ensures the right transformations have been performed. Users must validate their results at the end of the data pipeline.

“Google Cloud Dataprep by Trifacta has enabled several of our analysts and data stewards to automate complex data preparation routines, build large analytical data models, and work with files too large for Excel or Access. This allowed us to avoid piling more work on top of our heavily backlogged data engineering teams and more than doubled our development velocity.”
- Matt Bossemeyer, Director of Supply Chain IT Services & Analytics at Premier Inc.

Putting the Analyst in the Driver’s Seat
Today’s business decisions must happen fast and must be founded on trustworthy information. This high demand for data insight pressures data analysts and business analysts to produce accurate analyses with extremely quick turnaround. To move at today’s pace of business, these data professionals want to break free from time-consuming, IT-dependent processes and schema-rigid data warehouses so they can build and deliver agile BI analysis for both ad-hoc requests and recurring reporting. They want to be empowered with scalable, self-service analytic solutions to tackle any business data demands and automate their delivery chain.

Google Cloud Smart Analytics
With a comprehensive suite of data analytics tools that provides flexibility, scalability, collaboration, and built-in advanced analytics, the Google Cloud smart analytics suite offers just that. Users need just an email address to be up and running in minutes, leveraging Google Cloud Dataprep by Trifacta (along with the full analytics suite) to prepare data easily and at scale for analytics.

Google Cloud Dataprep by Trifacta
Google and Trifacta have partnered to offer Dataprep on the Google Cloud Platform, the sole data preparation solution available for Google Cloud.
Dataprep by Trifacta offers the unmatched Trifacta wrangling experience for Google Cloud customers.

Intelligent data preparation
Google Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning. Because Dataprep is serverless and works at any scale, there’s no infrastructure to deploy or manage. The next ideal data transformation is suggested and predicted with each UI input, so you don’t have to write code.

“Dataprep by Trifacta is incredibly user friendly and the machine learning suggestions help us reduce a big chunk of the labor-intensive process of data wrangling, so we can analyze and mine food purchasing data across 46 states, 521 distributors, and 1,000 producer groups representing over 100,000 farms and vendors to reimagine local and sustainable sourcing and regional supply chains.”
- Linda Mallers, President and CEO, Farmlogix

Serverless simplicity
Google Cloud Dataprep is an integrated partner service operated by Trifacta and based on Trifacta’s industry-leading data preparation solution. Google works closely with Trifacta to provide a seamless user experience that removes the need for up-front software installation or ongoing operational overhead. Dataprep is fully managed and scales on demand to meet growing data preparation needs, so users can stay focused on analysis.

Dataprep is available on the Google Cloud console and adheres to the same consumption, invoicing, and security principles as other Google Cloud services in order to offer a seamless Google Cloud experience.

Fast exploration and anomaly detection
Understand and explore data instantly with visual data distributions. Dataprep automatically detects schemas, data types, possible joins, and anomalies, such as missing values, outliers, and duplicates. It allows analysts to skip the time-consuming work of assessing data quality and move ahead to data exploration and analysis.
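
To make the three anomaly types named above concrete (missing values, outliers, and duplicates), here is a minimal plain-Python sketch of a column check. It is an illustration only and does not reflect how Dataprep works internally: the function name `profile_column` is hypothetical, treating `None` and blank strings as missing is an assumption, and the 1.5 x interquartile-range outlier rule is a common convention chosen here for simplicity.

```python
from collections import Counter
from statistics import quantiles


def profile_column(values):
    """Flag missing values, duplicates, and numeric outliers in one column.

    Hypothetical helper for illustration; not part of any Dataprep API.
    """
    # Assumption: None and blank strings count as missing.
    missing = [i for i, v in enumerate(values)
               if v is None or (isinstance(v, str) and not v.strip())]
    missing_set = set(missing)
    present = [(i, v) for i, v in enumerate(values) if i not in missing_set]

    # Values seen more than once, in first-seen order.
    counts = Counter(v for _, v in present)
    duplicates = [v for v, n in counts.items() if n > 1]

    # Outliers: numeric values outside 1.5 x IQR of the quartiles.
    # Skipped for very small columns, where quartiles are not meaningful.
    numeric = [(i, v) for i, v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    outliers = []
    if len(numeric) >= 4:
        q1, _, q3 = quantiles([v for _, v in numeric], n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        outliers = [i for i, v in numeric if v < low or v > high]

    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}


if __name__ == "__main__":
    # Row positions are reported for missing values and outliers.
    print(profile_column([10, 12, 11, None, 12, 13, 11, 500, ""]))
```

The sketch reports row positions for missing values and outliers and the repeated values themselves for duplicates, which is roughly the information an analyst would otherwise collect by hand before deciding how to clean a column.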

Easy and powerful data preparation
With each gesture in the UI, Dataprep automatically suggests and predicts the next ideal data transformation. Once the sequence of data transformations is defined, Dataprep uses Cloud Dataflow and BigQuery under the hood, enabling users to process structured or unstructured datasets of any size with clicks, not code.

Reference Architecture

“Dataprep is an intelligent data service that allows users to visually explore, clean and interactively prepare their data. We selected Trifacta to help power this new service because it was incredibly advanced, super intuitive for people to use immediately, and had a cloud architecture that integrated naturally with Google Cloud Platform.”
- Brian Stevens, CTO, Google Cloud

“Early on, we recognized the benefits of building on top of Google Cloud Platform and a Cloud Data Warehouse, but needed a technology that would allow our data analysts to manage their own data pipelines instead of strictly relying on our data engineering team. Moving to Google Cloud Dataprep by Trifacta Premium, our analysts have more autonomy to manage and prepare their data than ever before, whether that’s leveraging Google Analytics data stored in BigQuery or migrating new data from or into Salesforce, which leads to enormous productivity gains.”
- Thibaut Gadiolet, Product Manager of Data Platforms, HomeServe

Ready to Learn More?
With seamless data preparation across any cloud, hybrid or multi-cloud environment, Google Cloud Dataprep by Trifacta is the ideal self-service data preparation solution for the Google Cloud Platform.

575 Market St, 11th Floor, San Francisco, CA 94105
1 844 332 2821
www.trifacta.com
Follow Trifacta: @Trifacta