State of Data Infrastructure and Analytics Report 2021
The State of Data Infrastructure & Analytics Report is a collection of the top technological trends and tools that have dominated the modern data infrastructure landscape over the past year.

As an early-stage investor at Work-Bench, I've been keeping an active pulse on the data ecosystem by speaking with data practitioners and startup founders who are laying the groundwork in the space, as well as corporate executives from our network, to develop my understanding of the top challenges and opportunities in data.

This report is a culmination of these conversations and illustrates the next generation of tools that we think are best poised to tackle the next phase in data infrastructure.

Additionally, listen to the recording of The State of Data Infrastructure & Analytics Webinar, where I walk through the report and dive a layer deeper.

AUTHORED BY: PRIYANKA SOMRAH, Analyst
[email protected]

ABOUT WORK-BENCH
Work-Bench is an enterprise-focused venture capital firm based in New York City. Work-Bench was founded with a unique, thesis-driven approach that flips traditional enterprise venture capital upside down by first validating Fortune 500 IT pain points through our extensive corporate network and then investing in the most promising companies addressing these challenges.
www.work-bench.com | @Work_Bench | [email protected]

TABLE OF CONTENTS
The Rise of Data Engineering
Trends Shaping The Modern Data Stack
Data Catalog, Lineage & Discovery
The Last Mile Problem in ETL
The Rise of Operational Analytics
Augmenting Stream & Batch Processing

THE RISE OF DATA ENGINEERING & WHAT IT MEANS FOR THE MODERN DATA STACK

THE RISE OF DATA ENGINEERING IS BRIDGING THE TOOLING GAP IN THE MODERN DATA ECOSYSTEM

Today, there are more data engineering jobs than ever before!

[Figure: 28,000+ and 170,000+ open data engineering jobs listed on two job platforms]

We are on the cusp of a new era in data where innovation in data engineering is shaping the future of data-driven organizations and their initiatives. While it's still an emerging discipline, data engineering holds the promise of bridging the workflow gap between business intelligence, analytics, data science, and engineering functions. This represents a critical step, largely driven by advancements in tooling and best practices that empower different teams across the data ecosystem to work collaboratively.

DATA ENGINEERS, SCIENTISTS & ANALYSTS ARE STUCK IN WORKFLOWS THAT DON'T ALWAYS PERTAIN TO THEIR CORE FUNCTIONS

Today, data engineers, data scientists, and analysts live in disconnected workflows and tools. There is a serious lack of cohesion between these functions, despite the fact that they are all working towards the same business initiatives. For example, many data scientists are still building and maintaining their own data pipelines when they really should be building machine learning models and exploring data for analysis. This Twitter poll by Vicki Boykis captures this sentiment.

Data engineers are also getting pulled in different directions, catering to the ad-hoc needs of data scientists and analysts. Zhamak Dehghani, Director of Next Tech Incubation at ThoughtWorks, puts it best:

"I personally don't envy the life of a data platform engineer. They need to consume data from teams who have no incentive in providing meaningful, truthful and correct data. They have very little understanding of the source domains that generate the data and lack the domain expertise in their teams. They need to provide data for a diverse set of needs, operational or analytical, without a clear understanding of the application of the data and access to the consuming domain's experts."

THE MODERN DATA STACK: THE RISE OF THE CLOUD DATA WAREHOUSE

The data infrastructure layer has improved significantly over recent years.
During that time we've seen the rise of Snowflake as the data warehouse built for the cloud-first world. What differentiates Snowflake from legacy warehouses is that it is designed as a multi-cluster, shared-data architecture in which storage, compute, and services exist as decoupled layers. This is important because it enables compute and storage to be scaled independently to near-infinite limits, giving data teams greater agility in their handling of massive workloads.

Built on top of the major cloud providers, Snowflake leverages a SQL database and query engine. This enables data to be ingested from a variety of sources, and it allows anybody who knows SQL to query data directly for their own analytical purposes. As such, Snowflake created a new standard for querying, loading, and storing data. Most importantly, it created a more coherent way of centralizing data into a single source of truth, accessible by all (given the right permissions).

THE MODERN DATA STACK: THE EVOLUTION OF ETL TO EL(T)

The data ingestion and transformation layers have also gotten better over time. In fact, it's the rise of the cloud data warehouse (e.g. Snowflake, BigQuery, Redshift) and the data lake / lakehouse (e.g. Databricks) that spurred the shift from ETL (Extract-Transform-Load) to ELT (Extract-Load-Transform) in data pipelines and introduced a new way of transforming data at scale. Owing to the decoupling of storage and compute, data is now loaded in bulk and transformed right within the warehouse.

Tools like Fivetran, Stitch Data, Airbyte, and Datalogue, among others, have made it easier to pull data from multiple sources and load it into the warehouse with much greater reliability. Fivetran, for instance, serves analysts with data flowing from disparate sources through fully automated connectors.
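The ELT pattern described above can be sketched in a few lines. This is a minimal illustration, not any vendor's API: a local SQLite database stands in for the cloud warehouse, and all table and column names are hypothetical. Raw data is loaded first, untransformed; the modeling step then happens inside the "warehouse" itself, in SQL.

```python
import sqlite3

# A local SQLite database stands in for the cloud data warehouse.
conn = sqlite3.connect(":memory:")

# Extract + Load: raw records land in the warehouse as-is.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1999, "complete"), (2, 500, "refunded"), (3, 7450, "complete")],
)

# Transform: modeling happens after loading, in SQL, right where the data lives.
conn.execute("""
    CREATE TABLE orders AS
    SELECT order_id,
           amount_cents / 100.0 AS amount_usd,
           status
    FROM raw_orders
    WHERE status = 'complete'
""")

rows = conn.execute(
    "SELECT order_id, amount_usd FROM orders ORDER BY order_id"
).fetchall()
print(rows)  # [(1, 19.99), (3, 74.5)]
```

The key inversion versus classic ETL is visible here: no transformation logic runs before the load, so analysts who only know SQL can own the modeling step, which is essentially what dbt productizes.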
As for dbt, it is slowly becoming the de facto tool for data modeling, allowing data analysts to transform data right within the warehouse through a simple SQL interface without having to rely on engineers.

THE MODERN DATA STACK: WHAT'S NEXT?

Together, advancements in the data warehousing, ingestion, and transformation layers have paved the way for the data warehouse to become one of the most critical components in the data stack. From Snowflake to Fivetran to dbt, these tools have changed the way in which data and business teams collaborate with one another.

As companies keep centralizing all of their data processing into the warehouse, new opportunities will emerge for business teams to consume data directly out of the warehouse in a truly autonomous fashion. A lot of the engineering work that usually goes into serving data across business teams will be abstracted away by more robust tooling. Just as dbt opened up the data transformation layer to analysts, we are going to see more of that philosophy carry over into the next phase of the Modern Data Stack, where data and business teams will work with one another in a more symbiotic way.

TRENDS SHAPING THE MODERN DATA STACK

TREND 1 | DATA CATALOG, LINEAGE & DISCOVERY
TREND 2 | THE LAST MILE PROBLEM IN ETL
TREND 3 | THE RISE OF OPERATIONAL ANALYTICS
TREND 4 | AUGMENTING STREAM & BATCH PROCESSING

DATA CATALOG, LINEAGE & DISCOVERY

THE EVOLUTION OF DATA OPERATIONS & MANAGEMENT IN THE DATA ECOSYSTEM

There has been explosive growth in the number of data infrastructure startups that have emerged in recent years, with data operations and data management being two of the most popular categories.
AS MORE AND MORE DATA IS BEING PRODUCED, THE NEED FOR A CENTRALIZED CATALOG BECOMES MORE CRITICAL

The rise of data warehousing, ingestion, and transformation solutions has enabled users to work with massive datasets stored across disparate sources with greater ease. Yet many large enterprises that have been collecting data over the years were never set up from the start to make that data broadly accessible. In a lot of organizations today, data teams still lack access to a centralized catalog. This not only limits their access to the right datasets, but it also limits their understanding of the provenance of the metadata their reports are built on, making it hard for them to trust the data they are working with.

THE EMERGENCE OF METADATA-DRIVEN LINEAGE & CATALOG TOOLS

Over the past few months, data-driven organizations, especially the big tech companies, have developed their own internal tools to reduce the friction of converting data into actionable insights and to improve the productivity of data users. These tools are powered by metadata, which helps users understand their datasets and the jobs that read from and write to them. They help answer questions such as:

• What is the schema of this dataset?
• What is the provenance of the data?
• Who owns and produces this job?
• What is the business context of this data?
• What is the quality of the data?
• And more.

SELECT OPEN SOURCE METADATA-DRIVEN TOOLS POWERING DATA DISCOVERY, GOVERNANCE, QUALITY & MORE

These tools serve as the foundational layer for data management and help improve internal operational efficiency.

• Amundsen (Lyft) — a metadata engine built by Lyft that provisions programmatic access to data. It facilitates the discovery of datasets based on their relevance and popularity across the organization.
• DataHub (LinkedIn) — created by LinkedIn to support both online and offline analyses and use cases such as search/discovery, data privacy, governance, auditability, and more.
• Metacat (Netflix) — Netflix's open source data tool that facilitates the discovery and management of data at scale. It acts as an access layer for metadata from Netflix's data sources into Netflix's Big Data platform.
• Databook (Uber) — developed by Uber, Databook leverages an automated approach to data exploration. Instead of fetching data in real time, it captures metadata right in its architecture, enabling periodic crawls of the data.
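The questions these catalog tools answer (schema, provenance, ownership, business context, quality) map naturally onto the fields of a catalog entry. The sketch below is purely illustrative: the class and field names are hypothetical and do not reflect the actual schema of Amundsen, DataHub, Metacat, or Databook.

```python
from dataclasses import dataclass, field

# Hypothetical catalog entry; field names are illustrative only and
# are not taken from any of the tools named above.
@dataclass
class DatasetMetadata:
    name: str                        # fully qualified dataset name
    schema: dict                     # column name -> column type
    owner: str                       # team or person producing the data
    upstream: list = field(default_factory=list)  # provenance: source datasets
    description: str = ""            # business context
    quality_checks_passed: bool = False

entry = DatasetMetadata(
    name="warehouse.analytics.orders",
    schema={"order_id": "INTEGER", "amount_usd": "FLOAT"},
    owner="data-platform-team",
    upstream=["warehouse.raw.raw_orders"],
    description="Completed customer orders, modeled for reporting.",
    quality_checks_passed=True,
)
print(entry.owner, entry.upstream)
```

Each bulleted question corresponds to one field: the `upstream` list answers provenance, `owner` answers ownership, and so on; a catalog is, at its core, a searchable store of records shaped like this.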