THE STATE OF DATA INFRASTRUCTURE & ANALYTICS REPORT 2021

Work-Bench is an enterprise-focused venture capital firm based in New York City. Work-Bench was founded with a unique, thesis-driven approach that flips traditional enterprise venture capital upside down by first validating Fortune 500 IT pain points through our extensive corporate network and then investing in the most promising companies addressing these challenges.

The State of Data Infrastructure & Analytics Report is a collection of the top technological trends and tools that have been dominating the modern data infrastructure landscape over the past year.

As an early stage investor at Work-Bench, I've been keeping an active pulse on the data ecosystem by speaking with data practitioners and startup founders who are laying the groundwork in the space, as well as corporate executives from our network, to develop my understanding of the top challenges and opportunities in data.

www.work-bench.com
@Work_Bench

[email protected]

This report is a culmination of these conversations and illustrates the next generation of tools that we think are best poised to tackle the next phase in data infrastructure.

Additionally, listen to the recording of The State of Data Infrastructure & Analytics Webinar, where I walk through the report and dive a layer deeper.

AUTHORED BY: PRIYANKA SOMRAH, Analyst
[email protected]

TABLE OF CONTENTS

The Rise of Data Engineering

Trends Shaping The Modern Data Stack

Data Catalog, Lineage & Discovery

The Last Mile Problem in ETL

The Rise of Operational Analytics

Augmenting Stream & Batch Processing

THE RISE OF DATA ENGINEERING & WHAT IT MEANS FOR THE MODERN DATA STACK

THE RISE OF DATA ENGINEERING IS BRIDGING THE TOOLING GAP IN THE MODERN DATA ECOSYSTEM

Today, there are more data engineering jobs than ever before! We are on the cusp of a new era in data where innovation in data engineering is shaping the future of data-driven organizations and their initiatives.

While it's still an emerging discipline, data engineering holds the promise of bridging the workflow gap between business intelligence, analytics, data science, and engineering functions. This represents a critical step, largely driven by advancements in tooling and best practices that empower different teams across the data ecosystem to work collaboratively.

[Figure: 28,000+ and 170,000+ open data engineering jobs listed on two major job platforms]

DATA ENGINEERS, SCIENTISTS & ANALYSTS ARE STUCK IN WORKFLOWS THAT DON'T ALWAYS PERTAIN TO THEIR CORE FUNCTIONS

Today, data engineers, data scientists, and analysts live in disconnected workflows and tools. There is a serious lack of cohesion between these functions, despite the fact that they are all working towards the same business initiatives.

For example, many data scientists are still building and maintaining their own data pipelines when they really should be building models and exploring data for analysis. This Twitter poll by Vicki Boykis captures this sentiment.

Data engineers are also getting pulled in different directions, catering to the ad-hoc needs of the data scientists and analysts.

Zhamak Dehghani, Director of Next Tech Incubation at ThoughtWorks, puts it best:

“I personally don’t envy the life of a data platform engineer. They need to consume data from teams who have no incentive in providing meaningful, truthful and correct data. They have very little understanding of the source domains that generate the data and lack the domain expertise in their teams. They need to provide data for a diverse set of needs, operational or analytical, without a clear understanding of the application of the data and access to the consuming domain’s experts.”

THE MODERN DATA STACK: THE RISE OF THE CLOUD DATA WAREHOUSE

The data infrastructure layer has improved significantly in recent years. During that time, we've seen the rise of Snowflake as the data warehouse built for a cloud-first world.

What differentiates Snowflake from legacy warehouses is its multi-cluster, shared-data architecture, in which storage, compute, and services exist as decoupled layers. This is important because it enables compute and storage to be scaled independently to near-infinite limits, giving data teams greater agility in handling massive workloads.

Built around the cloud providers, Snowflake leverages a SQL database and query engine. This enables data to be ingested from a variety of sources, and it allows anybody who knows SQL to query the data directly for their own analytical purposes.

As such, Snowflake created a new standard for querying, loading, and storing data. Most importantly, it created a more coherent way of centralizing data into a single source of truth, accessible by all (given the right permissions).
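To make this concrete, here is a minimal sketch of querying a cloud warehouse directly with SQL from Python, using the open source snowflake-connector-python package. The account identifier, credentials, and table name below are placeholders, not real resources.

```python
# Minimal sketch: querying Snowflake with plain SQL from Python.
# Uses snowflake-connector-python; the account, credentials, warehouse,
# and table names are placeholders for illustration only.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # hypothetical account identifier
    user="analyst",
    password="********",
    warehouse="ANALYTICS_WH",    # compute, scaled separately from storage
    database="ANALYTICS",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # Anyone who knows SQL can pull data straight out of the warehouse.
    cur.execute(
        """
        SELECT order_date, SUM(amount) AS daily_revenue
        FROM orders
        GROUP BY order_date
        ORDER BY order_date
        """
    )
    for order_date, daily_revenue in cur.fetchall():
        print(order_date, daily_revenue)
finally:
    conn.close()
```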

THE MODERN DATA STACK: THE EVOLUTION OF ETL TO EL(T)

The data ingestion and transformation layers have also gotten better over time. In fact, it's the rise of the cloud data warehouse (e.g. Snowflake, BigQuery, Redshift) and data lake / lakehouse (e.g. Databricks) that spurred the shift from ETL (Extract-Transform-Load) to ELT (Extract-Load-Transform) in data pipelines and introduced a new way of transforming data at scale.

Owing to the decoupling of storage and compute, data is now loaded in bulk and transformed right within the warehouse. Tools like Fivetran, Stitch Data, Airbyte, and Datalogue, among others, have made it easier to pull data from multiple sources and load it into the warehouse with much greater reliability. Fivetran, for instance, serves analysts with data flowing from disparate sources through fully automated connectors. dbt, meanwhile, is slowly becoming the de facto tool for data modeling by allowing data analysts to transform data right within the warehouse through a simple SQL interface, without having to rely on engineers.
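As an illustrative sketch of the EL(T) pattern (not any vendor's actual connector code), raw records are extracted from a source, loaded into a staging table untouched, and only then transformed with SQL using the warehouse's own compute. The API URL, table names, connection object, and parameter style are all hypothetical.

```python
# Illustrative EL(T) sketch: extract raw records, load them as-is into a
# staging table, then transform with SQL inside the warehouse. All names,
# the connection object, and the '%s' paramstyle are hypothetical.
import json
import urllib.request

def extract(url: str) -> list[dict]:
    """E: pull raw records from a source system (here, a JSON API)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def load(conn, records: list[dict]) -> None:
    """L: land the raw records in a staging table, untransformed."""
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO raw_orders (order_date, amount) VALUES (%s, %s)",
        [(r["order_date"], r["amount"]) for r in records],
    )
    conn.commit()

def transform(conn) -> None:
    """T: reshape the data with SQL, using the warehouse's compute
    (this is the step that tools like dbt manage and version)."""
    conn.cursor().execute(
        """
        CREATE OR REPLACE TABLE daily_revenue AS
        SELECT order_date, SUM(amount) AS revenue
        FROM raw_orders
        GROUP BY order_date
        """
    )
    conn.commit()
```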

THE MODERN DATA STACK: WHAT'S NEXT?

Together, advancements in the data warehousing, ingestion, and transformation layers have made the data warehouse one of the most critical components in the data stack. From Snowflake to Fivetran to dbt, these tools have changed the way in which data and business teams collaborate with one another.

As companies keep centralizing all of their data processing into the warehouse, new opportunities will emerge for business teams to consume data directly out of the warehouse in a truly autonomous fashion.

A lot of the engineering work that usually goes into serving data across business teams will be abstracted away through more robust tooling. Just as dbt opened up the data transformation layer to analysts, we are going to see that philosophy carry over into the next phase of the Modern Data Stack, where data and business teams will work with one another in a more symbiotic way.

TRENDS SHAPING THE MODERN DATA STACK

TREND 1 | DATA CATALOG, LINEAGE & DISCOVERY

TREND 2 | THE LAST MILE PROBLEM IN ETL

TREND 3 | THE RISE OF OPERATIONAL ANALYTICS

TREND 4 | AUGMENTING STREAM & BATCH PROCESSING

DATA CATALOG, LINEAGE & DISCOVERY

THE EVOLUTION OF DATA OPERATIONS & MANAGEMENT IN THE DATA ECOSYSTEM

There has been explosive growth in the number of data infrastructure startups emerging in recent years, with data operations and data management being two of the most popular categories.

AS MORE AND MORE DATA IS BEING PRODUCED, THE NEED FOR A CENTRALIZED CATALOG BECOMES MORE CRITICAL

The rise of data warehousing, ingestion, and transformation solutions has enabled users to work with massive datasets stored across disparate sources with greater ease.

Yet large enterprises that have been collecting data for years weren't built from the start to make that data broadly accessible. In many organizations today, data teams still lack access to a centralized catalog.

This not only limits their access to the right datasets, but it also limits their understanding of the provenance of the metadata which their reports are built on, making it hard for them to trust the data they are working with.

THE EMERGENCE OF METADATA-DRIVEN LINEAGE & CATALOG TOOLS

Over the past few months, data-driven organizations, especially the big tech companies, have developed their own internal tools to reduce friction in converting data into actionable insights and to improve the productivity of data users. These tools are powered by metadata, which helps users understand their datasets and the jobs that read from and write to them. They help answer questions such as:

• What is the schema of this dataset?
• What is the provenance of the data?
• Who owns and produces this job?
• What is the business context of this data?
• What is the quality of the data?
• And more.
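To make the shape of this metadata concrete, here is a minimal, hypothetical record a catalog might keep for each dataset. The fields mirror the questions above and are illustrative only; this is not the schema of Amundsen, DataHub, or any other specific tool.

```python
# A minimal, hypothetical metadata record a catalog might keep per dataset.
# Illustrative only; not the schema of any specific catalog tool.
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    name: str                          # e.g. "analytics.daily_revenue"
    schema: dict[str, str]             # column name -> type
    upstream_sources: list[str]        # provenance: where the data comes from
    owner: str                         # team or person producing the job
    business_context: str              # what the dataset is used for
    quality_checks: dict[str, bool] = field(default_factory=dict)  # check -> passed?

example = DatasetMetadata(
    name="analytics.daily_revenue",
    schema={"order_date": "date", "revenue": "number"},
    upstream_sources=["raw_orders"],
    owner="data-platform-team",
    business_context="Daily revenue rollup powering the finance dashboard",
    quality_checks={"not_null:order_date": True, "unique:order_date": True},
)
```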

SELECT OPEN SOURCE METADATA-DRIVEN TOOLS POWERING DATA DISCOVERY, GOVERNANCE, QUALITY & MORE

These tools serve as the foundational layer for data management and help improve internal operational efficiency.

ORGANIZATION | INTERNAL TOOL / SOLUTION | DESCRIPTION

Amundsen is a metadata engine built by Lyft that provisions programmatic access to data. It facilitates the discovery of datasets based on their relevance and popularity across the organization.

DataHub was created by LinkedIn to support both online and offline analyses and use cases such as search/ discovery, data privacy, governance, auditability, and more.

Metacat is Netflix’s open source data tool that facilitates the discovery and management of data at scale. It acts as an access layer for metadata from Netflix’s data sources into Netflix’s Big Data platform.

Databook, which was developed by Uber, leverages an automated approach to data exploration. Instead of fetching data in real time, it captures metadata right in its architecture, enabling periodic crawls of the data.

Marquez was created by WeWork to collect metadata lineage and track data dependencies. It provides RESTful APIs that integrate with systems such as Airflow, Amundsen, and Dagster.

Google's Data Catalog is a scalable metadata management service, powered by Google's search technology, that offers an auto-tagging mechanism for sensitive data.

Dataportal is Airbnb's self-service system that centralizes tribal knowledge and employee-centric data to provide transparency across data assets.

ADDITIONAL RESOURCES TO GET SMARTER ON THE TOPIC

TECHNICAL BLOG POSTS
• Data Catalogue — Knowing Your Data, by Albert Franzi
• Data Discovery Platforms and Their Open Source Solutions, by Eugene Yan
• Data Catalogs Are Dead; Long Live Data Discovery, by Barr Moses
• Metadata Management: A Strategic Imperative for Data Management, by Amnon Drori

INSTRUCTIONAL VIDEOS
• Amundsen: A Data Discovery Platform From Lyft, by Tao Feng and Jin Hyuk Chang
• Solving Data Lineage Tracking And Data Discovery At WeWork, by Julien Le Dem and Willy Lulciuc

PEOPLE TO FOLLOW
• Philip Dutton, Co-founder and Co-CEO of Solidatus
• Shirshanka Das, Software engineer at LinkedIn and co-creator of DataHub, Apache Helix, Gobblin, and more
• Julien Le Dem, Co-founder and CTO of Datakin, former Principal Engineer at WeWork, co-creator of Apache Parquet, Arrow, and more

THE LAST MILE PROBLEM IN ETL

THE LAST MILE OF THE ETL FRAMEWORK

[Diagram: data sources (operational systems and flat files) feed the data flow (extract, transform, and load into the data warehouse), which in turn feeds data modeling and analytics.]

THE LAST MILE PROBLEM IN ETL REFERS TO THE CHALLENGES AROUND DATA GENERATION & CONSUMPTION

DATA GENERATION
This process is traditionally owned by the engineers and revolves around cleaning, validating, and transforming raw data into formats suited for end-user consumption.

DATA CONSUMPTION
It's the point at which clean data from the pipelines is consumed by downstream users who leverage business intelligence tools to convert data into meaningful business outcomes. This final step in the analytics chain is the most important one because it is where data can be effectively used to generate actionable insights.


Data consumption leads to a new set of challenges around the way people collect and share metrics in an organization, which amplifies the agility problem that already exists throughout the data lifecycle: multiple stakeholders are using different cuts of data and storing their own slices in notebooks or Excel spreadsheets on their laptops, which then become the source of the metrics they report out on.

Since current practices around streamlining the creation of metrics are lacking, analysts across the same organization end up collecting metrics in different ways, which leads to confusion and disagreement about how a metric is defined or how to handle an outlier in the data.

THE LAST MILE PROBLEM: FEW ORGANIZATIONS HAVE THE DISCIPLINE TO EFFECTIVELY MAINTAIN GOOD DATA HYGIENE

Metrics are hard to find, and aren't usable across tools
Metrics reporting is largely fragmented, siloed, and inconsistent because curating and sharing tribal knowledge does not scale well as pipelines get more complex. Since most product teams may not be proficient in SQL or have the engineering skills to build and maintain pipelines, they often have to rely on data scientists to query metrics stored across multiple data stores.

Traditional BI tools and data dashboards are not always actionable
Over the past few years, we've seen the rapid adoption of data dashboards that have been driving business value through aesthetically pleasing, interactive graphs that capture high-level views of the data. While dashboards are great tools to visualize snapshots of data, they generally lack the underlying context around the data to make them truly actionable and trustworthy.

THE LAST MILE SOLUTION: CENTRALIZED METRICS STORE

The purpose of a centralized metrics repository is to create a single source of truth for metric definitions, where key metrics are defined in one place only and are reusable across different tools. This solves the challenge of inconsistent metric definitions by capturing metadata and lineage, which enables users to understand the provenance of each metric and how it is calculated, and it frees data scientists from pipeline management.

TOOL DESCRIPTION

Transform Data is a shared data interface that captures the end state of the data and turns it into a metric entity. Through its centralized repository, Transform Data enables data lifecycle management, powering anomaly detection, A/B testing and experiments.

Other tools include Looker (with LookML) and SQL-based modeling tools such as dbt and Dataform.
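To make the idea concrete, here is a generic sketch of a metrics store, not the actual interface of Transform, dbt, or LookML: each metric is defined once, with its lineage and calculation, and every downstream tool asks the store for it instead of re-deriving its own version.

```python
# Generic sketch of a centralized metrics store. Illustrative only; this is
# not the actual interface of Transform, dbt, or LookML. All names are made up.
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    name: str
    owner: str
    source_table: str      # lineage: where the metric comes from
    sql_expression: str    # how the metric is calculated
    grain: str             # e.g. "day"

METRICS = {
    "daily_revenue": MetricDefinition(
        name="daily_revenue",
        owner="analytics-team",
        source_table="analytics.orders",
        sql_expression="SUM(amount)",
        grain="day",
    ),
}

def compile_metric_query(metric_name: str) -> str:
    """Every consumer (BI tool, notebook, spreadsheet export) calls this,
    so the metric is computed the same way everywhere."""
    m = METRICS[metric_name]
    return (
        f"SELECT DATE_TRUNC('{m.grain}', order_date) AS {m.grain}, "
        f"{m.sql_expression} AS {m.name} "
        f"FROM {m.source_table} GROUP BY 1"
    )

print(compile_metric_query("daily_revenue"))
```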

THE LAST MILE SOLUTION: COMPUTATIONAL DATA NOTEBOOKS

While next-generation business intelligence and analytics tools like Preset, Metabase, and Redash have gained a lot of steam for democratizing analytics and simplifying data access for non-technical users, data notebooks offer a different set of benefits:

• They are process-oriented (authors can comment directly while working with their code so that everyone else can understand and validate the process).

• They are easily accessible (as long as the users can code in the language in which the tool is written, they can query data directly from their notebooks).

• They generate better visualizations (unlike static visualizations, data notebooks enable users to dive in and play around with the results, interactively).


Computational notebooks such as Jupyter and Apache Zeppelin have emerged as great tools for data science and product teams to collaborate on. They enable users to create and share live code, graphics, and visualizations, and they function as interactive IDEs through which users execute code and generate meaningful insights in real time.

TOOL DESCRIPTION

Hex is an analytics workspace that enables data teams to analyze data in SQL and Python-powered notebooks and turn their work into shareable data applications.

Deepnote is a data science notebook that enables data scientists to collaborate in real-time on AI and ML projects.

Other tools include Count, Noteable, Observable, Kaggle, Polynote, and Mode.
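For a flavor of this workflow, here is the kind of single cell a user might run in a notebook environment such as Jupyter, Hex, or Deepnote: load data, explore it, and render a chart inline. The CSV path and column names are made up for illustration.

```python
# The kind of cell a notebook user might run: load data, explore it, and plot
# the result inline. The CSV path and column names are made up for illustration.
import pandas as pd
import matplotlib.pyplot as plt

orders = pd.read_csv("daily_revenue.csv", parse_dates=["order_date"])

# Explore interactively: tweak the resample window and re-run the cell.
weekly = orders.set_index("order_date")["revenue"].resample("W").sum()

weekly.plot(kind="line", title="Weekly revenue")
plt.show()
```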

ADDITIONAL RESOURCES TO GET SMARTER ON THE TOPIC

TECHNICAL BLOG POSTS
• An Island of Truth: Practical Data Advice from Airbnb, by James Mayfield
• Data Analyst 3.0: The Next Evolution of Data Workflows, by Sid Sharma
• Dashboards are Dead, by Taylor Brownlow
• Beyond Interactive: Empowering Everyone with Jupyter Notebooks, by Michelle Ufford

INSTRUCTIONAL VIDEOS
• How Reporting and Experimentation Fuel Product Innovation at LinkedIn, by Kapil Surlaker
• Democratizing Metric Definition and Discovery at Airbnb, by Lauren Chircus

PEOPLE TO FOLLOW
• Nick Handel, Co-founder and CEO of Transform Data
• Barry McCardel, Co-founder and CEO of Hex Technologies
• Anthony Goldbloom, Co-founder and CEO of Kaggle

THE RISE OF OPERATIONAL ANALYTICS

POWERING REAL-TIME DECISION MAKING THROUGH OPERATIONAL ANALYTICS

Up to this point, the distinct layers that make up the Modern Data Stack, across data extraction and loading (Segment, Fivetran, Stitch Data, Datalogue, etc.), transformation (dbt), and cloud data warehousing (Snowflake, BigQuery, Redshift), are fairly consolidated.

Given the prominence of the cloud data warehouse as the single source of truth for data, as more and more companies centralize their data into the warehouse, the opportunity to leverage this data in real time and unlock value from it becomes more apparent.

Ways in which data from the warehouse / lake is served over to end consumers:

• Business intelligence tools: Data is routed to BI tools such as Looker, Tableau, Mode Analytics, etc., where BI users use it to build out reports, dashboards, and visualizations.

• Third-party applications: Data is routed to third-party operational apps such as Salesforce, Zendesk, HubSpot, Marketo, etc., where it is leveraged to drive specific business actions, many of which need to be executed in near real-time (e.g. responding to customer service requests on Zendesk and HubSpot, tracking access to sensitive data stored in Salesforce, etc.).

• Data science applications: Data is fed into data science and machine learning apps where data scientists leverage the data for their own reporting and analytics purposes.


But what if we could give business teams superpowers to slice and dice their data without constantly having to rely on the engineers?

That's what tools like Census, Hightouch, Grouparoo, and Polytomic aspire to do. They empower business teams through self-serve data access by enabling them to query data from the data warehouse through simple SQL interfaces that they are mostly familiar with. As these platforms handle all of the heavy lifting around syncing data from the warehouse to the various applications, data engineers now have more time to focus on the harder problems in the infrastructure layer.
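Conceptually, this "reverse ETL" pattern boils down to running a SQL query against the warehouse and pushing each resulting row into the downstream operational app. The sketch below is generic and illustrative only; the endpoint URL, field names, and warehouse connection object are hypothetical, and this is not the actual API of Census, Hightouch, or any specific CRM.

```python
# Generic "reverse ETL" sketch: run SQL against the warehouse, then push each
# row into an operational app. The endpoint, fields, and connection object are
# hypothetical; real tools batch, retry, and de-duplicate these writes.
import json
import urllib.request

SYNC_QUERY = """
    SELECT email, lifetime_value, churn_risk
    FROM analytics.customer_health
"""

def sync_to_crm(conn, endpoint: str, api_key: str) -> None:
    cur = conn.cursor()
    cur.execute(SYNC_QUERY)
    for email, lifetime_value, churn_risk in cur.fetchall():
        body = json.dumps({
            "email": email,
            "properties": {
                "lifetime_value": lifetime_value,
                "churn_risk": churn_risk,
            },
        }).encode()
        req = urllib.request.Request(
            endpoint,                  # hypothetical "upsert contact" endpoint
            data=body,
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
            method="POST",
        )
        urllib.request.urlopen(req)
```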

It’s exciting to see how emerging tools in this space continue to shape the way in which data is accessed and operationalized at scale. This evolution is what will move the needle the most in helping data and business teams align their functions and drive meaningful outcomes.

AUGMENTING STREAM & BATCH PROCESSING

THE DATA ECOSYSTEM IS SPLIT INTO 2 COMPUTING PARADIGMS: BATCH & STREAM DATA PROCESSING

Due to the rising demand for large-scale data processing, today's data systems have undergone significant changes to effectively handle transactional data models as well as support a larger variety of sources, including logs and metrics from web servers, sensor data from IoT systems, and more. In fact, the current data ecosystem is split between two fundamental computing paradigms: batch processing, where large volumes of data are scheduled and processed offline, and stream processing, where streams of data are continuously processed for real-time analysis.

Today, an increasing number of applications require both batch and stream data processing. For example, financial services organizations use stream analytics in areas where it is important to get fast analytical results on critical jobs, such as monitoring for fraud and analyzing customer behavior data and stock trades. On the other hand, batch processing is used for use cases where it's more critical to process large volumes of data than it is to get near-instant results, such as end-of-cycle data reconciliation.
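As a toy contrast between the two paradigms (plain Python, no framework, purely illustrative): a batch job computes over a complete, bounded dataset on a schedule, while a stream job updates its results incrementally as each event arrives.

```python
# Toy contrast: batch processes a bounded dataset in one offline pass;
# stream emits updated results continuously as events arrive.
from collections import defaultdict
from typing import Iterable, Iterator

def batch_totals(events: list[dict]) -> dict[str, float]:
    """Batch: the whole dataset is available; process it in one scheduled pass."""
    totals: dict[str, float] = defaultdict(float)
    for e in events:
        totals[e["account"]] += e["amount"]
    return dict(totals)

def stream_totals(events: Iterable[dict]) -> Iterator[dict[str, float]]:
    """Stream: events arrive continuously; emit an updated result per event."""
    totals: dict[str, float] = defaultdict(float)
    for e in events:                 # could be an unbounded source, e.g. a topic
        totals[e["account"]] += e["amount"]
        yield dict(totals)           # consumers see fresh results immediately
```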

CURRENT TECHNOLOGIES UNDERPINNING BATCH & STREAM DATA PROCESSING ARE EVOLVING

TREND 1 | SHIFTING FROM KAFKA TO PULSAR

TREND 2 | UNIFYING BATCH AND STREAM DATA PROCESSING SYSTEMS

TREND 3 | DELIVERING STREAM PROCESSING CAPABILITIES TO A DIVERSE USER BASE VIA SQL

TREND 1 | SHIFTING FROM KAFKA TO PULSAR

Kafka, the leading platform for high-throughput and low-latency messaging created by LinkedIn, became hugely popular because it can publish log and event data from multiple sources into databases in a real-time, scalable, and durable manner.

But Kafka is not as scalable and performant as it could be: owing to its monolithic architecture, the storage and serving layers in Kafka are coupled and can only be deployed together. What this means is that every time someone needs to access data from the storage layer, the request has to go through the message broker first, which increases query latency and reduces throughput.


Trend: Apache Pulsar is a next-generation messaging and queuing system that came out of Yahoo. Unlike Kafka, Pulsar has a multi-layer architecture that decouples its compute, storage, and messaging into separate layers, enabling developers to access data directly from each individual layer. This not only enables instant access to data as it gets published by the broker, but it also significantly increases throughput, data scalability, and availability.

TOOL DESCRIPTION

StreamNative is a cloud-native event streaming service powered by Apache Pulsar, founded by the co-creators of Apache Pulsar and BookKeeper.

Pandio is a distributed messaging service that offers Apache Pulsar as a service.
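For a concrete feel of the developer experience, here is a minimal sketch using the open source pulsar-client Python package. It assumes a Pulsar broker is reachable at the placeholder service URL, and the topic and subscription names are made up for illustration.

```python
# Minimal sketch with the open source pulsar-client Python package.
# Assumes a broker at the placeholder service URL; topic and subscription
# names are made up for illustration.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Publish a few events to a topic.
producer = client.create_producer("orders-topic")
for i in range(3):
    producer.send(f"order-{i}".encode("utf-8"))

# Consume them through a named subscription.
consumer = client.subscribe("orders-topic", subscription_name="analytics-sub")
for _ in range(3):
    msg = consumer.receive()
    print(msg.data().decode("utf-8"))
    consumer.acknowledge(msg)

client.close()
```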

TREND 2 | UNIFYING BATCH AND STREAM DATA PROCESSING SYSTEMS

Like event streaming platforms, batch processing systems have their own advantages and disadvantages. Innovation in the ETL pipeline has made it easier for data and business teams to work with batch data collaboratively. But since data in ETL is loaded on a schedule, every time a user poses a question, the data has to be processed all over again in order to answer that particular query. As more and more users query the same pipeline and spin up multiple workflows on an ad-hoc basis, this results in slower query times and higher infrastructure costs.

Trend: Traditionally, the operational and analytical stacks have largely been separate owing to the complexities of integrating the two. However, a trend that we're observing in this space is unifying these stacks in a model that expresses both batch and streaming computations, to offer the best of each of these technologies.
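One way to picture the unified model is the sketch below, written with the open source Apache Beam Python SDK as an example of the idea rather than the implementation of the tools listed next; the input file path is a placeholder. The same pipeline definition runs over a bounded source like a file and, by swapping the source, over an unbounded stream.

```python
# Small sketch of the unified batch/stream model using the Apache Beam Python
# SDK (one example of the idea; not the implementation of the tools below).
# The file path is a placeholder; swapping the source for an unbounded one
# turns the same pipeline into a streaming job.
import apache_beam as beam

def run():
    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("orders.csv")        # bounded source
            | "Parse" >> beam.Map(lambda line: line.split(","))
            | "KeyByAccount" >> beam.Map(lambda f: (f[0], float(f[1])))
            | "SumPerAccount" >> beam.CombinePerKey(sum)           # same transform, batch or stream
            | "Print" >> beam.Map(print)
        )

if __name__ == "__main__":
    run()
```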

TOOL DESCRIPTION

Materialize is a SQL streaming database built on top of the Timely Dataflow research project and is used for processing streaming data in real-time.

Estuary provides a unified foundational layer for batch and streaming workflows, built on top of Gazette, an open source streaming infrastructure.

TREND 3 | DELIVERING STREAM PROCESSING CAPABILITIES TO A DIVERSE USER BASE VIA SQL

While the first two forward-looking trends dealt with the infrastructure side of the challenge in batch and stream data processing, the user side of the problem remains largely unsolved. Data-driven organizations today have a large number of non-technical employees who need to analyze real-time streaming data, but they need to do so without having to interact with the complexities of the underlying infrastructure.

Trend: We are seeing a growing number of tools that democratize access for non-technical users by enabling them to query data through SQL, a language familiar to most data practitioners. These tools create an end-to-end self-service experience for users by providing a common base for data engineers, data scientists, and analysts to work collaboratively.

TOOL DESCRIPTION

AthenaX is Uber’s open source streaming analytics platform built on top of Apache Flink that empowers both technical and non-technical users to run and analyze streaming analytics in production using SQL.
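As an illustration of the pattern (a sketch using the open source Apache Flink Python API, PyFlink; AthenaX itself builds on Flink SQL), the analyst's entire job is a SQL query over a live stream. The table below uses Flink's built-in 'datagen' connector so the example is self-contained; in practice the source would be a real stream such as a Kafka or Pulsar topic.

```python
# Sketch of streaming analytics expressed purely in SQL, via PyFlink.
# The 'datagen' connector generates random rows so the example is
# self-contained; a real deployment would read from a message topic.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Define an unbounded source table; the non-engineer never sees the plumbing.
t_env.execute_sql("""
    CREATE TABLE orders (
        account STRING,
        amount  DOUBLE
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5'
    )
""")

# The analyst's entire job: a SQL query over the live stream.
result = t_env.execute_sql("""
    SELECT account, SUM(amount) AS total_amount
    FROM orders
    GROUP BY account
""")
result.print()
```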

SELECT RESOURCES TO GET SMARTER ON THE TOPIC

TECHNICAL BLOG POSTS
• Rethinking Flink's APIs for a Unified Data Processing Framework, by Aljoscha Krettek
• Logs & Offsets: (Near) Real Time ELT with Apache Kafka + Snowflake, by Adrian Kreuziger
• Kafka Alternative Pulsar Unifies Streaming and Queuing, by Susan Hall

INSTRUCTIONAL VIDEOS
• Unified Data Processing with Apache Flink and Apache Pulsar, by Seth Wiesman
• Designing ETL Pipelines with Structured Streaming and Delta Lake — How to Architect Things Right, by Tathagata Das
• From Batch to Streaming to Both, by Herman Schaaf

PEOPLE TO FOLLOW
• Stephan Ewen, PMC member of the Apache Flink project and CTO of Ververica
• Arjun Narayan, CEO and Co-founder of Materialize
• Eric Sammer, CEO of Decodable, former VP and Distinguished Engineer at Splunk and CTO of Rocana
• Maximilian Michels, Committer to Apache Flink and Apache Beam

CLOSING THOUGHTS

As the data infrastructure landscape continues to evolve, we expect to see more and more companies emerge to address the existing challenges across every vertical in the enterprise.

If you're a data practitioner or startup building in this space, please reach out, I'd love to chat! And sign up for my monthly newsletter, The Data Source, where I cover the top innovation in data engineering, analytics, and developer-first tooling.

www.work-bench.com

@Work_Bench

PRIYANKA SOMRAH, Analyst
[email protected]
[email protected]
