Streaming-First Architectures: Building the Real-Time Organization

By Julian Ereth

July 2019

This publication may not be reproduced or distributed without Eckerson Group’s prior permission.

About the Author

Julian Ereth is a researcher and practitioner in business intelligence and data analytics. His research focuses on new approaches in big data, advanced analytics, and the Internet of Things. Ereth is the author of multiple internationally published research papers and is currently earning his Ph.D. at the University of Stuttgart (Germany). He is cofounder of Pragmatic_Apps, which builds custom business software and analytics solutions.

About Eckerson Group

Eckerson Group helps organizations get more value from data and analytics. Our experts each have more than 25 years of experience in the field. Data and analytics is all we do, and we’re good at it! Our goal is to provide organizations with a cocoon of support on their data journeys. We do this through online content (thought leadership), expert onsite assistance (full-service consulting), and 30+ courses on data and analytics topics (educational workshops). Get more value from your data. Put an expert on your side. Learn what Eckerson Group can do for you!


Table of Contents

Executive Summary

Key Takeaways

Recommendations

The Rise of Data Streams

From Batch Processing to Stream Processing

Benefits of Stream Processing

Streaming Components

Stream Sourcing

Stream Transportation

Stream Processing

Streaming in Analytics

Streaming in Real-Time Analytics

Streaming-First Architectures

Benefits of a Streaming-First Architecture

Implementing a Streaming Architecture

Technology and Products

Open Source vs. Commercial Tools

Conclusion

About Eckerson Group


Executive Summary

Having the right data at the right time is essential for organizations that need to compete. The latest information about market movements, customer interactions, or operational data from the shop floor can tip the scales. However, gathering and processing data in (near) real time is not as easy as it sounds. Traditional analytics architectures were built mostly to support strategic business decision making, where timeliness is rarely critical. This model hits a wall when data velocity increases and requirements like real-time processing come into play. Accordingly, traditional architectures are integrating real-time components and gradually shifting toward “streaming first” concepts. But integrating streaming components in analytical landscapes presents challenges such as new tools, technologies, concepts, and methods, as well as a novel way of thinking about analytics architectures. And both experience and best practices are scarce in this area. This report helps business and technical executives understand data streaming, analyze their analytics architectures, and optimize them accordingly.

Key Takeaways

• Data streaming is superseding traditional batch operation in analytics architectures.

• Streaming components can be categorized as stream sourcing (e.g., edge processing or CDC), stream transportation (e.g., message brokers or event logs), and stream processing (e.g., CEP and stream analytics).

• There are two cases of data streaming in analytics:

1. Real-time analytics pipelines that provide ways to rapidly extract data from the edge and process it for immediate insights.

2. Stream-first architectures that utilize streaming components like event logs to combine systems in a flexible and asynchronous way.

• There are many great open source tools that perform certain tasks, but commercial vendors and tools help to integrate, extend, and run them on an enterprise level.


Recommendations

• Analyze the scenario to decide whether it calls for an isolated real-time analytics pipeline or a more profound transformation of the underlying architecture.

• Think beyond current needs. A streaming-first architecture enables the integration of streaming data and also improves the architecture’s agility and sustainability.

• When choosing a tool, think about factors like scalability, latency, and durability, and also consider the trade-off between an open source, best-of-breed approach and a commercial enterprise-ready solution.


The Rise of Data Streams

Having the right data at the right time is essential for organizations that need to compete in today’s fast-moving and data-driven world. This simple maxim grows truer every day. Having the latest information about current market movements, customer interactions, or operational data from the shop floor can tip the scales. Companies are realizing this fact, and streaming components are gaining value in modern data landscapes.

Having the right data at the right time is essential for organizations that need to compete in today’s fast-moving and data-driven world.

Integrating real-time components in analytical landscapes presents challenges such as new tools, technologies, concepts, and methods, as well as a novel way of thinking about analytics architectures. To make things worse, both experience and best practices are scarce in this area. This report first explains the basic ideas behind data streaming and then introduces the components needed to build modern data streaming solutions. Moreover, it describes how streaming can be integrated in analytics landscapes. Lastly, it outlines hands-on advice for implementing a streaming architecture and lists relevant streaming tools.

From Batch Processing to Stream Processing

Streaming data is a continuous flow of data generated from various sources. Stream processing is an umbrella term for methods to work with streaming data. During stream processing, data constantly moves from one stage to another, where it is only temporarily saved and immediately processed. For that reason, streaming data is often referred to as data in motion or data in flight. In contrast, data at rest is permanently persisted in a database and can be read at any time for further processing (see Figure 1).

Figure 1. Data in Motion vs. Data at Rest

[Figure: data in motion flows from a source directly into processing/analytics; data at rest flows from a source into storage, from which processing/analytics read it.]

Traditionally, analytical systems mostly work with data at rest. For example, in most data warehouse architectures, data is extracted, cleansed, and transformed by ETL processes and then persisted in a central data warehouse. From there, all downstream analytical systems can access the data, e.g., for creating reports or dashboards. These ETL processes usually run on a regular basis, e.g., every night, and process all available data in one batch.


Obviously, downstream analytics systems can only show data that has been processed by preceding ETL jobs. Accordingly, to show more current data in reports and dashboards, the batch jobs have to run more frequently, which in turn limits the size of the batch (see Figure 2). Following this logic, you eventually end up with a batch size of one, which means that each record is processed immediately and is available for downstream systems without delay. This is called stream processing.
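To make the contrast concrete, here is a minimal, hypothetical Python sketch of the same records handled as one scheduled batch versus one at a time as a stream; the data and function names are invented for illustration.

```python
# Minimal sketch contrasting batch and stream processing.
# Records and function names are hypothetical.

records = [{"order_id": i, "amount": 10.0 * i} for i in range(1, 6)]

# Batch processing: a scheduled job handles all accumulated records at once.
def nightly_batch_job(batch):
    total = sum(r["amount"] for r in batch)
    print(f"Processed batch of {len(batch)} records, total={total}")

nightly_batch_job(records)

# Stream processing: a batch size of one -- each record is handled
# immediately as it arrives, so downstream systems see it without delay.
def on_record(record):
    print(f"Processed order {record['order_id']} immediately")

for record in records:  # stands in for a continuous, unbounded source
    on_record(record)
```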

Figure 2. From Batch to Stream Processing

[Figure: as processing frequency increases, the size of the dataset per run shrinks from monthly and daily batches through micro-batches down to a batch size of one, i.e., real-time stream processing.]

Benefits of Stream Processing

Stream processing comes with various benefits that help analytical systems on a technological and business level.

More up-to-date insights. The most obvious benefit of using data streams is faster processing, which makes more up-to-date data available to the business and thereby leads to more relevant and valuable insights. In time-sensitive scenarios, like fraud detection or financial trading, this can mean a real competitive advantage.

Enabling new analytical use cases. Besides improving the data for existing analytics applications, stream processing also enables entirely new use cases like operational decision support, where real-time insights are needed, e.g., on a manufacturing shop floor where a worker has to decide which machine to maintain. This is also why stream processing is gaining attention in relation to the Internet of Things, where it can help transform sensor data into valuable business information.


Increased flexibility of analytical architectures. The concept of stream processing introduces a whole new mindset for working with data. Analytical architectures become less about central, highly cleansed data warehouses and more about process-oriented data pipelines and event-based data hubs. This structure provides more flexibility and makes many analytical systems more business-oriented.

Streaming Components

Streaming pipelines usually contain numerous components that collaborate in a procedural order. Figure 3 illustrates how streaming components can be categorized by systems that generate data streams (stream sourcing), components that handle the transportation of data (stream transportation), and systems that actually process streaming data by analyzing or transforming it (stream processing).

Figure 3. Components in a Streaming Pipeline

[Figure: stream sourcing (edge processing, change data capture, other streams) feeds stream transportation (direct messaging, message brokers, event logs), which feeds stream processing (transform & forward, complex event processing, stream analytics).]

Stream Sourcing

There are numerous ways in which data streams originate. Many times data streams reflect events triggered in upstream systems, such as a machine on a manufacturing shop floor or an operational database in an information system. Other times, incoming sources are already-prepared data streams forwarded by other stream processing systems.

Edge Processing: Processing data near its source (the edge) before sending it on to other systems is referred to as edge processing. In analytical landscapes this is a common method to save bandwidth (e.g., when a sensor aggregates data and only sends average values instead of raw readings) or to comply with privacy rules when sensitive data must remain within the source system. Edge processing is often proprietary and embedded in operational systems, but there are also more general solutions like Apache NiFi, which provides analytical methods for low-level systems.
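As an illustration of the bandwidth-saving pattern described above, the following hypothetical Python sketch aggregates raw sensor readings at the edge and forwards only averages; the sensor read and the network send are stand-in functions, not a real device API.

```python
# Hypothetical edge-processing sketch: aggregate raw sensor readings
# locally and forward only averages, saving bandwidth as described above.
import random

def read_sensor():
    """Stand-in for a real sensor read; returns a temperature in Celsius."""
    return 20.0 + random.random() * 5

def send_upstream(payload):
    """Stand-in for the network call to the central streaming system."""
    print("sending:", payload)

WINDOW = 10  # aggregate every 10 raw readings into one message

buffer = []
for _ in range(30):  # stands in for an endless read loop on the device
    buffer.append(read_sensor())
    if len(buffer) == WINDOW:
        send_upstream({"avg_temp": sum(buffer) / len(buffer), "n": len(buffer)})
        buffer.clear()
```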

Change Data Capture (CDC): CDC provides an efficient way to extract data streams from traditional databases by producing a log of changes that other systems can consume or replicate. Because this happens at a low level in the database, CDC can meet the scalability and performance requirements of real-time applications. CDC is built into many enterprise database systems, but there are also solutions that retrofit systems with CDC capabilities.

Stream Transportation

Messaging is the most common way of transporting streaming data. The idea is simple: one system (the sender) sends a data record to another system (the receiver). Real-world implementations, however, vary in many ways, e.g., in confirmation of receipt or in guaranteeing exactly-once delivery. As soon as more than two systems are involved, a message broker or event log comes in handy as an intermediary that distributes messages to multiple systems.

Figure 4. Types of Messaging Systems

[Figure: capabilities grow from direct messaging (messaging only), to message brokers (messaging, multiple clients), to event logs (messaging, multiple clients, persistence of data).]

Direct Messaging: This is the most straightforward way of transporting data streams: one producer sends data to one consumer. It is often implemented directly via UDP and used in low-level environments or where low latency is more important than ensuring consistency. Common direct messaging systems: nanomsg or ZeroMq.
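As a sketch of direct messaging, the snippet below uses ZeroMQ’s Python binding (pyzmq) to push records from one producer to one consumer. Note that ZeroMQ typically runs over TCP rather than raw UDP; the port and payload are assumptions, and in practice sender and receiver would be separate processes.

```python
# Sketch of direct messaging with ZeroMQ (pyzmq); producer and consumer
# are shown in one process for brevity.
import zmq

context = zmq.Context()

# Producer side: a PUSH socket sends records directly to connected peers.
sender = context.socket(zmq.PUSH)
sender.bind("tcp://*:5555")

# Consumer side: a PULL socket receives records as they arrive.
receiver = context.socket(zmq.PULL)
receiver.connect("tcp://localhost:5555")

sender.send_json({"sensor": "temp-1", "value": 21.7})
print(receiver.recv_json())  # -> {'sensor': 'temp-1', 'value': 21.7}
```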

Message Broker: In more sophisticated scenarios a message broker can be useful as a central stream distributor. A broker usually connects data producers and consumers via a publish-subscribe pattern, where one system sends data to a topic and others listen to that topic at the broker. This structure makes it easier to manage multiple clients and to provide features like fault tolerance. Common message brokers: RabbitMQ or ActiveMQ.
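A minimal publish-subscribe sketch with RabbitMQ’s pika client might look as follows, assuming a broker on localhost; the exchange name and payload are hypothetical. Because each consumer binds its own queue to the exchange, adding clients later is easy, which is the management benefit described above.

```python
# Sketch of publish-subscribe via a message broker (RabbitMQ + pika).
import pika

# Connect to a broker assumed to be running on localhost.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A fanout exchange plays the role of the "topic" described above.
channel.exchange_declare(exchange="sensor-events", exchange_type="fanout")

# Consumer side: each subscriber binds its own (here auto-named) queue.
result = channel.queue_declare(queue="", exclusive=True)
channel.queue_bind(exchange="sensor-events", queue=result.method.queue)

# Producer side: publish once; every bound queue receives a copy.
channel.basic_publish(exchange="sensor-events", routing_key="",
                      body=b'{"sensor": "temp-1", "value": 21.7}')

connection.close()
```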


Event Log: Most message brokers only keep data in memory, which is sufficient to resend messages if a system is temporarily offline. However, they are inadequate when it is necessary to access old messages, e.g., for replaying or for aggregations. This is where event logs, which additionally persist the stream data, come into play. Besides access to the stream history, event logs also provide more sophisticated ways of scaling by providing individual partitions for various clients. Common event logs: Kafka or Amazon Kinesis Streams.

Stream Processing

The most obvious way to process data streams is to apply simple transformations and forward the results, e.g., to serve as a streaming source for another pipeline or to write the transformed data into a persistent database. As a simple example, think of a stream of temperature values that is validated (i.e., outliers and null values are removed) and then provides cleansed data for downstream systems. More sophisticated ways to work with streams include complex event processing and stream analytics.
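The temperature example above could be sketched with an event log as transport, e.g., using the kafka-python client against a local broker; the topic names and the validity range are assumptions for illustration.

```python
# Sketch of transform-and-forward: read raw temperature readings from one
# Kafka topic, drop invalid values, and write the cleansed stream to another.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("raw-temperatures",
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

for message in consumer:
    reading = message.value
    temp = reading.get("value")
    # Validation: remove nulls and implausible outliers before forwarding.
    if temp is not None and -40.0 <= temp <= 60.0:
        producer.send("clean-temperatures", reading)
```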

Complex Event Processing (CEP): CEP is mostly about monitoring a stream and flagging a “complex event” when a certain pattern occurs. CEP systems often use a declarative language to define patterns and compare stream data against existing databases. Use cases for CEP are broad, ranging from fraud detection in finance to intelligence and military security systems.

Stream Analytics: Stream analytics can be seen as an extension of CEP; rather than finding patterns, its focus is transforming and querying the data to extract insights. A common use case is predictive maintenance, where real-time machine sensor data is aggregated and compared to historical data to yield insights about the likelihood of a failure. Most tools on the market support CEP as well as more sophisticated analytics scenarios. Common stream processing tools: Spark Streams, Apache Flink, Apache Storm, IBM Streams, or Concord.
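As a sketch of stream analytics, the following uses Spark Structured Streaming to compute a windowed average over machine sensor readings read from Kafka, in the spirit of the predictive maintenance example above; the schema, topic name, and broker address are assumptions.

```python
# Sketch of stream analytics: one-minute windowed average vibration per
# machine, computed over a Kafka stream with Spark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("sensor-analytics").getOrCreate()

schema = StructType([StructField("machine", StringType()),
                     StructField("ts", TimestampType()),
                     StructField("vibration", DoubleType())])

# Read the sensor stream from Kafka and parse the JSON payload.
readings = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "machine-sensors")
            .load()
            .select(from_json(col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

# Windowed aggregation; unusually high averages could then be compared
# against historical failure data downstream.
avg_vibration = (readings
                 .groupBy(window(col("ts"), "1 minute"), col("machine"))
                 .agg(avg("vibration").alias("avg_vibration")))

query = (avg_vibration.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```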


Streaming in Analytics

Figure 5. Two Approaches to Streaming in Analytics

[Figure: streaming in real-time analytics — architectures built for specific use cases that require real-time data processing — versus streaming-first architectures — using streaming approaches to build agile analytical landscapes.]

Streaming components can be found in most of today’s analytical landscapes. Most commonly, they are part of rather specific real-time applications that feed individual dashboards or push data into other analytical components (e.g., a data warehouse). However, streaming components can also help to build a new type of architecture that enhances flexibility in analytical landscapes. Accordingly, one can distinguish two approaches to streaming in analytics: (i) streaming in real-time analytics and (ii) streaming-first architectures.

Streaming in Real-Time Analytics

Streaming for real-time analytics mostly reflects the streaming pipeline displayed in Figure 6. Such pipelines usually comprise an edge analytics component that pre-processes data near the source and then sends it via stream transportation (i.e., messaging) to downstream analytics systems that conduct complex event processing or other stream analytics to extract insights.

Figure 6. Streaming Pipeline for Real-Time Analytics

[Figure: edge analytics → stream transportation → complex event processing & stream analytics.]

Real-time analytics can mostly be found in scenarios with low-latency requirements and where immediate insights are key. Common use cases are the following:

• Analytics in the Internet of Things (IoT). In the IoT, streaming is key: there is a multitude of real-time data sources (i.e., sensors), and many use cases require immediate data processing, e.g., monitoring machines in manufacturing or automated video analytics for security cameras.


• Real-Time Fraud Detection. Fraud detection is a good example of the need for real-time insights, as a delay in processing might lead directly to damage for the business. Large finance companies, for instance, use CEP on big data to identify and stop suspicious transactions as they happen (or even before).

• Patient Monitoring in Hospitals. Devices in modern hospitals generate a continuous stream of data that can be processed and analyzed to raise automatic alerts, e.g., when vital parameters match certain patterns.

Real-time analytics can mostly be found in scenarios with low- latency requirements and where immediate insights are key.

From an implementation perspective, these approaches are closely tied to operational systems and therefore are often not integrated with the rest of the analytical landscape, such as the data warehouse or other dispositive decision support systems. This missing link between real-time data and other business data can be a barrier to fully exploiting the value of the data. What this means and how an integrated analytics landscape can be built is discussed below.

Streaming-First Architectures

As real-time data grows in relevance, so does the role of streaming components in analytical architectures. Figure 7 illustrates the evolution from traditional batch-first to hybrid lambda architectures to event-driven, streaming-first approaches.

Figure 7. Evolution of Streaming in Analytical Architectures

[Figure: batch-first (a persistence layer feeds analytics), hybrid lambda architecture (persistence and stream processing feed analytics in parallel), and streaming-first / event hub (stream processing feeds analytics, with persistence attached to the stream).]


Batch-First: In most traditional data architectures, batch jobs periodically extract, transform, and load (ETL) data from various sources. This data is then stored in a persistence layer (e.g., a data warehouse), which provides an integrated view for downstream analytics systems. This approach is straightforward and ensures consistency, as all data is checked before entering the data warehouse and analytics systems access a single data source. Such a traditional batch architecture is sufficient for most non-time-critical use cases, but as soon as more up-to-date or even near-real-time information is needed, these batch-oriented concepts hit a wall.

Hybrid “Two-Speed” Architectures: To integrate the real-time and batch worlds, various hybrid approaches have emerged. One popular concept is the Lambda architecture, which combines pre-aggregated batch views with incoming streaming data to make real-time data available to downstream systems. Although these hybrid ideas solve many issues, they often introduce new complexity and redundancy, making them far from perfect solutions.

Event-Driven Streaming-First Architectures: The emergence of sophisticated message brokers and event logs enabled a new way of building analytical architectures free from the constraints of regular batch jobs, with components connected in an event-driven way. Figure 8 illustrates an event-driven, streaming-first architecture where an event hub serves as a central integration component and individual systems can produce and consume arbitrary data via message queues. Here, it is important to see that the streaming-first idea is not only about integrating streaming data, but also transforms the way analytics architectures are built by enabling asynchronous analytics pipelines that put an end to traditional layer thinking.

Figure 8. Event-Driven Analytics Architecture

[Figure: an event hub (e.g., Apache Kafka) sits at the center; a data source (ERP via CDC) feeds transformations (cleanse, aggregate) into a data sink (data warehouse), while a second source (supplier data) is merged into a real-time dashboard, all via the hub’s message queues.]


The example pipeline in Figure 8 cleanses and transfers orders from an ERP system into a data warehouse. Using CDC, this pipeline listens to incoming orders and uses them as a stream that is processed step-by-step using multiple message queues. This in turn enables other processes to hook into the pipeline and use the data, e.g., for a real-time dashboard that integrates supplier data and visualizes this information.
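One way to picture a single stage of this pipeline is the following hedged sketch: a “cleanse” process consumes raw order events from one Kafka topic and republishes validated records to the next, so that the aggregation stage and the dashboard pipeline can each subscribe independently. The topic names and the cleansing rule are invented for illustration, and a local broker is assumed.

```python
# Sketch of the "cleanse" stage from Figure 8: consume raw ERP order
# events (e.g., produced by a CDC connector) and republish cleansed
# records for downstream stages to consume independently.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("erp-orders-raw",
                         bootstrap_servers="localhost:9092",
                         group_id="cleanse-stage",
                         value_deserializer=lambda v: json.loads(v))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

for message in consumer:
    order = message.value
    # Hypothetical cleansing: skip incomplete orders, normalize currency.
    if order.get("order_id") and order.get("amount") is not None:
        order["currency"] = order.get("currency", "USD").upper()
        producer.send("erp-orders-clean", order)
```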

Event Hub vs. Enterprise Service Bus

The event hub described here looks similar to an enterprise service bus (ESB). An event hub, however, is different in the following ways. First, its simple design enables higher throughput and easier management. Second, tools like Kafka provide additional features like storage of data and can be run in distributed clusters that enable scalability. Last, the data flows in an event hub are usually managed in a decentralized manner by the individual teams, whereas ESBs are mostly managed by central teams.

Benefits of a Streaming-First Architecture

As outlined here, an event-driven, streaming-first architecture constitutes a mind shift in building analytics architectures. The benefits can be condensed to the following three points:

Accelerating Insights: Using an event-driven approach obviously accelerates data processing in analytics architectures, as it renders regular batch processes obsolete. Records can be immediately processed and can be considered in downstream reports or dashboards.

Improving Flexibility: The underlying idea of using an event log as a central component enables various systems to hook into a data pipeline. For instance, if a new system has to be integrated, it can use its own queue and work directly on production data without interfering with existing systems.

Streamlining Integration and Coping with Legacy Systems: One major benefit is the non-intrusive approach of event-driven architectures, which readily allows for integrating various systems. Existing systems can easily be integrated as data sources and sinks, and even legacy batch jobs can be incorporated. For instance, a streaming pipeline can write data into an existing operational data store that serves as the starting point for a traditional ETL process that regularly loads data into a data warehouse.

Event-driven architectures not only enable the integration of streaming data but also improve agility and sustainability of an architecture.


Implementing a Streaming Architecture

Before implementing a streaming architecture in your organization, consider the following points.

1. Analyze Your Use Case

If you are facing a very specific use case that overlaps little with the rest of the analytical landscape, an isolated real-time pipeline might be right for you. If a more sophisticated integration of streaming components into the rest of the analytical landscape is planned, a streaming-first architecture might be the more reasonable choice.

2. Think of Scaling, Latency, and Durability

When it comes to choosing tools for your scenario, consider the following factors:

• Scaling: Depending on the throughput, the data volume, and the number of connected systems, either a simple messaging tool or a distributed solution like Kafka may be adequate.

• Latency: As a rule of thumb, the more low-level a system is, the higher its performance. Accordingly, low-latency use cases with high throughput might be better off with low-level direct messaging, whereas message brokers and event logs provide additional features that are useful for more sophisticated architectures but also add overhead.

• Durability: Is there a need to persist streaming data, e.g., to provide additional query capabilities? Or are process-and-forget pipelines, e.g., for complex event processing, enough? (A configuration sketch follows at the end of this section.)

3. Think beyond Your Current Needs

When making a case for a streaming-first architecture, think beyond your current needs. Building an event-driven architecture not only enables the integration of streaming data but also improves the agility and sustainability of the architecture, which is especially important in the long run. At the same time, a simple real-time pipeline with direct messaging might quickly grow into a more complex scenario where a message broker would be useful.
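As one concrete way to address the durability question above, the sketch below uses kafka-python’s admin client to create a topic with an explicit retention period; the broker address, topic name, partition count, and seven-day retention are assumptions for illustration.

```python
# Sketch of configuring durability on an event log (Kafka) by creating
# a topic with an explicit retention period via the admin client.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

orders_topic = NewTopic(
    name="orders",
    num_partitions=3,        # partitions support scaling across consumers
    replication_factor=1,    # use >1 in production for fault tolerance
    topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
)
admin.create_topics([orders_topic])
```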


Technology and Products

The range of products in the streaming sector is very large and can be confusing. Figure 9 shows a selection of popular tools categorized by their main use. Most streaming tools serve a specific task, but other products bundle different tools to provide end-to-end solutions.

Figure 9. A Selection of Data Streaming Tools

Open Source Tools: Nifi, Pulsar, Streamlio, DataBus, Esper, Hazelcast Jet, Flume, Samza, Apex, Flink, ZeroMq, Spark Streams, RabbitMQ, Storm, Kafka

Commercial Tools: IBM Streams, Attunity Stream, Confluent, Tibco StreamBase, Talend Data Streams, Oracle Stream Analytics, Striim, Azure Stream Analytics, Apama, Cloudera Data Flow, IBM Event Streams, SAS Event Stream Processing, Informatica Big Data Streaming

Open Source vs. Commercial Tools

As Figure 9 shows, plenty of open source and commercial tools exist in the streaming sector. The open source tools usually originate from specific use cases and mostly solve one task very well. Examples are Nifi for edge processing, Kafka for event log messaging, and Spark Streams for advanced stream analytics. However, as these tools are open source and often developed and maintained by a community, they usually lack enterprise support, and when it comes to installation, configuration, and deployment, you are mostly on your own. This is where commercial vendors come in: they provide enterprise-ready solutions and support for hosting, running, and integrating streaming solutions. Commercial products often draw on selected open source products and enrich them with integrations to enterprise systems or capabilities like security and governance.

Open source tools are great for custom best-of-breed approaches, but commercial tools provide the convenience of enterprise-ready solutions.


Conclusion

This report has shown that data streaming is becoming increasingly relevant for today’s businesses and that streaming is gradually replacing traditional batch processing. Streaming pipelines usually involve systems for stream sourcing (e.g., edge processing or CDC), stream transportation (e.g., direct messaging, brokers, or event logs), and stream processing (e.g., CEP and stream analytics).

Data streaming plays a key role in emerging areas and enables new use cases.

In relation to analytics, data streaming plays a key role in emerging areas like IoT, and enables new use cases like operational decision support. On closer examination, streaming in analytical landscapes can serve well in two use cases:

• Real-time analytics pipelines that provide ways to rapidly extract data from the edge and process it for immediate insights.

• Streaming-first architectures that utilize streaming components like event logs to combine systems in a flexible and asynchronous way.

Both use cases are becoming increasingly important for modern businesses, and a plethora of tools are available to implement data streaming. As in many big data areas, there are numerous open source software products that serve very specific purposes and that are developed and maintained by an open community. However, there are also commercial vendors that bundle different products and enrich them to provide enterprise-ready solutions.

In summary, data streaming is becoming important for more and more organizations, and the concept of streaming-first is gaining traction with companies building greenfield analytics architectures or restructuring legacy landscapes.


About Eckerson Group

Wayne Eckerson, a globally known author, speaker, and advisor, formed Eckerson Group to help organizations get more value from data and analytics. His goal is to provide organizations with a cocoon of support during every step of their data journeys. Today, Eckerson Group helps organizations in three ways:

• Our thought leaders publish practical, compelling content that keeps you abreast of the latest trends, techniques, and tools in the data analytics field.

• Our consultants listen carefully, think deeply, and craft tailored solutions that translate your business requirements into compelling strategies and solutions.

• Our educators share best practices in more than 30 onsite workshops that align your team around industry frameworks.

Unlike other firms, Eckerson Group focuses solely on data analytics. Our experts each have more than 25 years of experience in the field. They specialize in every facet of data analytics, from data architecture and data governance to business intelligence and artificial intelligence. Their primary mission is to help you get more value from data and analytics by sharing their hard-won lessons with you.

Our clients say we are hard-working, insightful, and humble. We take the compliment! It all stems from our love of data and desire to help you get more value from analytics. We see ourselves as a family of continuous learners, interpreting the world of data and analytics for you and others.

Get more value from your data. Put an expert on your side. Learn what Eckerson Group can do for you!
