An Overview Of Streaming Data Technology

Why Streaming Data?

Many companies are looking to use their data more intelligently in order to improve their customer experience and the efficiency of their business.

Speed is an important part of this. The earlier you can respond to incoming data, the more opportunities you have to improve the customer experience.

Recognising this, businesses are implementing Streaming Data architectures and platforms. These allow them to process and integrate data in real time, analyse it whilst in flight, and automate actions and workflows in response to situations as they arise.

Moving to Streaming and Event Driven Architectures is a fairly fundamental change for businesses and their technology landscape, but one which is increasingly valuable in meeting customer expectations in today's digital world.

Streaming Engine

The technology at the heart of your Streaming architecture will be a streaming engine. This will be responsible for transporting data from source to destination, from publishers to subscribers, in a reliable, scalable and performant manner.

This can be thought of as similar to the enterprise "message brokers" of old such as TIBCO, ActiveMQ and RabbitMQ, though modern open source offerings are much more lightweight and scalable.

Key products to investigate include:

Apache Kafka - By far the most widely deployed streaming engine in industry. Kafka is supported and driven by Confluent, who provide many commercial and managed service offerings on top of Kafka;
Apache Pulsar - A newer challenger which is generating debate about its performance, infrastructure requirements and operational simplicity compared with Kafka.
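
To make the publish/subscribe model concrete, here is a minimal sketch of publishing an event to Kafka from Java, assuming a broker on localhost:9092 and a hypothetical "orders" topic; the other engines mentioned in this section expose similar produce and consume APIs.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderPublisher {
    public static void main(String[] args) {
        // Assumes a Kafka broker on localhost:9092 and an existing "orders" topic
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event; any subscribed consumer will receive it in near real time
            producer.send(new ProducerRecord<>("orders", "order-123", "{\"amount\": 42.50}"));
        }
    }
}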

The cloud providers also offer streaming engines as managed services, typically charged using consumption based billing. These are worth investigating to avoid the overhead of deploying and managing your own platform:

AWS Kinesis - A streaming engine provided by Amazon Web Services;
Google Pub/Sub - A streaming engine provided by Google Cloud Platform;
Azure Event Hubs - A streaming engine provided by Microsoft Azure.

Event Sourcing

In order to move towards Event Streaming and Event Based Architectures, we need to be able to source and publish data in real time, and in a format which we can subsequently work with.

This can be hard because most line of business applications do not publish events as they are entered by users. Instead, they interact directly with a relational database specific to their application.

Historically, businesses have accessed this data through batch extracts loaded into central data warehouses, but this process leads to delays and can be fragile. For this reason, we often need to implement technology to source the events as a real time stream and transform them into a format that we can then work with. Sometimes we can do this by making changes to our source application, and sometimes we have to interrogate the source application to extract the events in near real time.

Key products to investigate include:

Debezium - Listens to the transaction logs of popular databases such as MySQL in order to turn them into a real time stream of change events;
Singer - Connects to hundreds of end sources to extract data and turns it into JSON based events which are suitable for subsequent processing and analysis;
Kafka Connect - Part of the Confluent ecosystem for sourcing data from databases, file stores and other repositories and pushing it into Kafka or other destinations.
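
As a rough sketch of what the sourced stream looks like downstream, the following consumes the change events that a Debezium connector running under Kafka Connect might publish, and pulls out the new row state. The topic name is hypothetical, and the envelope assumed here is Debezium's default JSON structure.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ChangeEventReader {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "change-event-reader");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        ObjectMapper mapper = new ObjectMapper();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Topic name is hypothetical; Debezium derives it from server, schema and table names
            consumer.subscribe(List.of("dbserver1.inventory.customers"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) {
                        continue; // tombstone records mark deletions
                    }
                    // Each change event carries the row state before and after the change
                    JsonNode after = mapper.readTree(record.value()).path("payload").path("after");
                    System.out.println("Row is now: " + after);
                }
            }
        }
    }
}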

Event Processing

When we have a stream of events, we often want to process them while they are "in flight", before they are stored in a database.

For instance, we might want to filter out events, add additional information to them, or carry out analysis such as asking how many events we have seen in a given time window. If a situation of interest occurs, we may wish to respond by alerting a human or interacting with some API to handle the situation automatically.

To do this, we use an Event Processing Engine, where we can write code in a procedural language such as Java, or in a higher level language such as SQL or another domain specific language, to process the events as they pass through.
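
As a minimal sketch of this pattern, the following uses the Kafka Streams library (covered below) to filter a stream down to events of interest, tag them, and publish them to a topic that an alerting service or human could watch. The "payments" and "alerts" topics are hypothetical.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class PaymentMonitor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-monitor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Filter the stream to large payments, enrich each one with an alert reason,
        // and publish the result to an "alerts" topic for downstream action
        builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> value.contains("\"amount\": 10000")) // crude check; a real application would parse the JSON
               .mapValues(value -> "{\"alert\": \"large payment\", \"event\": " + value + "}")
               .to("alerts", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}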

Key products to investigate include:

Kafka Streams - Works as a library which can process and provide analytics over Kafka topics as data streams in. The Kafka Streams model is attractive because it does not require any external cluster to be managed;
Flink - A widely used open source framework which is particularly strong at providing intelligent analytics over event streams with extremely low latency. Flink is less tightly bound to Kafka;
Spark Streaming or Structured Streaming - The streaming components of Apache Spark, allowing you to process extremely large datasets as they stream into a cluster, giving us the ability to unify batch and stream processing.

Streaming ETL & Data Integration

Extract, Transform and Load is about extracting data from a data source, transforming it into the required format, then loading it into the destination repository. This is commonly found within enterprises that want to extract data from line of business operational systems and move it into a data warehouse for reporting and business intelligence.

Streaming approaches can also be used to implement ETL, moving data from sources, transforming it, and inserting it into the destinations in real time. This improves the time to insight for business users, who have historically been waiting for batch integrations to run before they can see the latest data in their systems.

In some ways Streaming ETL is a simpler subset of "Event Processing" and can be implemented using the same technologies. However, there are tools more tailored to the ETL process which concentrate on automating this data exchange and providing visual UIs for implementing and managing it.

Key products to investigate include:

StreamSets - Can be used to deliver continuous data from sources to endpoint repositories based on a visual workflow designer;
Striim - Provides real-time data integration with intelligence built into the transformation pipelines;
AWS Glue - An ETL tool from AWS which is increasingly offering support for streaming data.

Real Time Databases

Having processed our data, the next thing we need to do is store the events in databases for subsequent search and analytics.

This could be any database such as Oracle, SQL Server, Postgres or MySQL, but there are many databases which are more tailored to real time analytics. For instance, we are likely to need real-time ingestion and fast analytics over the very large, unbounded streams of data found on event streams. It is therefore worth moving to a more appropriate database.

Key products to investigate include:

Elastic - The Elastic stack allows you to ingest very large collections of schema-less events, analyse and search them, and visualise them using Kibana;
Apache Druid - An extremely scalable open source database which sits in the sweet spot between analytical, time series and document databases;
SingleStore - A database focussed on high performance analytical workloads;
Clickhouse - An extremely fast column oriented database which is very scalable and resilient.
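
As a sketch of the ingestion side, the following loads events into an analytical database over JDBC. The connection string, driver and "events" table are hypothetical (shown here in a ClickHouse style), and in practice the rows would arrive continuously from the event stream and be written in batches.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class EventLoader {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string; assumes the ClickHouse JDBC driver is on the classpath
        // and that an "events" table with (event_time, event_type, payload) columns already exists
        try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO events (event_time, event_type, payload) VALUES (?, ?, ?)")) {

            // Analytical databases prefer inserts in batches rather than row by row
            stmt.setObject(1, java.time.LocalDateTime.now());
            stmt.setString(2, "order_created");
            stmt.setString(3, "{\"amount\": 42.50}");
            stmt.addBatch();

            stmt.executeBatch();
        }
    }
}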

Materialised Views

If we have very large volumes of data, it can be difficult and slow to analyse. We can often make this tractable using Event Processing approaches such as pre-aggregating or filtering data, but this isn't an ideal solution.

Another emerging technology is the ability to build materialised views over streaming data, where we constantly update the results of a query as new data streams in. For instance, every time an order takes place, we could update a view of how many orders have happened in the current hour. When we query this view, performance will be fast because the table has been pre-computed.
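
To illustrate the idea, here is a rough sketch of that hourly order count maintained as a continuously updated state store using Kafka Streams; the topic and store names are hypothetical, and the products below let you express the same thing declaratively, for example as SQL in KSQLDB.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.state.WindowStore;

import java.time.Duration;
import java.util.Properties;

public class OrdersPerHourView {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-per-hour-view");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Every new order event updates the pre-computed hourly count in the
        // "orders-per-hour" store, so reads against the view are cheap
        builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
               .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("orders-per-hour"));

        new KafkaStreams(builder.build(), props).start();
    }
}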

Key products to investigate include:

Materialize - A platform which simplifies application development with streaming data;
KSQLDB - A platform and database for building real time stream processing applications.

Real Time Visualisation

When our data is in an appropriate data store, we often need to surface it to our business users, who want a relatively real time view on dashboards, in reports or within their applications.

Many users will be familiar with visualisation and reporting tools such as PowerBI and Tableau. Though these are not particularly real time, our aim as data engineers is to take the streaming data and prepare it in a format that allows people to build their self service analytics on up to date data. To get a truly real time front-end experience, you often need to move into the realm of bespoke application development.

Timeflow offer a low code platform and a range of professional services to help companies become reactive, real-time businesses using Streaming Data.

Visit our website, view our demos, or get in touch to start the conversation!