IJARSCT ISSN (Online) 2581-9429
International Journal of Advanced Research in Science, Communication and Technology (IJARSCT)
Volume 5, Issue 2, May 2021, Impact Factor: 4.819

Real-Time Data Analytics with Apache Druid

Correa Bosco Hilary
Department of Information Technology (MSc. IT Part 1), Sir Sitaram and Lady Shantabai Patkar College of Arts and Science, Mumbai, India

Abstract: The shift towards real-time data flow has a major impact on the way applications are designed and on the work of data engineers. Dealing with real-time data ingestion brings a paradigm shift and an added layer of challenges compared to traditional integration and processing methods. There are real benefits to leveraging real-time data, but it requires specialized considerations in setting up the ingestion, processing, storage, and serving of that data. It brings about specific operational needs and a change in the way data engineers work, and these should be considered when embarking on a real-time journey. In this paper we look at real-time data analytics with Apache Druid. Apache Druid (incubating) is a high-performance analytics data store for event-driven data. Druid's core design combines ideas from OLAP/analytic databases, time-series databases, and search systems to create a unified system for operational analytics.

Keywords: Distributed, Real-Time, Fault-Tolerant, Highly Available, Open Source, Analytics, Column-Oriented, OLAP, Apache Druid

I. INTRODUCTION
Streaming data integration is the foundation for streaming analytics. Specific use cases such as IoT device logs, contextual marketing triggers, and dynamic pricing all rely on using a data feed or real-time data. If you cannot source the data in real time, there is very little value to be gained in attempting to tackle these use cases. With data streaming and updating from ever more sources, businesses are looking to quickly translate this data into intelligence to make important decisions, usually in an automated way. Real-time data streams have become more popular due to the Internet of Things (IoT), sensors in everyday devices, and of course the rise of social media. These platforms provide constantly changing state; analysing it even a day later can give misleading or already outdated information.

II. GENERIC INFRASTRUCTURE FOR REAL-TIME DATA FLOWS
Besides enabling new use cases, real-time data ingestion brings other benefits, such as a decreased time to land the data, a reduced need to handle dependencies, and other operational advantages.

2.1 Ingestion Layer
Ingesting clickstream data often requires a specific infrastructure component to facilitate it. Snowplow and Divolte are two open-source clickstream collectors. Frameworks such as Apache Flume and Apache NiFi, offering features such as data buffering and backpressure, help integrate data onto message queues/streams. A message bus, or stream, is the component that transfers the data across the different components of the real-time data ecosystem. Some of the common technologies used are Kafka, Pulsar, Kinesis, Google Pub/Sub, Azure Service Bus, Azure Event Hub, and RabbitMQ, to name just a few. Different processing frameworks exist to simplify computation on data streams. Technologies such as Apache Beam, Flink, Apache Storm, and Spark Streaming can significantly help with the more complicated processing of data streams. It is also possible to query streams directly using SQL-like languages: Azure Event Hub supports Azure Stream Analytics, Kafka offers KSQL, and Spark offers Spark Structured Streaming to query multiple types of message streams.
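To make the ingestion layer concrete, the following Python sketch publishes a single clickstream event onto a Kafka topic using the kafka-python client. The broker address, topic name, and event fields are assumptions chosen for illustration; any of the other message buses listed above could play the same role with an equivalent client library.

# A minimal sketch of the ingestion step, assuming Kafka as the message bus.
# The broker address, topic name, and event fields are illustrative, not part
# of any particular collector. Requires the kafka-python package.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Serialize Python dicts to JSON bytes before they hit the topic.
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "timestamp": int(time.time() * 1000),  # event time in milliseconds
    "deviceId": "device-42",               # hypothetical dimension values
    "campaignId": "spring-sale",
    "clicks": 1,
}

# Publish the event onto the "clickstream" topic and wait for acknowledgement.
producer.send("clickstream", value=event)
producer.flush()

In practice, a collector such as Snowplow or Divolte would sit in front of this step and emit such events onto the bus automatically.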
2.2 Processing Layer
Streams typically need to be enriched to provide additional data meant to be used in real time. Enrichment can involve lookups on additional services or databases, first-stage ETL transformations, or adding machine-learning scores onto the stream. Enrichment of messages typically happens through a producer/consumer or publisher/subscriber pattern. These applications can be coded in any language and often do not require a specialized framework. Although such frameworks and tooling exist, such as Spark Streaming, Flink, or Storm, for most use cases a normal service application can perform adequately without the overhead, complexity, or specific expertise of a streaming computation framework. Stateful enrichment and cleanup of the data might nevertheless be needed before the data can be used downstream:
Stateful Enrichment: Event-based applications might need to consume data enriched with historical data.
Stateful Cleanup: This can be the case when attempting to use customer data coming from different sources in CRM systems that want a 360-degree view of the customer, for instance to leverage contextual marketing triggers.
Stateful Deduplication: Some message brokers only offer an at-least-once delivery option, creating the need to deduplicate events.

2.3 Storage Layer
Real-time data brings about different challenges in terms of storing and serving collected and processed data. Data tends to have different access-pattern, latency, or consistency requirements, impacting how it needs to be stored and served. To properly handle the different needs arising from real-time processing, it is important to have the correct systems to manage the type of workload and access pattern for the data. Depending on how the data is consumed and its volume/velocity, teams might complement the data platform with OLAP, OLTP, HTAP, or search-engine systems.

2.4 Serving Layer
There are many ways to integrate real-time data; the most common are through dashboards, query interfaces, APIs, webhooks, firehoses, or pub/sub, and by directly integrating into OLTP databases. The particular method the data is served through depends heavily on the nature of the intended use case. For instance, when integrating with a live application, different options are available: offering an API, publishing events through a webhook, firehose, or pub/sub mechanism, or alternatively integrating directly with an OLTP database. Analysts, on the other hand, might find a dashboard or a query interface a better fit.

III. DRUID AND REAL-TIME ANALYTICS
Apache Druid is a real-time analytics database designed for rapid analytics on large datasets. It is most often used to power use cases where real-time ingestion, high uptime, and fast query performance are needed. Druid can be used to analyze billions of rows not only in batch but also in real time. It offers many integrations with different technologies such as Apache Kafka, cloud storage (e.g. S3), Hive, HDFS, DataSketches, Redis, etc. It also follows the principle of an immutable past and an append-only future: as past events happen once and never change, they are immutable, and appends only take place for new events.
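As a hedged illustration of what real-time ingestion into Druid can look like, the Python sketch below submits a Kafka supervisor spec to the cluster's ingestion API. The host and port, topic, datasource, and field names (deviceId, campaignId, clicks) are assumptions that mirror the producer example from Section 2.1, and the spec is only a skeleton of the options Druid accepts.

# A minimal sketch, assuming a local Druid cluster with the Kafka indexing
# service extension loaded and the Router reachable on port 8888. The topic,
# datasource, and field names are hypothetical.
import requests

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "timestamp", "format": "millis"},
            "dimensionsSpec": {"dimensions": ["deviceId", "campaignId", "country"]},
            "metricsSpec": [
                {"type": "count", "name": "events"},
                {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
            ],
            # One segment per hour, rows rolled up to minute granularity.
            "granularitySpec": {
                "segmentGranularity": "HOUR",
                "queryGranularity": "MINUTE",
                "rollup": True,
            },
        },
        "ioConfig": {
            "topic": "clickstream",
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "useEarliestOffset": True,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submit the spec; the Router proxies it to the Overlord, which starts the
# supervisor and returns its id.
resp = requests.post(
    "http://localhost:8888/druid/indexer/v1/supervisor",
    json=supervisor_spec,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())

Once such a supervisor is running, Druid consumes events from the topic continuously and organizes them into the segments described next.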
Apache Druid provides users with fast and deep exploration of large-scale transactional data. Data is stored in chunks, and chunks are immutable. Once segments are created, you cannot update them (you can create a new version of a segment, but that implies re-indexing all the data for the period). You can configure how those chunks are created (one per day, one per hour, one per month, …) and also define the granularity of the data inside the chunks. If you know you only need the data per hour, you can set up your chunks to roll up the data automatically. Inside a segment, the data is stored by timestamp, dimensions, and metrics:
1. Timestamp: the event timestamp (rolled up or not).
2. Dimension: a dimension is used to slice or filter the data. Examples of dimensions are city, state, country, deviceId, campaignId, …
3. Metric: a metric is a counter/aggregate that is computed. Examples of metrics are keyword clicks, page impressions, response time, …
Druid supports a variety of aggregations by default, such as first, last, doubleSum, and longMax. There are also custom/experimental aggregations available, such as approximate histograms, DataSketches, or your own: you can easily implement custom aggregations as a plugin to Druid.

Some of Druid's Key Features
1. Columnar storage format.
2. Scalable distributed system.
3. Parallel processing.
4. Real-time or batch ingestion.
5. Self-healing, self-balancing, easy to operate.
6. Cloud-native, fault-tolerant architecture.
7. Indexes for quick filtering.
8. Time-based partitioning.
9. Approximate algorithms.
10. Automatic summarization at ingest time.

You should consider Druid if you face the following challenges or use cases:
1. Time-series data to store.
2. Data with somewhat high cardinality.
3. The need to query this data fast.
4. The need to support streaming data.
5. Digital marketing (ads data).
6. User analytics and behavior in your products.
7. APM (application performance management).
8. OLAP and business intelligence.
9. IoT and device metrics.

How Does it Work Under the Hood?
Every Druid installation is a cluster that requires multiple components to run. The Druid cluster can run on a single machine (great for development) or fully distributed across a few to hundreds of machines.

External Dependencies Required by Druid
Metadata storage: an SQL-powered database such as PostgreSQL or MySQL. It is used to store information about the segments, some loading rules, and some task information.
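Although the metadata store is rarely queried directly, the segment information it tracks can be inspected through Druid's SQL endpoint. The sketch below is illustrative only: it assumes a Router on localhost:8888 and the hypothetical clickstream datasource from the earlier examples.

# A minimal sketch of inspecting segment metadata through Druid SQL.
# The sys.segments system table exposes the segment information that Druid
# tracks in the metadata store; host, port, and datasource are assumptions.
import requests

sql = """
SELECT "datasource", "start", "end", "num_rows", "size"
FROM sys.segments
WHERE "datasource" = 'clickstream'
ORDER BY "start" DESC
LIMIT 10
"""

resp = requests.post(
    "http://localhost:8888/druid/v2/sql",
    json={"query": sql},
    timeout=30,
)
resp.raise_for_status()
for segment in resp.json():
    print(segment)

The same /druid/v2/sql endpoint also serves ordinary analytical queries against the datasource itself.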