IJARSCT ISSN (Online) 2581-9429

International Journal of Advanced Research in Science, Communication and Technology (IJARSCT)

Volume 5, Issue 2, May 2021 Impact Factor: 4.819

Real-Time Data Analytics with Apache Druid

Correa Bosco Hilary
Department of Information Technology (MSc. IT Part 1), Sir Sitaram and Lady Shantabai Patkar College of Arts and Science, Mumbai, India

Abstract: The shift towards real-time data flow has a major impact on the way applications are designed and on the work of data engineers. Dealing with real-time data ingestion brings a paradigm shift and an added layer of challenges compared to traditional integration and processing methods. There are real benefits to leveraging real-time data, but it requires specialized considerations in setting up the ingestion, processing, storage, and serving of that data. It brings specific operational needs and a change in the way data engineers work, and these should be considered before embarking on a real-time journey. In this paper we examine real-time data analytics with Apache Druid. Apache Druid (incubating) is a high-performance analytics data store for event-driven data. Druid's core design combines ideas from OLAP/analytic databases, time series databases, and search systems to create a unified system for operational analytics.

Keywords: Distributed, Real-time, Fault-tolerant, Highly Available, Open Source, Analytics, Column-oriented, OLAP, Apache Druid

I. INTRODUCTION

Streaming data integration is the foundation for streaming analytics. Specific use cases such as IoT device logs, contextual marketing triggers, and dynamic pricing all rely on using a data feed or real-time data. If you cannot source the data in real time, there is very little value to be gained in attempting to tackle these use cases. With data streaming and updating from ever more sources, businesses are looking to quickly translate this data into intelligence to make important decisions, usually in an automated way. Real-time data streams have become more popular due to the Internet of Things (IoT), sensors in everyday devices, and of course the rise of social media. These platforms constantly report changing states; analysing them even a day later can give misleading or already outdated information.

II. GENERIC INFRASTRUCTURE FOR REAL-TIME DATA FLOWS

Besides enabling new use cases, real-time data ingestion brings other benefits, such as a decreased time to land the data and a reduced need to handle dependencies, along with some specific operational aspects:

2.1 Ingestion Layer

Ingesting clickstream data often requires a specific infrastructure component to facilitate it. Snowplow and Divolte are two open-source clickstream collectors. Frameworks such as Apache Flume and Apache NiFi, offering features such as data buffering and backpressure, help integrate data onto message queues/streams. A message bus or stream is the component that serves to transfer the data across the different components of the real-time data ecosystem. Some of the common technologies used are Kafka, Pulsar, Kinesis, Google Pub/Sub, Azure Service Bus, Azure Event Hub, and RabbitMQ, to name just a few. Different processing frameworks exist to simplify computation on data streams; technologies such as Flink and Spark Streaming can significantly help with the more complicated processing of data streams. It is also possible to query streams directly using SQL-like languages: Azure Event Hub supports Azure Stream Analytics, Kafka offers KSQL, and Spark offers Spark Structured Streaming to query multiple types of message streams. A minimal sketch of publishing events onto such a message bus is shown below.
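As an illustration of the ingestion layer, the following is a minimal sketch (not part of the original setup described above) of publishing clickstream events onto a Kafka topic with the Python kafka-python client; the broker address, topic name, and event fields are assumptions.

from json import dumps
from kafka import KafkaProducer  # pip install kafka-python

# Assumed broker address and topic name; adjust for your environment.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: dumps(event).encode("utf-8"),
)

# A single clickstream event; real collectors (Snowplow, Divolte) emit richer payloads.
event = {"timestamp": "2021-05-01T10:15:00Z", "userId": "u-42", "page": "/pricing"}
producer.send("clickstream-events", value=event)
producer.flush()  # block until the event is acknowledged by the broker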

2.2 Processing Layer

Streams typically need to be enriched to provide additional data meant to be used in real time. Enrichment applications can do lookups on additional services or databases, perform first-stage ETL transformations, or add machine learning scores onto the stream. Enrichment of messages typically happens through a producer/consumer or publisher/subscriber type of pattern. These applications can be coded in any language and often do not require a specialized framework for this type of enrichment. Although specialized frameworks and tooling exist, such as Spark Streaming, Flink, or Storm, for most use cases a normal service application will perform adequately without the overhead, complexity, or specific expertise of a streaming computation framework. Stateful enrichment and cleanup of the data might be needed before it can be used downstream:
- Stateful Enrichment: event-based applications might need to consume events enriched with historical data.
- Stateful Cleanup: this can be the case when attempting to consolidate customer data coming from different sources into CRM systems that want a 360-degree view of the customer, for instance to leverage contextual marketing triggers.
- Stateful Deduplication: some message brokers only offer an at-least-once delivery guarantee, creating the need to deduplicate events, as sketched in the example below.
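A minimal sketch of stateful deduplication (not from the original text), assuming events carry a unique eventId field and arrive on a Kafka topic; the in-memory set is a simplification of what would normally be a keyed state store or an expiring cache.

import json
from kafka import KafkaConsumer  # pip install kafka-python

# Assumed topic and broker address.
consumer = KafkaConsumer(
    "clickstream-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

seen_ids = set()  # in production this would be a bounded/expiring store (e.g. Redis with a TTL)

for message in consumer:
    event = message.value
    event_id = event.get("eventId")
    if event_id in seen_ids:
        continue  # duplicate delivery from an at-least-once broker, skip it
    seen_ids.add(event_id)
    # ... enrich, clean, and forward the event downstream here ...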

2.3 Storage Layer

Real-time data brings different challenges in terms of storing and serving the collected and processed data. Data tends to have different access patterns and latency or consistency requirements, impacting how it needs to be stored and served. To properly handle the different needs arising from real-time processing, it is important to have the correct systems to manage each type of workload and access pattern. Depending on how the data is consumed and its volume/velocity, teams might complement the data platform with OLAP, OLTP, HTAP, or search engine systems.

2.4 Serving Layer

There are many ways to integrate real-time data; the most common are through dashboards, query interfaces, APIs, webhooks, firehoses, pub/sub mechanisms, or by directly integrating into OLTP databases. The particular method through which the data is served depends heavily on the nature of the intended use case. For instance, when integrating with a live application, different options are available: offering an API, publishing events through a webhook, a firehose, or a pub/sub mechanism, or alternatively directly integrating into an OLTP database. Analysts, on the other hand, might find a dashboard or a query interface a better fit for purpose. A minimal sketch of serving through an API follows.
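The snippet below is an illustrative sketch of the API serving pattern (not from the original text): a small Flask service that exposes an aggregate to downstream consumers. The fetch_page_views helper is hypothetical and stands in for a query against whichever store holds the processed data.

from flask import Flask, jsonify, request

app = Flask(__name__)

def fetch_page_views(page: str) -> int:
    # Hypothetical helper: in a real setup this would query the analytics store
    # (for example a Druid broker). Hard-coded values for illustration only.
    return {"/pricing": 1024, "/home": 4096}.get(page, 0)

@app.route("/metrics/page-views")
def page_views():
    # e.g. GET /metrics/page-views?page=/pricing
    page = request.args.get("page", "/")
    return jsonify({"page": page, "views": fetch_page_views(page)})

if __name__ == "__main__":
    app.run(port=5000)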

III. DRUID AND REAL-TIME ANALYTICS

Apache Druid is a real-time analytics database designed for rapid analytics on large datasets. It is most often used to power use cases where real-time ingestion, high uptime, and fast query performance are needed. Druid can be used to analyze billions of rows not only in batch but also in real time. It offers many integrations with different technologies, such as cloud storage, S3, Hive, HDFS, DataSketches, and Redis, as well as security integrations. It also follows the principle of an immutable past and an append-only future: past events happen once and never change, so they are immutable, and only appends take place for new events. Apache Druid provides users with fast and deep exploration of large-scale transaction data.

Data is stored in time chunks, and chunks are immutable. Once segments are created, you cannot update them (you can create a new version of a segment, but that implies re-indexing all the data for the period). You can configure how those chunks are created (one per day, one per hour, one per month, ...). You can also define the granularity of the data inside the chunks: if you know that you only need the data per hour, you can set up your chunks to roll up the data automatically. Inside a segment, the data is stored by timestamp, dimensions, and metrics:
1. Timestamp: the event timestamp (rolled up or not).
2. Dimension: a dimension is used to slice or filter the data. Some examples of dimensions are city, state, country, deviceId, campaignId, ...


3. Metric: a metric is a counter or aggregate computed over the data. A few examples of metrics are keyword clicks, page impressions, and response time. Druid supports a variety of aggregations by default, such as first, last, doubleSum, and longMax. There are also custom/experimental aggregations available, such as approximate histograms, DataSketches, or your own: you can easily implement your own aggregations as a plugin to Druid. A hedged sketch of how granularity, dimensions, and metrics come together in an ingestion spec is given below.
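To make the timestamp/dimension/metric split and the roll-up behaviour concrete, the following is a minimal sketch of the dataSchema portion of a Druid ingestion spec, expressed as a Python dict; the datasource name, column names, and granularities are assumptions for illustration, not values from the original text.

# Sketch of a Druid dataSchema (field names follow the Druid ingestion spec format;
# the datasource, columns, and granularities below are illustrative assumptions).
data_schema = {
    "dataSource": "clickstream",
    "timestampSpec": {"column": "timestamp", "format": "iso"},
    "dimensionsSpec": {
        # Dimensions are used to slice and filter.
        "dimensions": ["country", "deviceId", "campaignId", "page"]
    },
    "metricsSpec": [
        # Metrics are pre-aggregated at ingest time when roll-up is enabled.
        {"type": "count", "name": "events"},
        {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
    ],
    "granularitySpec": {
        "segmentGranularity": "HOUR",  # one segment (chunk) per hour
        "queryGranularity": "MINUTE",  # roll events up to one row per minute per dimension combination
        "rollup": True,
    },
}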

Some of Druid's Key Features
1. Columnar storage format.
2. Scalable distributed system.
3. Parallel processing.
4. Real-time or batch ingestion.
5. Self-healing, self-balancing, easy to operate.
6. Cloud-native, fault-tolerant architecture.
7. Indexes for quick filtering.
8. Time-based partitioning.
9. Approximate algorithms.
10. Automatic summarization at ingest time.

You should use Druid if you have the following challenges or use cases:
1. Time series data to store.
2. Data with somewhat high cardinality.
3. The need to query this data fast.
4. The need to support streaming data.
5. Digital marketing (ads data).
6. User analytics and behavior in your products.
7. APM (application performance management).
8. OLAP and business intelligence.
9. IoT and device metrics.

How Does it Work Under the Hood?

Every Druid installation is a cluster that requires multiple components to run. The Druid cluster can run on a single machine (great for development) or fully distributed across a few to hundreds of machines.


External Dependencies Required by Druid
- Metadata storage: an SQL database, such as PostgreSQL or MySQL. It is used to store information about the segments, loading rules, and some task information. Derby can be used for development.
- ZooKeeper: ZooKeeper is required for communication between the different components of the Druid architecture. It is used by certain types of nodes to transmit their state and other information to the others.
- Deep storage: deep storage is used to save all the segment files for long-term storage. Multiple storage backends are supported, such as S3, HDFS, or a local mount; some of them are available natively whilst others require the installation of an extension. A hedged configuration sketch follows this list.
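As an illustration only, the following is a minimal sketch of the relevant entries in Druid's common.runtime.properties for these three dependencies; the hostnames, credentials, and bucket names are placeholders, and the exact property set depends on the Druid version and the extensions loaded.

druid.extensions.loadList=["druid-s3-extensions", "postgresql-metadata-storage"]

# Metadata storage (PostgreSQL in this sketch; Derby is fine for development)
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://metadata-host:5432/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=change-me

# ZooKeeper ensemble used for internal coordination
druid.zk.service.host=zk-host:2181

# Deep storage (S3 here; HDFS or a local mount are alternatives)
druid.storage.type=s3
druid.storage.bucket=example-druid-segments
druid.storage.baseKey=druid/segments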

Different Node Types Running in a Druid Cluster
- Historical: historical nodes load part or all of the segments available in your cluster and are then responsible for responding to any queries against those segments. They do not accept any writes.
- MiddleManager: middle managers are responsible for indexing your data, whether streamed or batch inserted. While a segment is being indexed, they are also able to respond to queries against it until the hand-off to a historical node is done.
- Broker: this is the query interface. It processes queries from clients and dispatches them to the relevant historical and middle manager nodes hosting the relevant segments. Finally, it merges the results before sending them back to the clients.
- Coordinator: coordinators manage the state of the cluster. They notify historical nodes via ZooKeeper when segments need to be loaded, or rebalance the segments across the cluster.
- Overlord: the overlord is responsible for managing all the indexing tasks. It coordinates the middle managers and ensures the publishing of the data.
- Router (optional): a kind of API gateway in front of the overlord, broker, and coordinator. As you can query those directly, I do not really see a strong need for it.
Real-time indexation from the middle managers often runs with Kafka, but other firehoses are available (RabbitMQ, RocketMQ, ...) as extensions.

What Happens When You Run a Query?

The query will contain information about the interval (the period of time), the dimensions, and the metrics required.


1. The query hits the broker. The broker knows where the relevant segments for the requested interval are (for instance, two segments are required from historical node A, two from historical node B, and it also needs the data from the segments currently being indexed in middle manager A).
2. The query is sent to all the required nodes (in our case Historical A, Historical B, and MiddleManager A).
3. Each of those nodes performs the requested aggregation and slices the data according to the query, and sends its result back to the broker.
4. The data is then merged in the broker, depending on the query, and returned to the client.
As the query interface is the same for the broker, the middle manager, and the historical nodes, it is really easy to debug your segments or test a single historical node: the broker just sends the same queries, simply changing the requested interval to get only the data it needs from each of the other nodes. A hedged example of issuing such a query against the broker is given below.
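For illustration, this is a minimal sketch of sending a query to the broker over HTTP with the Python requests library, using both the Druid SQL endpoint and a native timeseries query; the broker address, datasource, and column names are assumptions carried over from the earlier sketches.

import requests

BROKER = "http://localhost:8082"  # assumed broker address; 8082 is the default broker port

# Druid SQL: interval, dimension grouping, and metric expressed in SQL.
sql = {
    "query": """
        SELECT country, SUM(clicks) AS total_clicks
        FROM clickstream
        WHERE __time >= TIMESTAMP '2021-05-01' AND __time < TIMESTAMP '2021-05-02'
        GROUP BY country
    """
}
print(requests.post(f"{BROKER}/druid/v2/sql/", json=sql).json())

# Equivalent native timeseries query: interval, granularity, and aggregations as JSON.
native = {
    "queryType": "timeseries",
    "dataSource": "clickstream",
    "intervals": ["2021-05-01/2021-05-02"],
    "granularity": "hour",
    "aggregations": [{"type": "longSum", "name": "total_clicks", "fieldName": "clicks"}],
}
print(requests.post(f"{BROKER}/druid/v2/", json=native).json())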

IV. DRUID COMPARISON WITH HADOOP

We are talking about two slightly related but very different technologies here. Druid is a real-time analytics system and is a perfect fit for time series and time-based event aggregation. Hadoop is HDFS (a distributed file system) plus MapReduce (a paradigm for executing distributed processes), which together have created an ecosystem for distributed processing and act as the underlying or influencing technology for many other open-source projects. You can set up Druid to use Hadoop, that is, to fire MapReduce jobs to index batch data and to read its indexed data from HDFS (of course it will cache the segments locally on the local disk). If you want to ignore Hadoop, you can do your indexing and loading from a local machine as well, with the penalty of being limited to one machine. A hedged sketch of a Hadoop-based batch indexing task is shown below.
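The following is a hedged sketch (a Python dict, not from the original text) of the shape of a Hadoop batch indexing task that tells Druid to fire MapReduce jobs over files in HDFS; the HDFS path and datasource are placeholders, and the full schema definition is omitted for brevity.

# Sketch of a Druid "index_hadoop" task; field names follow the Druid batch ingestion
# format, but the HDFS path and datasource are illustrative placeholders.
hadoop_index_task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",
            # schema definition (timestamp, dimensions, metrics, granularity) omitted for brevity
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {"type": "static", "paths": "hdfs://namenode:8020/data/clickstream/2021-05-01/*"},
        },
        "tuningConfig": {"type": "hadoop"},
    },
}
# The task would be submitted to the overlord's task endpoint, e.g. POST /druid/indexer/v1/task.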

V. DRUID LIMITATIONS

Even the best databases have limitations. A few that I have come to know with Druid over the last two years:
1. No windowed functionality, such as a rolling average. You will have to implement it yourself within your API, as sketched below.
2. It is not possible to join data. But if you really have this use case, you are probably doing something wrong.
3. You will probably need some kind of API in front of it, just to remap your IDs to user-readable information. As the database is mostly append-only, I would not save the value of something, but only a reference (a campaign id instead of a campaign name, unless your data is also read-only in your source database). There are possible ways to do this directly in Druid, but I have not tried them yet.
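As a hedged illustration of working around the first limitation in your own API layer, the sketch below computes a rolling average in Python over rows shaped like the result of a timeseries query such as the one shown earlier; the window size and field names are assumptions.

from collections import deque

def rolling_average(rows, field="total_clicks", window=3):
    """Compute a simple rolling average over Druid timeseries rows in the API layer.

    `rows` is assumed to look like a timeseries query result, e.g.
    [{"timestamp": "...", "result": {"total_clicks": 10}}, ...].
    """
    recent = deque(maxlen=window)
    averaged = []
    for row in rows:
        recent.append(row["result"][field])
        averaged.append({
            "timestamp": row["timestamp"],
            field: row["result"][field],
            f"{field}_rolling_avg_{window}": sum(recent) / len(recent),
        })
    return averaged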

VI. CONCLUSION

Druid is "quite a beast". Depending on the amount of data you have, you will need to tweak the configuration of the processes (heap, CPU, caching, threads, ...) once you start having more data, and that is where Druid falls short in my opinion: there is no easy tooling available yet to manage these parameters (Imply, https://imply.io/product, offers this with their cloud services). It is tedious to configure and maintain the various servers, and you will probably need to set up your own tooling to automate everything, with Chef, Ansible, Puppet, or Kubernetes.

REFERENCES
[1] https://medium.com/analytics-and-data/real-time-data-pipelines-complexities-considerations-eecad520b70b
[2] FODIL Youssouf Islam and MOKRAN Abdelrrahim, "Real-time Data Analytics with Apache Druid".
[3] Fangjin Yang, Eric Tschetter, Xavier Léauté, Nelson Ray (Metamarkets Group, Inc.), "Druid: A Real-time Analytical Data Store".
[4] "Mining Big Data in Real Time", Yahoo! Research Barcelona.
[5] "Hadoop and Big Data Challenges", American University in the Emirates, College of Computer and Information Technology, Dubai, United Arab Emirates.
[6] https://hadoopquiz.blogspot.com/2016/09/answer-to-hadoop-real-time-questions.html
[7] https://lewisdgavin.medium.com/what-is-real-time-data-37331ff91704
[8] https://druid.apache.org/
