How Fast Can We Insert? An Empirical Performance Evaluation of Apache Kafka
Guenter Hesse, Christoph Matthies, Matthias Uflacker
Hasso Plattner Institute, University of Potsdam, Germany
fi[email protected]

Abstract—Message brokers see widespread adoption in modern IT landscapes, with Apache Kafka being one of the most widely employed platforms. These systems feature well-defined APIs for use and configuration and present flexible solutions for various data storage scenarios. Their ability to scale horizontally enables users to adapt to growing data volumes and changing environments. However, one of the main challenges concerning message brokers is the danger of them becoming a bottleneck within an IT architecture. To prevent this, knowledge about the amount of data a message broker using a specific configuration can handle needs to be available. In this paper, we propose a monitoring architecture for message brokers and similar Java Virtual Machine-based systems. We present a comprehensive performance analysis of the popular Apache Kafka platform using our approach. As part of the benchmark, we study selected data ingestion scenarios with respect to their maximum data ingestion rates. The results show that we can achieve an ingestion rate of about 420,000 messages/second on the used commodity hardware and with the developed data sender tool.

Index Terms—performance, benchmarking, big data, Apache Kafka

I. INTRODUCTION

In the current business landscape, with an ever-increasing growth in data and popularity of cloud-based applications, horizontal scalability is becoming an increasingly common and important requirement. Message brokers play a central role in modern IT systems, as they satisfy this requirement and thus allow the IT landscape to adapt to data sources that grow both in volume and velocity. Moreover, they can be used to decouple disparate data sources from the applications using this data. Usage scenarios where message brokers are employed are manifold and range from machine learning [1] to stream processing architectures [2], [3], [4] and general-purpose data processing [5].

In the context of a complex IT architecture, the degree to which a system aligns with its application scenarios and the functional and non-functional requirements derived from them is key [6]. If non-functional requirements related to performance are not satisfied, the system might become a bottleneck. This situation does not directly imply that the system itself is inadequate for the observed use case, but might indicate a suboptimal configuration. Therefore, it is crucial to be able to evaluate the capabilities of a system in certain environments and with distinct configurations. Knowledge about such study results is a prerequisite for making informed decisions about whether a system is suitable for the existing use cases. Additionally, it is also crucial for finding or fine-tuning appropriate system configurations.

The contributions of this research are as follows:
• We propose a user-centered and extensible monitoring framework, which includes tooling for analyzing any JVM-based system.
• We present an analysis that highlights the capabilities of Apache Kafka regarding the maximum achievable rate of incoming records per time unit.
• We enable reproducibility of the presented results by making all needed artifacts available online¹.

The rest of the paper is structured as follows: Section II gives a brief introduction to Apache Kafka. Section III presents the benchmark setup and the developed data sender tool. Subsequently, we describe the results of the ingestion rate analyses. Section V introduces related work and Section VI elaborates on the lessons learned. The last section concludes the study and outlines areas of future work.

¹https://github.com/guenter-hesse/KafkaAnalysisTools

II. APACHE KAFKA

Apache Kafka is a distributed open-source message broker or messaging system originally developed at LinkedIn in 2010 [7]. The core of this publish-subscribe system is a distributed commit log, although its scope has grown through extensions. An example is Kafka Streams [8], a client library for developing stream processing applications.

The high-level architecture of an exemplary Apache Kafka cluster is visualized in Figure 1. A cluster consists of multiple brokers, which are numbered and store data assigned to topics. Data producers send data to a certain topic stored in the cluster. Consumers subscribe to a topic and are forwarded new values sent to this topic as soon as they arrive.

[Fig. 1. Apache Kafka cluster architecture (based on [9]): Producers A and B send data to a cluster of three brokers holding the partitions of topic1 and topic2; Consumers A, B, and C subscribe to these topics]

Topics are divided into partitions. The number of topic partitions can be configured at the time of topic creation. Partitions of a single topic can be distributed across different brokers of a cluster. However, message order across partitions is not guaranteed by Apache Kafka [9], [10].

Next to the number of partitions, it is possible to define a replication factor for each topic, one being the minimum. This allows preventing data loss in the case of a single broker failure. In the context of replication, Apache Kafka defines leaders and followers for each partition. The leader handles all reads and writes for the corresponding topic partition, whereas followers copy or replicate the inserted data. In Figure 1, the leader partitions are shown in bold type. The first topic, topic1, has two partitions and a replication factor of one, while topic2 has only one partition and a replication factor of two [10].
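To make the partitioning and replication settings concrete, the following minimal Java sketch creates a topic resembling topic2 from Figure 1 using Kafka's AdminClient API. It is an illustration rather than part of the paper's tooling; the broker addresses and topic name are assumed values.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicCreationExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker addresses; replace with the cluster's actual brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // One partition, replicated to two brokers (cf. topic2 in Figure 1):
            // the data survives a single broker failure.
            NewTopic topic2 = new NewTopic("topic2", 1, (short) 2);
            admin.createTopics(List.of(topic2)).all().get();
        }
    }
}
```

Both the partition count and the replication factor are fixed here at creation time, matching the constraint described above that partitioning is configured when a topic is created.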
Figure 2 shows the structure of an Apache Kafka topic, specifically of a topic with two partitions. Each of these partitions is an ordered and immutable record sequence to which new values are appended. A sequential number is assigned to each topic record within a partition, referred to as an offset. Apache Kafka itself provides the internal topic __consumer_offsets for storing offsets. However, consumers must manage their own offset. They can commit their current offset either automatically in certain intervals or manually; the latter can be done either synchronously or asynchronously. When polling data, a consumer needs to pass its offset to the cluster. Apache Kafka returns all messages with a greater offset, i.e., all new messages that have not already been sent to this consumer. As the consumer has control over its offset, it can also decide to start from the beginning and reread messages [10].

[Fig. 2. Apache Kafka topic structure (based on [11]): two partitions, each an ordered sequence of records numbered by offset, with writes appended at the newest end]
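As an illustration of this offset handling, the sketch below disables interval-based auto-commit and instead commits offsets manually and synchronously after processing each polled batch. Broker address, topic name, and group id are assumed for the example and are not taken from the benchmark setup.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");         // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Disable automatic offset commits in favor of manual commits.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("topic1")); // assumed topic name
            while (true) {
                // poll() advances the consumer's position (offset) within each partition.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Synchronous commit; commitAsync() would be the non-blocking variant.
                consumer.commitSync();
            }
        }
    }
}
```

Seeking back to an earlier offset, e.g., via seekToBeginning(), is what allows a consumer to reread messages as described above.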
Later leader partitions are shown in bold type. The first topic, topic1 versions of the image contain a different OS, specifically has two partitions and a replication factor of one, while topic2 Alpine Linux [13], which no longer supports this feature. has only one partition and a replication factor of two [10]. Figure 2 shows the structure of an Apache Kafka topic, Grafana specifically of a topic with two partitions. Each of these Data Visualization (Dashboard) partitions is an ordered and immutable record sequence where new values are appended. A sequential number is assigned to each topic record within a partition, referred to as an offset. Graphite Apache Kafka itself provides the topic consumer offsets for Monitoring Tool w/ Storage Functionality storing the offsets. However, consumers must manage their offset. They can commit their current offset either automati- cally in certain intervals or manually. The latter can be done jmxtrans collectd either synchronously or asynchronously. When polling data, JVM Metrics Collection System Metrics Collection a consumer needs to pass the offset to the cluster. Apache Kafka returns all messages with a greater offset, i.e., all new Apache Kafka messages that have not already been sent to this consumer. As Message Broker / System Under Test the consumer has control over its offset, it can also decide to start from the beginning and to reread messages [10]. Fig. 3. Monitoring architecture in Fundamental Modeling Concepts (FMC) [14] Partition 0 0 1 2 3 4 5 6 7 8 Grafana fetches the data to display from Graphite [15], an open-source monitoring tool. It consists of three components: writes Carbon, Whisper, and Graphite-web. Carbon is a service that Partition 1 0 1 2 3 4 5 retrieves time-series data, which is stored in Whisper, a persis- tence library. Graphite-web includes an interface for designing old new dashboards. However, these dashboards are not as appealing and functionally comprehensive as the corresponding compo- Fig. 2. Apache Kafka topic structure (based on [11]) nents of Grafana, which is why Grafana is employed. For the installation of Graphite, the official docker image in version Furthermore, Apache Kafka can be configured to use the Lo- 1.1.4 is used, again for time zone configuration reasons. gAppendTime feature, which induces Apache Kafka to assign Graphite receives its input from two sources: collectd [16] a timestamp to each message once it is appended to the log. and jmxtrans [17]. The former is a daemon collecting system The existing alternative, which represents the default value, and application performance metrics, that runs on the broker’s is CreateTime. In this setting, the timestamp created by the machines in the described setup. It offers plugins for gathering Apache Kafka producer when creating the message, i.e., before OS-level measurements, such as memory usage, system load, sending it, is stored along with the message. For transmitting and received or transmitted packages over the network. Jmxtrans, the other data source for Graphite, is a tool for TABLE II collecting JVM runtime metrics.