Introduction

Apache Pulsar is a cloud-native, distributed messaging and streaming platform that manages hundreds of billions of events per day. Pulsar was originally developed at Yahoo! as the unified messaging platform connecting critical Yahoo applications, such as Yahoo Finance and Yahoo Mail, to data.

Many people know Pulsar as a publish-subscribe (pub/sub) messaging technology, but it has evolved extensively since its inception. The Pulsar project has grown and evolved to meet the needs of real-time event streaming use-cases, including data pipelines, microservices, and stream processing. Its cloud-native architecture and built-in multi-tenancy differentiate it from its predecessors and uniquely position it as an enterprise-ready, event streaming platform.

While it has remained under the radar to some, Pulsar has run in production for years at major technology companies, such as Yahoo! JAPAN, and was designated an Apache Top-Level Project in 2018. Additionally, the Pulsar community has experienced tremendous growth over the last few years, including a fivefold increase in the number of contributors in just the last 12 months.

To better understand the growth in adoption and how organizations are leveraging the project today, the Apache Pulsar Project Management Committee (PMC) sent a survey to Pulsar users. The survey was administered between November 2019 and January 2020, and 165 users responded. Of the survey respondents, 88% held technical roles as architects, data scientists, developers, engineers, and DevOps engineers, and the most heavily represented industries included hardware, internet, finance, and e-commerce.

In this paper, we’ll look at how Pulsar is driving value for businesses, from enabling new, innovative offerings and improving customer experiences to reducing overhead complexity and costs. We’ll look at survey results across three core categories, noted below, and the fourth section of the report will include details on survey respondent demographics.

I. Why Organizations are Adopting Pulsar

II. How Organizations are Using Pulsar Today

III. The Current State of Pulsar Adoption

I. Why Organizations Are Adopting Pulsar

As businesses increasingly look to data-driven strategies to streamline operations, develop innovative offerings, and improve customer experiences, they are driven to seek new processes and infrastructure in order to support their efforts. The ability to deliver and execute on event-based strategies is often dependent on the availability of real-time information, leading many organizations to move from batch-based models to event-driven models. In this section we’ll look at specific factors driving organizations to adopt streaming solutions, generally, and Pulsar, specifically.

The Benefits of Streaming Platforms and Pulsar

To better understand the motivations driving the adoption of streaming, we asked survey respondents, “What value do Pulsar and streaming platforms bring to your organization?” The graph in Figure 1 shows what percentage of users voted for each category.

What value do Pulsar and streaming platforms bring to your organization?

Figure 1.

The top answers were (1) increased agility, (2) unlocks new use cases for the business, (3) reduced costs, and (4) improved customer experience. Here, we’ll discuss how streaming is providing value to companies across these areas.

#1) Increased Agility: Streaming platforms enable companies to respond quickly to market changes, competitive threats, and emerging opportunities.

#2) Unlocks New Use Cases for the Business: As the paradigm shifts from batch to streaming to unified batch and streaming, Pulsar offers potential for new use cases and provides a competitive advantage in the marketplace.

#3) Reduced Costs: Pulsar provides a unified messaging model that supports both online core business services and analytical workloads. It can be used as a unified event streaming platform to consolidate and reduce the cost of maintaining multiple messaging technologies.

#4) Improved Customer Experience: A streaming platform helps accelerate computation and processing. By providing companies the ability to react quickly, streaming platforms enable companies to deliver more value and better experiences to their customers.

Takeaway: As evidenced above, organizations are adopting streaming platforms to expand their offerings with real-time, data-enabled solutions. This approach enables companies to be more agile in the market, to reduce costs, and to improve customer experiences.

Next, we’ll look at how Pulsar stands out from other event-streaming platforms.

Pulsar Differentiators

To better understand why users are adopting Pulsar over older, more established messaging platforms, respondents were asked to share their top three highlights of the project. Respondents put “architecture design,” “scalability,” and “reliability” at the top of the list. Let’s talk in more detail about each of these.

What are the top 3 highlights for Pulsar?

Figure 2.

#1 Architecture Design

From databases to messaging systems, most distributed data processing technologies have taken the approach of co-locating data processing and data storage on the same cluster nodes or instances. While this approach provided some benefits back when network bandwidth was more limited and data transfer was more expensive, it limits the scalability, resiliency, and operations of the platform.

Pulsar’s architecture takes a different approach, one that’s starting to be seen in a number of “cloud-native” solutions and that is made possible in part by the significant improvements in network bandwidth that are commonplace today. The Pulsar approach separates data serving and data storage into different layers: data serving is handled by stateless “broker” nodes, while data storage is handled by Apache BookKeeper, a scalable, strongly consistent, and durable log storage system (see footnote 1). Figure 3 provides an overview of Apache Pulsar’s multi-layer and segment-centric architecture.

Apache Pulsar’s multi-layer and segment-centric architecture

Figure 3.

The architectural differences in Pulsar also extend to how Pulsar stores data. Pulsar breaks topic partitions into segments and then distributes the segments across the storage nodes in Apache BookKeeper to get better performance, scalability, and availability [1]. Compared to messaging systems that use a monolithic architecture, Pulsar’s multi-layered and segment-centric design is container-friendly. This architectural design enables Pulsar to provide a cloud-native, event-streaming platform with better performance, scalability, and resiliency than its competitors.
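To make the segment-centric idea more concrete, the following is a minimal sketch of how one topic partition's segments might be spread across storage nodes. It is a toy model: the `place_segments` helper, the bookie names, and the round-robin rule are all assumptions made for illustration, and BookKeeper's actual placement policy is more sophisticated.

```python
from collections import Counter


def place_segments(num_segments, bookies, replicas=2):
    """Toy placement: store each segment on `replicas` consecutive bookies.

    This round-robin rule is an assumption made for illustration; it is not
    BookKeeper's real placement policy.
    """
    if not 1 <= replicas <= len(bookies):
        raise ValueError("replicas must be between 1 and the number of bookies")
    return {
        segment: [bookies[(segment + r) % len(bookies)] for r in range(replicas)]
        for segment in range(num_segments)
    }


# One topic partition split into six segments over three hypothetical bookies.
placement = place_segments(6, ["bookie-1", "bookie-2", "bookie-3"])

# Count how many segment copies each bookie holds.
load = Counter(bookie for copies in placement.values() for bookie in copies)
```

In this sketch no single storage node holds the whole partition: each of the three bookies ends up with four of the twelve segment copies, which is the property the segment-centric design relies on.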

Below, Pierre Zemb, Tech Lead at OVHCloud, shares how his team used Pulsar’s architecture to build the foundation for their new messaging product and some key benefits they’ve gained from Pulsar:

1 Apache BookKeeper is a large-scale distributed log storage system that is able to store trillions of events per day. It has been adopted by Yahoo, ByteDance (TikTok), and other major technology companies.

“Internally, we had been running Apache Kafka for years, and despite all the skills obtained from operating multiple clusters with millions of messages per second, we decided to shift and build the foundation of our new messaging product based on Apache Pulsar. The overall architecture with Apache Bookkeeper greatly facilitates multi-tenancy, scalability, and operations, and provides new features like Pulsar SQL with Presto.”

- Pierre Zemb, Tech Lead at OVHCloud

#2 Scalability

With its multi-layered architectural foundation, Pulsar provides innovative options for scaling infrastructure. Because message serving (brokers) and storage (bookies) are separated into two layers, a topic partition can be moved from one broker to another almost instantly and without the need for data rebalancing. Unlike other messaging platforms, Pulsar doesn’t have to recopy old data from existing storage nodes to new storage nodes. This means that Pulsar can offer instant scalability without partition rebalancing.

This architecture also enables each layer to scale independently. When you need to support more consumers or producers, you can simply add more brokers. When you need more long-term storage for messages, you can simply add more bookies. This provides a much better capacity planning model that can be more cost efficient and can provide effectively unlimited, elastic capacity.
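The separation described above can be sketched with a small in-memory model. The `ToyCluster` class, its method names, and the broker, bookie, and topic names are hypothetical and exist only for this illustration. The point it tries to show is that moving a topic partition between brokers, or adding a bookie, changes metadata only and leaves stored segments where they are.

```python
class ToyCluster:
    """Toy model of separate serving (broker) and storage (bookie) layers."""

    def __init__(self, brokers, bookies):
        self.brokers = list(brokers)
        self.bookies = list(bookies)
        self.owner = {}     # topic partition -> broker currently serving it
        self.segments = {}  # (topic partition, segment id) -> bookie storing it

    def create_topic(self, topic, num_segments):
        # The first broker serves the topic; segments are spread over bookies.
        self.owner[topic] = self.brokers[0]
        for segment in range(num_segments):
            bookie = self.bookies[segment % len(self.bookies)]
            self.segments[(topic, segment)] = bookie

    def add_broker(self, broker):
        self.brokers.append(broker)

    def add_bookie(self, bookie):
        # New storage capacity; existing segments are not moved.
        self.bookies.append(bookie)

    def move_topic(self, topic, new_broker):
        # Only the ownership pointer changes; no stored data is copied.
        if new_broker not in self.brokers:
            raise ValueError(f"unknown broker: {new_broker}")
        self.owner[topic] = new_broker


cluster = ToyCluster(["broker-1"], ["bookie-1", "bookie-2"])
cluster.create_topic("orders-partition-0", num_segments=4)
segments_before = dict(cluster.segments)

# Scale the serving layer and the storage layer independently.
cluster.add_broker("broker-2")
cluster.move_topic("orders-partition-0", "broker-2")
cluster.add_bookie("bookie-3")
```

After both scaling steps, the topic is served by the new broker while `cluster.segments` is identical to `segments_before`, which mirrors the claim that no old data has to be recopied.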

#3 Resiliency

Pulsar’s architecture design, notably the separation of serving from storage, contributes to both its scalability and resiliency. By leveraging the ability of elastic environments, such as cloud and containers, to automatically scale resources up and down, this architecture can dynamically adapt to traffic spikes, which reduces the chances of the system slowing down or becoming overloaded. It also improves system availability and manageability by significantly reducing the complexity of cluster expansions and upgrades. This characteristic is crucial for operations such as cluster expansion and for reacting quickly to broker and bookie failures.

Takeaway: Pulsar’s superior architecture design, scalability, and resiliency are the key drivers of its adoption.

II. How Organizations Are Using Pulsar

In the previous section, we discovered that organizations are moving to streaming in order to unlock new business use cases, to improve their customer offering, and to achieve a competitive advantage in the market. Looking at Pulsar, specifically, we learned that its unique architecture design, scalability, and resiliency were the top drivers for adoption in the market. The remainder of this paper will focus on Pulsar-related business applications and its current state of adoption.

In this section, we will look at the following:

1. Pulsar’s Most-Used Features

2. Pulsar’s Top Use-Cases

3. Pulsar And The Messaging Ecosystem

4. How Organizations Deploy Pulsar

Pulsar’s Most-Used Features

To better understand how users are leveraging Pulsar, respondents were asked, “What are the top 3 features you frequently use in Pulsar?” Figure 4, below, shows “Pub/Sub” was at the top of the list, with more than 72% of respondents noting that they used this Pulsar feature. As noted earlier, Pulsar’s origins were in the pub/sub domain, and the survey results further validate its dominance in this area.

What are the top 3 features you frequently use in Pulsar?

Figure 4.

Perhaps a more significant insight from this figure is the high adoption rate of several of Pulsar’s lesser-known functions, notably, multi-tenancy, geo-replication, functions, connectors, and tiered storage. Let’s walk through some of the key highlights of each of these Pulsar features.

Multi-Tenancy

Multi-tenancy refers to a single instance of software that is able to serve multiple independent applications, or tenants, in a shared environment. With a hierarchical topic namespace, Pulsar enables users to maintain thousands, or even millions, of topics on a single cluster. This architecture simplifies infrastructure and management, thus reducing operational costs and overhead. It can also be a competitive differentiator; as one respondent shared, it enables them to offer “topic-as-a-service” to their clients.
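The hierarchical topic namespace can be illustrated with Pulsar's topic naming scheme, in which a fully qualified topic name has the form `persistent://tenant/namespace/topic` (or `non-persistent://...`). The sketch below parses and builds such names. The `TopicName` type, the `parse_topic` helper, and the tenant names are hypothetical and used only for illustration; they are not part of any Pulsar client library.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    """The parts of a fully qualified Pulsar topic name."""

    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str

    def __str__(self):
        return f"{self.domain}://{self.tenant}/{self.namespace}/{self.topic}"


def parse_topic(name: str) -> TopicName:
    """Split 'domain://tenant/namespace/topic' into its four parts."""
    domain, separator, rest = name.partition("://")
    if not separator or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"not a fully qualified topic name: {name!r}")
    parts = rest.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in: {name!r}")
    return TopicName(domain, *parts)


# Two hypothetical tenants can use the same local topic name without clashing.
acme = parse_topic("persistent://acme/billing/invoices")
globex = parse_topic("persistent://globex/billing/invoices")
```

Because the tenant and namespace are part of every topic's identity, policies such as quotas or permissions can be attached at those levels rather than per topic, which is one way a single cluster can be shared between independent teams.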

The quotation below from Qiang Fei, Tech Lead at Tencent, highlights how Pulsar’s multi-tenancy and architectural design have enabled them to reduce operational overhead and to perform with high consistency and reliability at scale:

“Pulsar provides us with a highly consistent and highly reliable distributed message queue that fits well in our financial use cases. Multi-tenant and storage separation architecture design greatly reduces our operational and maintenance overhead. We have used Pulsar on a very large scale in our organization and we are impressed that Pulsar is able to provide high consistency while supporting high concurrent client connections.”

- Qiang Fei, Tech Lead at Tencent

Geo-Replication

Geo-replication is a typical mechanism used to provide disaster recovery. Generally, any database or message bus solution replicates data between two data centers. Pulsar supports multi-datacenter replication (n-mesh) with both asynchronous and synchronous replication. Geo-replication is built into Pulsar, which means you don’t need to set up additional tools to replicate data between clusters.

Pulsar Functions + Connectors

Pulsar Functions bring serverless computation to event streaming by providing an easy-to-use interface. Developers can choose their preferred language to write functions to process events in real-time. With Pulsar Functions, you don’t need to set up additional full-fledged computation engines for lightweight computation logic such as extract, transform, load (ETL), filtering, and routing. This significantly reduces the cost of running and managing lightweight computation logic for enterprises.
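As an illustration of the kind of lightweight logic described above, here is a sketch of a filter-and-transform step written in the style of a native Python Pulsar Function, which is a module exposing a `process(input)` function. The event fields (`user`, `amount`) and the output shape are assumptions made for this example, and the comment about dropped events reflects our understanding of the runtime rather than something the survey covers.

```python
import json


def process(input):
    """Filter and normalize one JSON-encoded payment event.

    Returns a JSON string for valid events and None for events to drop.
    (Assumption: when deployed as a Pulsar Function, a None result means
    nothing is published to the output topic.)
    """
    try:
        event = json.loads(input)
    except (TypeError, ValueError):
        return None  # filtering: malformed payloads are dropped

    if not isinstance(event, dict):
        return None

    amount = event.get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        return None  # filtering: only positive amounts pass through

    # Transformation: normalize the user name and convert to integer cents.
    cleaned = {
        "user": str(event.get("user", "")).lower(),
        "amount_cents": round(amount * 100),
    }
    return json.dumps(cleaned, sort_keys=True)
```

Because the function is plain Python, it can be exercised locally before deployment, for example `process('{"user": "Alice", "amount": 12.5}')` returns `'{"amount_cents": 1250, "user": "alice"}'`.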

Pulsar IO is a lightweight connector framework that is built on top of Pulsar Functions. It provides the ability to ingest data into and consume data out of Pulsar without writing any code. Pulsar offers a comprehensive list of built-in connectors integrated with existing Big Data ecosystems. This set of connectors simplifies integration for enterprises that are bringing Pulsar into their existing infrastructure.

Tiered Storage

To deliver an event streaming service, platforms must manage large numbers of messages and data in real-time, and this requires keeping large amounts of data on the platform, or readily accessible. As the amount of data increases, it becomes significantly more expensive to store, manage, and retrieve, so operators and application developers look to external stores like S3 for long-term storage. Pulsar leverages a unique tiered storage solution that addresses some of these key challenges faced by other distributed log systems.

This tiered storage solution extends the storage capabilities of Pulsar by offloading the majority of the data from Apache BookKeeper to external remote storage, which provides a cheaper form of storage that readily scales with the volume of data. Pulsar is able to retain both historic and real-time data and provides a unified view as infinite event streams [2], which can be easily reprocessed or backloaded into new systems. Companies can integrate Pulsar with a unified data processing engine (such as Apache Flink or Apache Spark) to unlock many new use cases stemming from infinite data retention.
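The offloading idea can be sketched with a toy two-tier log. The `ToyTieredLog` class and its count-based offload rule are assumptions made for illustration; a real deployment would be driven by configured offload policies rather than by this rule. What the sketch tries to show is that older segments move to a cheaper tier while readers still see one continuous stream.

```python
class ToyTieredLog:
    """Toy log that keeps recent segments 'hot' and offloads older ones."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity  # max segments kept in the hot tier
        self.hot = {}    # segment id -> messages (stand-in for BookKeeper)
        self.cold = {}   # segment id -> messages (stand-in for object storage)
        self.next_id = 0

    def append_segment(self, messages):
        self.hot[self.next_id] = list(messages)
        self.next_id += 1
        self._offload()

    def _offload(self):
        # Move the oldest segments to the cold tier once the hot tier is full.
        while len(self.hot) > self.hot_capacity:
            oldest = min(self.hot)
            self.cold[oldest] = self.hot.pop(oldest)

    def read_all(self):
        # Unified view: readers do not need to know which tier holds a segment.
        for segment in range(self.next_id):
            tier = self.cold if segment in self.cold else self.hot
            yield from tier[segment]


log = ToyTieredLog(hot_capacity=2)
for batch in (["a", "b"], ["c"], ["d", "e"], ["f"]):
    log.append_segment(batch)

history = list(log.read_all())
```

Here segments 0 and 1 end up in the cold tier and segments 2 and 3 stay hot, yet `history` still contains every message in its original order.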

Takeaway: A review of Pulsar’s most-used features highlights its powerful capabilities and highly differentiated offering. From its built-in multi-tenancy, which reduces architectural complexity and enables organizations to scale, to its multi-data center replication, which allows Pulsar to handle datacenter failures and to produce and consume messages on a topic from any datacenter, we see how Pulsar has evolved into a robust and differentiated streaming platform.

Next, we’ll look at how organizations are using Pulsar’s streaming capabilities.

Pulsar’s Top Use-Cases

To understand how Pulsar is being leveraged for streaming applications, we asked survey respondents, “What do you use Pulsar’s stream processing capabilities for?” The top three responses were (1) asynchronous applications, (2) building core business applications, and (3) ETL (extraction, transformation, and loading).

What do you use Pulsar’s stream processing capabilities for?

Figure 5.

In this section, we’ll look at how organizations are leveraging Pulsar’s robust technology to build their core messaging and streaming applications.

#1 Asynchronous Applications

Asynchronous applications help speed up software development through organizational alignment and independent deployability, as well as polyglotism, or the use of multiple languages. As the world moves toward microservices, asynchronous applications are gaining wider adoption.

Pulsar’s unified messaging (queuing and streaming) model provides a comprehensive mechanism for microservices to choose the right messaging method to connect with each other and to pass and process events in an asynchronous way. In fact, more than 50% of the respondents noted that they leverage Pulsar to build asynchronous applications.

#2 Building Core Business Applications

Core business applications have complex requirements that include strong consistency and durability, and also the ability to scale as the business grows. While existing platforms may be able to deliver on one or two of these requirements, Pulsar is able to deliver on all of them.

Pulsar provides strong consistency and durability, in addition to the ability to scale to hundreds, or even thousands, of nodes per cluster. It also offers scale-out consumption via Shared and Key_Shared subscriptions, which allow consumption to scale beyond the number of partitions. Finally, it offers selective acknowledgement, which lets consumers acknowledge individual events once they are processed. This prevents duplicates from being introduced into the core business logic.
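The difference between the two scale-out subscription types can be sketched with a small in-memory dispatcher. This is a toy model, not Pulsar's dispatcher: the function names, consumer names, and the use of CRC32 to map keys to consumers are assumptions chosen for illustration, and Pulsar's actual key-to-consumer assignment works differently.

```python
import zlib
from itertools import cycle


def dispatch_shared(messages, consumers):
    """Shared-style dispatch: spread messages round-robin over consumers."""
    assignments = {consumer: [] for consumer in consumers}
    for message, consumer in zip(messages, cycle(consumers)):
        assignments[consumer].append(message)
    return assignments


def dispatch_key_shared(messages, consumers):
    """Key_Shared-style dispatch: the same key always goes to the same consumer.

    CRC32 is a stand-in hash used only for this sketch.
    """
    assignments = {consumer: [] for consumer in consumers}
    for key, value in messages:
        index = zlib.crc32(key.encode("utf-8")) % len(consumers)
        assignments[consumers[index]].append((key, value))
    return assignments


# Six keyed messages from a single partition, consumed by three consumers.
messages = [
    ("user-1", "m1"), ("user-2", "m2"), ("user-1", "m3"),
    ("user-3", "m4"), ("user-2", "m5"), ("user-1", "m6"),
]
consumers = ["consumer-a", "consumer-b", "consumer-c"]

shared = dispatch_shared(messages, consumers)
key_shared = dispatch_key_shared(messages, consumers)
```

In both cases three consumers share the work of one partition, which is the "beyond the number of partitions" property. The Key_Shared-style variant additionally keeps every message for a given key on one consumer and in its original order, at the cost of a potentially less even spread.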

#3 ETL

ETL workloads usually handle larger volumes of data compared to core online business applications, but they have fewer consistency and durability requirements and can often tolerate losing a small percentage of data. Companies look to Pulsar for ETL due to its scale-out and multi-layered architecture design, which improves scalability and resiliency and simplifies operations. Pulsar’s ability to scale without the need to rebalance partitions is a unique differentiator that is attractive to operators.

Takeaway: Understanding the most-used Pulsar features and the most popular applications being built on Pulsar today helps to highlight how it is being used to deliver scalable, reliable, real-time streaming solutions. It also helps us understand the high growth rate in Pulsar adoption.

Next, we’ll look at how survey respondents view Pulsar in the competitive landscape.

Pulsar and the Messaging Ecosystem

To better understand Pulsar’s role within the larger messaging ecosystem, we asked survey respondents, “Which do you evaluate as Pulsar alternatives when choosing/replacing a message queue?” Survey respondents were able to select as many options as were applicable, and the top-ranked responses were as follows:

1. Kafka: 87%

2. RabbitMQ: 37%

3. RocketMQ: 12%

4. ActiveMQ: 7%

Which do you evaluate as Pulsar alternatives when choosing/replacing a message queue?

Figure 6.

Now, let’s look at how Pulsar and the top-noted alternatives, Kafka and RabbitMQ, compare across asynchronous applications, building core business applications, and ETL.

Kafka is commonly used in building data pipelines for analytical applications and services, such as log collection, engagement, and impression analysis. RabbitMQ is most commonly used in building core online business applications, such as payments, billing, and transactions. Both RabbitMQ and Kafka are leveraged to build asynchronous applications.

While Kafka and RabbitMQ are both widely adopted technologies, they each come with challenges. For example, Kafka’s challenges include availability and partition rebalancing, while RabbitMQ has challenges with scalability. Additionally, each is limited to its own specialty, be it point-to-point communications or event streaming.

Pulsar is increasingly being adopted because of its scalability, resiliency, and unique ability to provide unified messaging. While it is widely regarded as a better alternative to many existing messaging solutions, it is also increasingly being considered a replacement for more than one messaging solution. In fact, 42% of respondents said they consider Pulsar to be an alternative to two or more messaging systems.

Pulsar is able to replace multiple systems because it provides a unified messaging model that combines both streaming and queuing capabilities for online core business services and offline analytical applications and services. It enables organizations to create a unified messaging platform to improve operational efficiencies, remove redundant systems, and reduce both hardware and software costs.

Takeaway: Understanding the top applications being built with Pulsar and its ability to replace competitive messaging solutions highlights some of its most valuable capabilities and differentiators. In the final question in this section, we’ll look at how organizations are deploying Pulsar from an operational perspective.

How Organizations Deploy Pulsar

Here, we asked respondents, “In which environments are you using Pulsar?” The top two responses were “on-premises” and “public cloud (using Kubernetes).”

In which environments are you using Pulsar?

Figure 7.

Pulsar’s cloud-agnostic and container-friendly architecture makes it an ideal technology for hosting cloud-native event streaming platforms, as it has built-in support for deployments in different environments. It is easily deployed not only on standard on-premises machines but also on Kubernetes Engine and custom clusters.

Survey respondents also note that Pulsar is easy to deploy on premises or on cloud virtual machines using tools like Ansible and Terraform. 38% of the survey respondents deploy Pulsar on Kubernetes because Pulsar can take advantage of cloud-native technologies to auto-scale each layer independently, scale computing capability, and provide cloud portability. In addition, Pulsar Functions supports Kubernetes/Docker runtime, which enables you to plug into the wider Kubernetes ecosystem.

Takeaway: Pulsar’s design makes it easy to run in cloud-native environments, such as Kubernetes, as well as in bare metal environments.

III. The Current State of Pulsar Adoption

So far, we’ve looked at why and how organizations leverage Pulsar to drive value for their business. In this section we’ll look at the following:

1. Key Statistics For Pulsar Project

2. Pulsar’s Global Adoption

3. Pulsar Adoption By Stage

4. Increased Adoption In 2020

Key Statistics For Pulsar Project

To start, we’ll look at key GitHub statistics for Pulsar. As noted earlier, Pulsar became an Apache Top-Level Project in 2018. The statistics in Figure 8 show the continued maturity of the project and the growth in the Pulsar community. In fact, since December 2018, the number of stars has almost doubled and the number of contributors has increased fivefold.

GitHub Statistics of Pulsar Repository (see footnote 2)

Figure 8.

2 The GitHub statistics for the Pulsar repository were retrieved on 02/20/2020.

Pulsar’s Global Adoption

Next we look at adoption by country. As Figure 9 shows, the adoption of Pulsar is roughly equal across North America, Asia, and Europe.

Pulsar Website Visitors by Country (see footnote 3)

Figure 9.

3 The website visitor metric was retrieved from Google Analytics data of https://pulsar.apache.org on 02/20/2020.

Figure 10 shows the major adopters for each region.

Major Pulsar Adopters by Region (see footnote 4)

Figure 10.

Pulsar Adoption By Stage

To find out the current state of Pulsar adoption, we asked respondents, “What stage best describes Pulsar in your organization?” More than 50% of those surveyed were already in production or in the process of building a proof of concept (PoC).

4 The adopter statistics were retrieved from http://pulsar.apache.org/en/powered-by/ on 02/20/2020.

What stage best describes Pulsar in your organization?

Figure 11.

Increased Adoption In 2020

To gain insights on future adoption plans from Pulsar users, we asked survey respondents, “Will your organization deploy more applications or systems using Pulsar in 2020?” As Figure 12 shows, nearly two-thirds of the respondents answered “yes” and nearly one-third reported that additional Pulsar deployments were under consideration.

Will your organization deploy more applications or systems using Pulsar in 2020?

Figure 12.

Figure 13 illustrates the respondents’ answers to the question, “Will your organization increase the budget for Pulsar in 2020?” As you can see, most have either already increased their Pulsar budget or are considering an increase. No current Pulsar users anticipate a budget decrease this year.

Will your organization increase the budget for Pulsar in 2020?

Figure 13.

Takeaway: This section reveals high satisfaction from current Pulsar users, with a majority of respondents committing to deploy more applications on Pulsar in 2020, and more than 50% of respondents either increasing budget or considering increasing budget for Pulsar in 2020.

Conclusion

The 2020 Pulsar User Survey Report reveals the wide adoption of Pulsar, from major technology companies to small and midsize companies, across many industries. Furthermore, it highlights the maturity of the project in production environments at major companies globally, in addition to the growth, both in terms of size and engagement, of the Pulsar community.

The report highlights the fundamental differentiators that set Pulsar apart from the messaging and streaming landscape. Its multi-layer architectural design enables unparalleled scalability and resiliency, while simplifying management complexity and reducing costs. Its built-in multi-tenancy and multi-data center replication ensure that companies are able to build applications with disaster recovery. From top-used features to user testimonials, this report highlights how Pulsar is enabling companies to streamline infrastructure, simplify operations, and deliver more value to customers.

As the move to real-time streaming continues across industries, the infrastructure and support requirements needed to deliver these products and services will only continue to grow. And, as businesses demand more capabilities, scalability, and resiliency from streaming platforms, we can expect the adoption of Pulsar to continue to grow.

Looking forward, Pulsar’s product roadmap reveals exciting, new, community-driven features. Perhaps the most anticipated feature is Kafka-on-Pulsar. Kafka-on-Pulsar enables Kafka applications to leverage Pulsar’s powerful features, such as streamlined operations with enterprise-grade multi-tenancy, without modifying code. This and many other exciting features will be rolling out in 2020.

To stay on top of project updates, we invite you to join the Pulsar community today. Please check out the Pulsar project and follow us on Twitter.

References

[1] “Comparing Pulsar and Kafka: How a Segment-Based Architecture Delivers Better Performance, Scalability, and Resilience.” Splunk, 5 Dec. 2017, https://www.splunk.com/en_us/blog/it/comparing-pulsar-and-kafka-how-a-segment-based-architecture-delivers-better-performance-scalability-and-resilience.html

[2] Guo, Sijie. “When Flink & Pulsar Come Together.” Apache Flink, 3 May 2019, https://flink.apache.org/2019/05/03/pulsar-flink.html

About the Survey

To better understand the demographics of the survey respondents, the PMC included the following questions to learn about the audience.

1. What is your role?

Figure 14.

2. What industry does your organization operate in?

Figure 15.

3. Where is your organization based?

Figure 16.

4. How many employees work in your organization?

Figure 17.

5. How large is your organization in annual sales?

Figure 18.
