Real-Time Data Infrastructure at Uber Yupeng Fu Chinmay Soman [email protected] [email protected] Uber, Inc Uber, Inc
Total Page:16
File Type:pdf, Size:1020Kb
Real-time Data Infrastructure at Uber Yupeng Fu Chinmay Soman [email protected] [email protected] Uber, Inc Uber, Inc ABSTRACT for tracking things such as trip updates, driver status change, or- Uber’s business is highly real-time in nature. PBs of data is con- der cancellation and so on. Some of it is also derived from the tinuously being collected from the end users such as Uber drivers, OnLine Transactional Processing (OLTP) database changelog used riders, restaurants, eaters and so on everyday. There is a lot of internally by such microservices. As of October 2020, trillions of valuable information to be processed and many decisions must messages and petabytes of such data were generated per day across be made in seconds for a variety of use cases such as customer all regions. incentives, fraud detection, machine learning model prediction. In Real-time data processing plays a critical role in Uber’s technol- addition, there is an increasing need to expose this ability to differ- ogy stack and it empowers a wide range of use cases. At high level, ent user categories, including engineers, data scientists, executives real-time data processing needs within Uber consists of three broad and operations personnel which adds to the complexity. areas: 1) Messaging platform that allows communication between In this paper, we present the overall architecture of the real-time asynchronous producers and subscribers 2) Stream processing that data infrastructure and identify three scaling challenges that we allows applying computational logic on top of such streams of mes- need to continuously address for each component in the architec- sages and 3) OnLine Analytical Processing (OLAP) that enables ture. At Uber, we heavily rely on open source technologies for the analytical queries over all this data in near real time. Each area has key areas of the infrastructure. On top of those open-source soft- to deal with three fundamental scaling challenges within Uber: ware, we add significant improvements and customizations to make • Scaling data: The total incoming real time data volume has the open-source solutions fit in Uber’s environment and bridge the been growing exponentially at a rapid rate of year over year gaps to meet Uber’s unique scale and requirements. produced by several thousands of micro services. In addition, We then highlight several important use cases and show their Uber deploys its infrastructure in several geographical re- real-time solutions and tradeoffs. Finally, we reflect on the lessons gions for high availability, and it has a multiplication factor we learned as we built, operated and scaled these systems. in terms of handling aggregate data. Each real time process- ing system has to handle this data volume increase while CCS CONCEPTS maintaining SLA around data freshness, end-to-end latency • Information systems ! Stream management; • Computer and availability. systems organization ! Real-time system architecture. • Scaling use cases: As Uber’s business grows, new use cases emerge from various business verticals and groups. Different KEYWORDS parts of the organization have varying requirements for the Real-time Infrastructure; Streaming Processing real time data systems, which are often competing in nature. For instance, dynamic pricing[31] for a given Uber product ACM Reference Format: (such as rides or eats) is a highly complex real-time workflow Yupeng Fu and Chinmay Soman. 2021. Real-time Data Infrastructure at Uber. involving multi-stage stream processing pipelines that run In Proceedings of the 2021 International Conference on Management of Data various machine learning algorithms along with a fast key- (SIGMOD ’21), June 20–25, 2021, Virtual Event, China. ACM, New York, NY, value store. This system is designed for favoring freshness USA, 14 pages. https://doi.org/10.1145/3448016.3457552 and availability over data consistency, and it’s implemented entirely by engineers. On the other hand, monitoring real- 1 INTRODUCTION time business metrics around orders and sales requires a SQL arXiv:2104.00087v1 [cs.DC] 31 Mar 2021 A lot of real-time data is generated within Uber’s data centers. This like interface used by data scientists with more emphasis data originates in different sources such as end user applications given to data completeness. (driver/rider/eater) or the backend microservices. Some of this data • Scaling users: The diverse users interacting with the real consists of application or system logs continuously emitted as part time data system fall on a big spectrum of technical skills of day to day operation. A lot of services also emit special events from operations personnel who have no engineering back- ground to advanced users capable of orchestrating complex Permission to make digital or hard copies of all or part of this work for personal or real time computational data pipelines. As the personnel of classroom use is granted without fee provided that copies are not made or distributed Uber grows, the platform teams also face increasing chal- for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM lenges on the user imposed complexities, such as safe client- must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, side version upgrade for a large number of applications and to post on servers or to redistribute to lists, requires prior specific permission and/or a managing an increasing number of user requests. fee. Request permissions from [email protected]. SIGMOD ’21, June 20–25, 2021, Virtual Event, China In short, the biggest challenge is to build a unified platform with © 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8343-1/21/06...$15.00 standard abstractions that can work for all such varied use cases and https://doi.org/10.1145/3448016.3457552 users at scale instead of creating custom solutions. A key decision made by us to overcome such challenges is to adopt the open source are often competing with those of other use cases. The different solutions in building this unified platform. Open source software requirements generally include the following aspects: adoption has many advantages such as the development velocity, • Consistency: Mission-critical applications such as financial cost effectiveness as well as the power of the crowd. Given the scale dashboards require data to be consistent across all regions. and rapid development cycles in Uber, we had to pick technologies This includes zero data loss in the inter-region and intra- that were mature enough to be able to scale with Uber’s data as well region dispersal and processing mechanisms, de-duplication as extensible enough for us to integrate it in our unified real-time as well as ability to certify data quality. data stack. • Availability: The real time data infrastructure stack must Figure 1 depicts the high-level flow of data inside Uber’s infras- be highly available with 99.99 percentile guarantee. Loss tructure. Various kinds of analytical data are continuously collected of availability has a direct impact on Uber’s business and from Uber’s data centers across multiple regions. These streams of may result in significant financial losses. For instance, dy- raw data form the source of truth for all analytics at Uber. Most namic pricing leverages the real-time data infrastructure of these streams are incrementally archived in batch processing component for calculating demand and supply ratios per systems and ingested in the data warehouse. This is then made geo-fence, which in turn is used to influence the price ofa available for machine learning and other data science use cases. trip or UberEats delivery. The Real Time Data Infra component continuously processes such • Data Freshness: Most of the use cases require seconds level data streams for powering a variety of mission critical use cases freshness. In other words a given event or log record must be such as dynamic pricing (Surge), intelligent alerting, operational available for processing or querying, seconds after it has been dashboards and so on. This paper focuses on the real time data produced. This is a critical requirement to ensure ability to eco-system. respond to certain events such as security incidents, demand- supply skews, business metric alerts and so on. • Query latency: Some use cases need the ability to execute queries on the raw data stream and require the p99th query latency to be under 1 second. For instance, site facing or exter- nal analytical tools such as UberEats Restaurant manager[53] will execute several analytical queries for each page load. Each such query must be very fast to provide a good experi- ence for the restaurant owner. • Scalability: The raw data streams constitute petabytes of data volume collected per day across all regions. This data is constantly growing based on organic growth of our user base, new lines of business deployed by Uber as well as new real time analytics use cases that arise over time. The ability to scale with this ever growing data set in a seamless man- ner, without requiring users to re-architect the processing Figure 1: The high-level data flow at Uber infrastructure pipelines is a fundamental requirement of the real-time data The paper is organized as follows. In Section 2, we list the re- infrastructure stack. • quirements derived from the use cases which are used to guide Cost: Uber is a low margin business. We need to ensure the the design decisions for various real-time use cases. In Section cost of data processing and serving is low and ensure high 3, we provide an overview of the high-level abstractions of the operational efficiency. This influences a variety of design real-time data infrastructure at Uber.