
Helios: Hyperscale Indexing for the Cloud & Edge

Rahul Potharaju, Terry Kim, Wentao Wu, Vidip Acharya, Steve Suh, Andrew Fogarty, Apoorve Dave, Sinduja Ramanujam, Tomas Talius, Lev Novik, Raghu Ramakrishnan
Microsoft Corporation
{rapoth, terryk, wentwu, viachary, stsuh, anfog, apdave, sindujar, tomtal, levn, raghu}@microsoft.com

ABSTRACT

Helios is a distributed, highly-scalable system used at Microsoft for flexible ingestion, indexing, and aggregation of large streams of real-time data that is designed to plug into relational engines. The system collects close to a quadrillion events, indexing approximately 16 trillion search keys per day from hundreds of thousands of machines across tens of data centers around the world. Helios use cases within Microsoft include debugging/diagnostics in both public and government clouds, workload characterization, cluster health monitoring, deriving business insights and performing impact analysis of incidents in other large-scale systems such as Azure Data Lake and Cosmos. Helios also serves as a reference blueprint for other large-scale systems within Microsoft. We present the simple data model behind Helios, which offers great flexibility and control over costs, and enables the system to asynchronously index massive streams of data. We also present our experiences in building and operating Helios over the last five years at Microsoft.

PVLDB Reference Format:
Rahul Potharaju, Terry Kim, Wentao Wu, Vidip Acharya, Steve Suh, Andrew Fogarty, Apoorve Dave, Sinduja Ramanujam, Tomas Talius, Lev Novik, Raghu Ramakrishnan. Helios: Hyperscale Indexing for the Cloud & Edge. PVLDB, 13(12): 3231-3244, 2020.
DOI: https://doi.org/10.14778/3415478.3415547

1. INTRODUCTION

International Data Corporation (IDC) estimates that data will grow from 0.8 to 163 ZBs this decade [31]. As an example, Microsoft's Azure Data Lake Store already holds many EBs [61] and is growing rapidly. Users seek ways to focus on the few things they really need, but without getting rid of the original data. This is a non-trivial challenge since a single dataset can be used for answering a multitude of questions. As an example, inside Microsoft, telemetry (e.g., logs, heartbeat information) from services such as Azure Data Lake is stored and analyzed to support a variety of developer tasks (e.g., monitoring, reporting, debugging). With the monetary cost of downtime ranging from $100k to millions of dollars per hour [37], real-time processing and querying of this service data becomes critical.

Figure 1(a) shows the distribution of the total number of columns requested to be indexed for streams that users identified as important for debugging a large internal analytics system called Cosmos. The indexes are used to quickly retrieve relevant logs when debugging failures. Over 50% of streams have more than seven columns that are frequently queried, and thus need to be indexed for faster retrieval. Coupled with the high cardinality (see Figure 1(b)) of the underlying columns, primary indexes or simple partitioning schemes are not sufficient; secondary indexes are a necessity to support such diverse workloads. Service telemetry also exhibits skewed temporal behavior — data streams are both diverse (Figure 1(c)) and high-volume (Figure 1(d)) — incoming telemetry can go as high as 4 TB/minute. None of our existing systems could handle these requirements adequately; either the solution did not scale or it was too expensive to capture all the desired telemetry.

In this paper, we present our experiences in building and operating Helios, a system for inexpensive and flexible ingestion, indexing, and aggregation of large streams of real-time data. Helios has evolved over the last five years to support several capabilities. First, it gives users an easy-to-understand data to memory hierarchy mapping based on their query needs, providing them with a cost/benefit trade-off. This fits well with write-once, read-often, long-lifetime scenarios, which are increasingly common. There is a recency bias (since long-lived data becomes irrelevant, inaccurate, or outdated as it ages), and users can cache recent data in faster SSDs and spill older data to disks or remote storage. Second, we support computation offloading in that we can distribute computation (including filtering, projections, and index generation) to host machines, allowing the user to optimize costs. As a concrete example, for telemetry in our data centers, we handle ingestion rates of 10s of PBs/day by distributing index generation and online aggregation to the nodes where the data is generated (see Section 1.1), thus utilizing computational slack on those machines rather than incurring additional costs in the indexing infrastructure. Our approach therefore applies naturally to edge applications, since we can similarly distribute the task of data summarization — including filtering, projections, index generation, and online aggregation — to edge devices. Third, since we build indexes on-the-fly, a much wider range of queries, including point look-ups and aggregations, can be supported efficiently on the full spectrum of data from the freshest to the oldest.

At query time, we want users to leverage existing relational engines without having to choose one over another. While Helios is an indexing and aggregation subsystem, it is designed to provide query engines an abstraction of relations with secondary indexes, leveraged through traditional access path selection [63] at query optimization time. We have successfully verified the practicality of such an abstraction through integration with Apache Spark [11] and Azure SQL Data Warehouse [1].

We view this work as our first effort on building a new genre of distributed big-data processing systems that operate in the context of cloud and edge.

[Figure 1: (a) Distribution of the number of columns queried per stream; (b) Distribution of the cardinality for columns being queried; (c) Heatmap of a subset of data streams in our ingestion workload; (d) Distribution of raw data size ingested.]
We see a number of future research problems that are worth investigating. One major problem is how to distribute computation between cloud and edge. In this respect, Helios provides one example for a specific type of computation, i.e., indexing of data. There are a number of related problems, such as resource allocation, query optimization, consistency, and fault tolerance. We believe that this is a fertile ground for future research.

1.1 An End-to-End Tour of Helios

In this section, we discuss Helios through an example.

Example (Log Analytics). Consider a data-parallel cluster such as Microsoft's internal service Cosmos [19], which is similar to a fully managed collection of Hadoop clusters, each with tens of thousands of nodes, containing several exabytes of data. The scope of analytics on these datasets ranges from traditional batch-style queries (e.g., OLAP) to explorative, "finding a needle in a haystack" type queries (e.g., point look-ups, summarization). Users (denoted by user_id) submit user jobs. A user job (denoted as job_id) is a distributed program execution that runs on multiple machines, and can be thought of as a graph in which each vertex (denoted vertex_id) is a task that runs on a single machine and edges denote precedence constraints between tasks.

Example users of log analytics over Cosmos telemetry include:
User A, a service engineer interested in quickly locating all logs generated by a single failing vertex.
User B, a sales representative computing an hourly report for jobs belonging to a specific user.
User C, a data explorer looking for all relevant streams that contain latency information pertaining to a specific user so she may compute a per-tenant resource prediction model.

Figure 3 shows an end-to-end example. A four-column table is generated across a set of sources. Users can specify a loose schema for the incoming data and provide hints about search patterns that could be interesting at query time with the help of a CREATE STREAM statement (Figure 2).

Figure 2: Example CREATE STREAM statement.

CREATE STREAM Table
FROM FileSystemMonitorSource(SourceList, "d:\logs_*.txt")
USING DefaultTextExtractor("-d", ",")
(
    Timestamp: datetime PRIMARY KEY,
    user_id: string INDEX,
    job_id: string INDEX,
    vertex_id: string INDEX
)
OUTPUT TO "abfs://bigdata/logdata"
PARTITION BY FORMAT(Timestamp, "yyyy/mm/dd/hh")
CHUNK EVERY 1 MINUTE;

With Helios, the end-to-end experience is divided as follows:

1. Data Ingestion. At ingestion time, the CREATE STREAM statement consumes the data streams being generated on a list of machines specified by SourceList and indexes the relevant columns on-the-fly as shown in Figure 3. The statement generates two (system) jobs: one is packaged into a self-containing executable called an agent (see Section 5.1) that gets deployed onto every machine producing the log stream, and the other gets deployed onto the actual ingestion machines. In this example, streams (from each machine) are accumulated and chunked every minute into data blocks by the local agent (Step 1 in Figure 3). These data blocks are then sent to a collection of ingestion machines, where they are accumulated into larger blocks and persisted into a collection of append-only streams on a distributed file system (e.g., ADLS). Therefore, each block is persisted as an independently addressable chunk of data with a unique durable URI (Step 2 in Figure 3).

Subsequently, a collection of index blocks derived from these data blocks (by simply extracting the values of the corresponding "to be indexed" columns user_id, job_id and vertex_id) are merged into a global index, which holds the (Column, Value) → Data Block URI mapping (Step 3 in Figure 3). This step allows us to provide indexed access to massive data streams. While we discussed the case where indexing happens on the ingestion layer for simplicity, in reality, it gets pushed down into the agent. We discuss the complete model in Section 2.1.

2. Query. Users can query the underlying data using SQL. The optimizer will transparently prepare the best execution plan, leveraging the indexed columns if they help.

SELECT Timestamp
FROM Table
WHERE user_id == 'adam'

If the optimizer picks a plan that leverages the indexes, relevant (indexed) predicates are used to identify candidate data blocks using the global index service (Step A in Figure 3), in order to prune the original data stream for subsequent query execution (Step B in Figure 3). For instance, in Spark, the query is distributed to workers running on different machines for data block retrieval, exploiting task-data locality where applicable (Step B). Each block is then read by leveraging existing optimizations (e.g., pushing down predicates) to further obtain any potential data reduction. Any shuffling or join is handled by the underlying query engine before serving the final result to the user. We discuss the complete query processing model in Sections 3 and 4.

[Figure 3: E2E Helios scenario: (1) Data is generated; (2) Chunks of data are independently committed; (3) A block-level index maps the search key space to data blocks; (A) Query engine obtains candidate data blocks from the block-level index; (B) Relevant data blocks accessed via traditional access methods. The example shows the query SELECT Timestamp FROM Table WHERE user_id = 'adam' probing index blocks i1, i2 to locate data blocks b1, b2 over the sample table below.]

Row | Timestamp | user_id | job_id | vertex_id
1 | 1:00 PM | john | abcdjob1 | aXbaghr
2 | 1:00 PM | adam | pqrsjob1 | jhuYhajq
3 | 2:00 PM | john | abcdjob1 | ahsqAqE
4 | 2:00 PM | john | pqrsjob2 | asdaQWe

1.2 Production Impact

Over the last five years we have designed, implemented, and operated Helios as a distributed ingestion and indexing system for structured and unstructured data at Microsoft. Helios is a highly available service (internally classified as a Ring-0 debugging service) that scales reliably to petabytes of data, collects close to a quadrillion log lines per day from hundreds of thousands of machines, indexes approximately 16 trillion keys per day, and has a wide geo-distributed footprint (> 15 data centers around the world). It serves as a reference blueprint for other internal large-scale systems within Microsoft.

Helios is used widely for several internal use cases at Microsoft, e.g., debugging and diagnostics, workload characterization, cluster health monitoring, deriving business insights and performing impact analysis of incidents in Azure Data Lake (ADL) [61] and Cosmos [19]. In addition, there are several other interesting applications based on ideas from Helios, including debugging for several production clusters, secondary indexing for data stored on ADL/Cosmos, etc. The Helios clusters used by these applications span from a handful to hundreds of machines and store petabytes of data over a specified retention period.

1.3 Our Contributions

To summarize, the key contributions of this paper are:

• Hierarchical Data Abstraction. The hierarchical nature of the Helios index allows each layer to be (1) computed at different stages of the data collection life cycle, (2) held at a different tier in a storage layer, and (3) utilized independently at query time. This flexibility allows for a full spectrum of cost and performance trade-offs that users can choose from, while operating reliably at scale.

• Production System. We describe Helios in production use at Microsoft, highlighting the following advantages: First, it allows users to aggressively push down indexing and meta-data extraction to the edge (i.e., sources generating data) so they can be performed on-the-fly during ingestion time. Second, it builds on lower-level abstractions such as ADL's Tiered Store [61] that provide scalability, synchronous replication, and strong consistency. Finally, in our system, index is data. Helios indexes are represented and exposed as data in easily accessible formats, available for any query engine. This allows us to democratize index usage outside of Helios, e.g., query engines can perform their own cost-based index access path selection, and index reads can independently scale without being constrained by compute resources provided by Helios. In fact, we were able to integrate Apache Spark [11] with no changes to its optimizer/compiler. Some of the core query optimization techniques from Helios have been open-sourced as part of Hyperspace [3].

• Operational Experiences. We discuss our operational experience with a large-scale deployment of Helios at Microsoft for enabling log analytics for ADL and Cosmos.

Paper organization. We discuss the end-to-end user experience in Section 2.1, and describe how this has been implemented in Section 2. We describe the indexing data structures in Section 3 and explain how indexes fit into query processing in Section 4. We describe practical considerations in deploying a large-scale system in production in Section 5 and illustrate real-world use cases in Section 6. We summarize lessons we learned in Section 7, discuss related work in Section 8, and conclude in Section 9.

2. HELIOS SYSTEM ARCHITECTURE

Helios is an end-to-end system that allows users to ingest data and query it through a distributed query engine. In this section, we describe the individual layers within the broader system.

2.1 Data and Query Model

Helios exposes a semi-relational data model to applications. Data is organized as tables of records loosely schematized with some attributes exposed for indexing — there may be additional attributes, and not all records have values defined for all attributes. Records can be stored in any storage system with the following requirement: the storage system provides an atomic append operation that takes a group of one or more records, herewith referred to as a block, and returns a handle (e.g., URI) that can be used to efficiently retrieve the appended bytes.

Such a data model is widely supported by commonly available storage systems in a data lake ecosystem. For instance, HDFS [64] and ADLS [61] can address data using ⟨offset, length⟩ within a file, while WAS [17] can address data using a blob URI.

Helios provides global secondary indexes (herewith referred to as index or global index) on top of the data stored in the underlying storage, stored as index blocks on the same reliable storage. A secondary index maps the indexed columns of a record to the handle of the data block that contains the record. Specifically, a user ingests data into the underlying storage system (where it is stored as data blocks), and Helios continuously builds and maintains the secondary indexes (persisted as index blocks) on the ingested data. Query engines can leverage Helios's secondary indexes at query time to prune data at the granularity of data blocks.

Helios currently supports indexing all primitive data types such as integer, float, string, bytes, and so on. For more complex nested objects (e.g., JSON), Helios allows users to promote the elements into a flattened top-level column which can then be indexed.
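As an illustration of the atomic-append storage contract above (a sketch under our own simplified assumptions, not the actual storage-provider interface), a single append-only file already satisfies it, with ⟨offset, length⟩ as the handle:

class AppendOnlyStore:
    # Sketch of the contract Helios assumes: appending a block of
    # records returns a handle that efficiently retrieves exactly the
    # appended bytes. The handle here is (offset, length) within one
    # file, mirroring how HDFS/ADLS address data; a real provider
    # must also make the append atomic under concurrent writers.
    def __init__(self, path):
        self.path = path
        open(path, "ab").close()

    def append(self, block: bytes):
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(block)
        return (offset, len(block))        # the durable handle

    def read(self, handle):
        offset, length = handle
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

store = AppendOnlyStore("stream.dat")
h = store.append(b'{"user_id": "adam", "job_id": "pqrsjob1"}')
assert store.read(h) == b'{"user_id": "adam", "job_id": "pqrsjob1"}'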
2.2 Overview

Ingestion in Helios consists of three operations: (1) building the index, (2) persisting both the index and data, and (3) orchestrating fault-tolerance protocols between data sources and ingestion servers. Building an index is computationally intensive — this is aggravated in the case of unstructured streams since they need to be pre-processed (e.g., parsing and imposing schema) before they can be indexed. The sheer size of data ingested in our production environments necessitates large ingestion clusters. The cost of setting up and maintaining such clusters adds to the capital and operational expenses of the host clusters (e.g., Cosmos), and it is very desirable to reduce the size of ingestion clusters. Our key idea here is to split the computation so that the expensive part (e.g., parsing, indexing, aggregation) can happen outside the ingestion clusters. Concretely, we package these operations into agents that can be deployed, e.g., where the data to be indexed is generated, if there is computational slack we can utilize on those machines. We transport the resulting index entries to the ingestion clusters in a compact format, and utilize the ingestion clusters for relatively cheaper operations (e.g., meta-data and index management). This yields many additional advantages, e.g., reducing network, computation, and storage requirements.

As the data and indexes are both write-heavy and require stringent read latency guarantees, we designed a new global indexing mechanism (Section 3). This mechanism is optimized for workloads that are write-heavy. It also provides users with flexibility to trade off query consistency and latency.

At query time, the multi-layered index built by Helios presents query engines an abstraction of relations with secondary indexes. Such an abstraction can be easily integrated into existing query engines through traditional access path selection [63]. We explain how we translate query requests to operations over the data and indexes stored in distributed file systems, with various levels of consistency guarantees, in Section 4.

2.3 Architectural Details

[Figure 4: Helios Architecture — a frontend gateway cluster (REST API, user interface, authentication/authorization manager), a backend cluster with a control plane (streaming ingestion with stream query manager and topology compiler; meta-data services with catalog, global index, statistics and transaction databases, and service discovery; batch/interactive query with query manager, optimizer/compiler, ingestor and driver pools running tasks), agents on the data sources, and a durable storage layer (Azure Data Lake, Azure Blob Storage, or any HDFS-compatible store) behind a storage provider abstraction.]

Figure 4 shows the high-level architecture of Helios. Users interact with Helios using an agent, which prepares their data for ingestion and sends requests to one of many backend servers.¹ These servers are responsible for reading/writing data from remote data sources, updating the catalog, building global indexes, and running user queries. Clients can choose to persist their data either in Azure Data Lake Store (ADLS) or Azure Blob Storage (WAS), with custom retention policies based on data stream type (e.g., preserve error logs) and/or time (e.g., store for a minimum retention period). Other storage integrations are possible through the storage provider abstraction that Helios exposes.

Data Processing Agents. Data collection happens through a data processing agent, an independently-running standalone entity intended to be executed as close as possible to the data source (e.g., host machines outside the cloud), though we could also run the agent in the ingestion cluster. Because of the geo-distributed architecture of Helios, an agent can run in any pod/rack/datacenter/country and thus, special care must be taken to avoid unnecessarily increasing write latency — an incorrectly chosen server (e.g., in a datacenter on the other side of the world) might lead to backpressure scenarios that may be hard to mitigate. To alleviate this problem, an agent periodically utilizes a service discovery component to obtain the latest network topology and picks the list of servers that are healthy and nearby. In addition, the agent also takes care of fault tolerance, in terms of both local failures (e.g., machine restart) and protocol-level failures (e.g., ingestion server being unavailable).

Backend Cluster. The backend cluster consists of ingestion machines, meta-data services, and query execution machines. The ingestion and query execution runtimes are co-hosted on the same set of physical machines to exploit locality whenever possible — if the user queries for recent data, there is no need to fetch data from remote machines. Backend servers are typically co-located in the same datacenter but within different fault-tolerance domains. This ensures that Helios servers have fast access to the underlying data and also increases availability and reliability. Data, unless deemed critical, is typically not geo-replicated at this layer — if the user chooses to, she can enable geo-replication features in the underlying storage system.

Ingestion servers are stateless. An agent can communicate with a different ingestion server for each request. This allows a wide variety of load distribution mechanisms (e.g., round-robin, consistent hashing, etc.). It is also easy to quickly add (or remove) servers from our system in response to the total instantaneous load. Any changes in the ingestion topology will automatically re-balance the load. When a block is received from an agent, the ingestion server is responsible for executing the corresponding user-defined ingestion dataflow, which can consist of indexing operations. The Helios indexing component (Section 3) is responsible for providing a distributed global index that helps locate data blocks (both in cache and remote storage) for a set of given predicates. Statistics about the data stream are persisted into a light-weight statistics database for providing insights to users about their data streams.

The backend clusters have several additional components that allow for the execution of distributed SQL queries. In an example instantiation, Helios supports large-scale interactive data processing through Apache Spark [11]. For performance reasons, the Spark workers are allowed to communicate directly with the ADLS remote storage to fetch data but are configured to exploit local caches on the ingestion servers.

Conceptually, the ingestion layer can be considered a best-effort temporary cache — we do not implement active replication in this layer. Instead, we rely on passive replication, i.e., if a block of data cannot be located in the cache tier during an incoming query, it will be fetched from the remote store and cached within the server. Since the servers do not implement active replication, the process of adding/removing servers is usually quick (i.e., there is no explicit hydration or draining required). This feature is disabled for stores that natively support tiering (e.g., ADLS [61]).

¹In our design, data is shipped (from the agent running on the edge) and stored at the Helios side, and we do not assume that both the data producers (e.g., Cosmos machines) and Helios clusters access the same durable storage (e.g., Azure Storage). It is true that in reality Cosmos machines can also access Azure Storage. However, they do not ingest data directly to Azure Storage. Instead, they send data to Helios, and it is Helios's responsibility to store the received data in Azure Storage and build the index on top of the data. Since Cosmos has hundreds of thousands of machines, writing (small log records) directly to Azure Storage may stress the system too much. Therefore, we use Helios ingestion servers also as a layer for performing intermediate aggregations of data so we can write larger blocks to the underlying durable storage.
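Returning to the data processing agents above, their server-selection behavior can be sketched as follows (a hypothetical structure of our own; the production agent logic is more involved): refresh the topology from service discovery, prefer healthy nearby servers, and fail over on protocol-level errors:

def pick_servers(service_discovery):
    # Keep only healthy servers, nearest first (e.g., same rack or
    # datacenter before cross-geo), to avoid write-latency blowups.
    healthy = [s for s in service_discovery() if s["healthy"]]
    return sorted(healthy, key=lambda s: s["distance"])

def send_block(block, service_discovery):
    for server in pick_servers(service_discovery):
        try:
            server["send"](block)   # upload to this ingestion server
            return True
        except ConnectionError:
            continue                # server unavailable: fail over
    return False                    # agent buffers locally and retries

def _far(block):
    raise ConnectionError("cross-geo server timed out")

def _near(block):
    pass                            # accepts the block

topology = lambda: [{"healthy": True, "distance": 9, "send": _far},
                    {"healthy": True, "distance": 1, "send": _near}]
print(send_block(b"1-minute chunk", topology))  # True, via the nearby server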
Finally, the catalog provides the latest view of the data streams. This gives the user the ability to either interactively query (relatively smaller) portions of their dataset or resort to traditional batch processing. The ADLS-based remote storage model and our geo-distributed presence lead to latency and throughput characteristics that can support extremely high ingestion rates.

Frontend Cluster. Users submit queries via a frontend gateway cluster that is responsible for authenticating and authorizing users. In production clusters where we expose Spark as the query engine, several Spark application drivers take incoming user queries, create new user sessions as necessary, and distribute them to the Spark workers running on the backend machines. This layer is also responsible for providing the necessary query-level fault tolerance. We do this through a traditional log-and-replay mechanism, i.e., log every incoming user query in a session, and if a session is not found upon the next query, provide the user with an option to replay their entire session. User sessions are fair-shared and automatically expire in an admin-configured duration.

3. ASYNCHRONOUS INDEXING

Helios maintains indexes on thousands of columns over petabytes of data per day. A key factor in Helios's success is asynchronous index management: Helios maintains eventually consistent indexes against the data, but exposes a strongly consistent queryable view of the underlying data with best-effort index support.

3.1 The Helios Index

Helios leverages a combination of classic tree-based and hash-based indexing techniques to build its distributed global index (see Figure 5). A Helios index I contains two components I⟨P, Z⟩:

• Partition Zone P – Helios distributes indexed data by hashing on a user-specified partitioning key or simply using round-robin distribution, if no key is specified. Each partition falls into its own reliability zone, with multiple replicas to ensure fault tolerance and improve availability.

• Tree-based Zone Index Z – Within each reliability zone, Helios maintains a tree-structured index on the data.

[Figure 5: Helios leverages a combination of tree- and hash-based indexing techniques to build its distributed global index. Write requests from clients (e.g., ingestion) and read requests from clients (e.g., query engines) arrive at frontend servers and are routed to a primary and replicas, each hosting a partition zone with an in-memory zone index; index maintenance applies ADD/MERGE compactions across levels 0, 1, and 2, backed by the progress log and Azure Data Lake (distributed file system).]

[Figure 6: The zone index structure in Helios — levels L0 (leaf nodes N01–N08), L1 (N11–N13), and L2 (N21), with a per-level watermark (w0, w1, w2); orphan nodes are shown in gray, and the progress log tail beyond w0 up to the latest committed time t sits below the leaves.]

3.1.1 Tree-based Zone Index

The design of the zone index is inspired by a variety of classic indexes, including B-trees [27], log-structured merge (LSM) trees [58], and Merkle trees [53]. Figure 6 presents its structure:

• Leaf nodes – A leaf node maps search keys to pointers to the corresponding data with the search keys. Pointers here could be paths to files stored in the underlying file system, or offsets inside files that locate addressable data chunks.

• Internal nodes – An internal node contains pointers to internal/leaf nodes at the next lower level.

In contrast to classic tree-structured indexes, there is no global order on the indexed keys in any of the levels hosted by the zone index, though one is free to sort keys within each individual node (leaf or internal node) locally. This will become clear after we present the index construction algorithm (Section 3.1.2). Moreover, each level of the index maintains a watermark that records the timestamp (implemented as a sequence number) of the latest event being ingested and indexed by this level. Since Helios adopts an asynchronous, merge-based approach for index updates, these watermarks serve as safeguards that ensure the correctness of a novel hybrid index scan operator that can be employed by query execution plans (see Section 4).

3.1.2 Index Construction

The Helios index is constructed in a bottom-up manner. This is different from building classic tree-based indexes such as B-trees, where data is inserted into the index in a top-down manner, even if the tree grows bottom-up (e.g., the bulk loading algorithm [60]). Helios uses a progress log that records data events being ingested so far (see Section 3.3). We start by scanning the progress log from the latest checkpoint. Data records are grouped in a uniform fashion, by using fixed-size (e.g., 100 records per chunk) or fixed-time (e.g., chunk every 1 minute) policies, and we construct an initial leaf node (i.e., a node at level L0) for each group/chunk. A fundamental difference with respect to classic tree-based indexes such as a B-tree is that no global order is enforced over the leaf nodes: indexed keys are packed into the leaf nodes based on the arrival order of the corresponding data records. After this initialization stage, an index node at level Li (i ≥ 0) is finalized by combining indexing information from nodes at either level Li−1 (for i ≥ 1) or level Li (for i ≥ 0). Specifically, we represent the content of an index node N as a collection of pairs ⟨K, P⟩, where K is a set of search keys and P is a set of pointers. We use the notation N{⟨K, P⟩} hereafter. The following property must be satisfied by every index node to ensure correctness of index-based search:

PROPERTY 1. For any search key k ∈ K, there exists at least one pointer p ∈ P such that k appears in the index or data block (e.g., file or chunk) pointed to by p.
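In miniature, with plain dictionaries standing in for index nodes (our simplification, not Helios's on-disk layout), Property 1 is directly checkable:

def holds_property1(node, resolve):
    # node: search key -> set of pointers; resolve(p) returns the set
    # of keys actually present in the index/data block p points to.
    return all(any(k in resolve(p) for p in ptrs)
               for k, ptrs in node.items())

blocks = {"b1": {"adam", "john"}, "b2": {"john"}}
leaf = {"adam": {"b1"}, "john": {"b1", "b2"}}
print(holds_property1(leaf, blocks.get))   # True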
Given two index nodes N1{⟨K1, P1⟩} and N2{⟨K2, P2⟩}, we can combine N1 and N2 to produce a new index node N{⟨K, P⟩}:

• Merge two nodes N1 and N2 from Li, denoted by N1 ∪ N2 (Algorithm 1), which simply looks for common keys and takes a union of corresponding pointers, i.e., K = K1 = K2 and P = P1 ∪ P2. Any ⟨K1, P1⟩ (resp. ⟨K2, P2⟩) such that K1 (resp. K2) only appears in N1 (resp. N2) remains unchanged in N. N1 and N2 will be removed from Li after merging.

• Add two nodes N1 and N2 from Li−1, denoted by N1 ⊕ N2 (Algorithm 2), which takes a union of the set of keys and creates a new set with two pointers that point to N1 and N2, i.e., K = K1 ∪ K2 and P = {p(N1), p(N2)}, where p(N) is a function that returns a pointer to index node N. However, N1 and N2 will not be removed after N is inserted.

Algorithm 1: Merge two index nodes N1 ∪ N2.
Input: Two index nodes N1{⟨K1, P1⟩} and N2{⟨K2, P2⟩} from level Li of the index.
1  N ← N1;
2  foreach ⟨K2, P2⟩ ∈ N2 do
3      if there exists ⟨K, P⟩ ∈ N s.t. K = K2 then
4          P ← P ∪ P2;
5      else
6          N ← N ∪ {⟨K2, P2⟩};
7  Insert N{⟨K, P⟩} into Li of the index;
8  Delete N1 and N2 from the index;

Algorithm 2: Add two index nodes N1 ⊕ N2.
Input: Two index nodes N1{⟨K1, P1⟩} and N2{⟨K2, P2⟩} from level Li−1 of the index.
1  K ← K1 ∪ K2;
2  P ← {p(N1), p(N2)};
3  Insert N{⟨K, P⟩} into level Li of the index;

Clearly, both Merge and Add are closed under Property 1:

PROPERTY 2. The combined node N produced by either N1 ∪ N2 or N1 ⊕ N2 preserves Property 1.

In our current implementation of the index construction phase, we apply the Merge operator to the leaf nodes and use the Add operator for growing the index upwards. We note that Merge is not just limited to leaf nodes; it is also used on nodes at the same level to compact smaller blocks into larger ones, as we will see later. Algorithm 3 illustrates building the entire index using the two operators Merge and Add.

Algorithm 3: Construction of the zone index Z.
Input: D, the set of data blocks; n, the total number of index levels.
Output: Return the zone index Z.
1  foreach B ∈ D do
2      Create an index node N{⟨{k}, {p(B)}⟩} for each search key k found in B;
3      Add N to the leaf level L0;
4  Apply Merge (Algorithm 1) to the leaf level L0 w.r.t. some merge policy;
5  for 1 ≤ i ≤ n do
6      Build the level Li by applying Add (Algorithm 2) to nodes from Li−1 w.r.t. some add policy;
7  return The zone index Z constructed;

Both Merge and Add follow some policies. The merge policy determines when to merge nodes into a potentially more compact representation, whereas the add policy determines when to add a new layer that grows the tree. These policies are highly flexible, and different levels can employ different policies. As an example, at the leaf level, we currently use a size-based policy for Merge: if the sizes of both N1 and N2 are below a (configurable) threshold (e.g., 64MB), then N1 and N2 will be merged. Similarly, the policy for Add is also size-based — we keep adding nodes (at a higher level) until the size of the resulting node reaches a certain (configurable) threshold. We discuss more policies in Section 3.1.4.

Due to the size-based policies, a non-root index level may contain index nodes that are orphans, i.e., they do not have parent nodes in the next higher level. For example, in Figure 6 we have marked all orphan nodes using gray. Also note that by definition all index nodes in the root level are orphans. In Helios, we implement a hybrid index scan operator that utilizes these orphan nodes to provide stronger consistency for query semantics (Section 4). Of course, one can reduce the number of orphan nodes by choosing different merge/add policies, or simply eliminate orphan nodes by allowing underutilized index nodes.

Hash-based Key Space Reduction. One problem of Merge and Add is that the resulting index node may contain many index keys (after taking the union). In particular, this is a critical issue when the index keys are from a large domain consisting of billions (e.g., Job ID, Device ID) or trillions (e.g., Vertex ID, Task ID) of search keys. To avoid this phenomenon of "cascading explosion," instead of directly taking a union over the index keys, we first apply a hash function on the index keys and then take the union over the hashed values. Each internal level of the index uses a different hash function, where the hash function used by a higher level further reduces the key space generated by the hash function used by the previous lower level.

3.1.3 Index Maintenance and Compaction

Maintaining the zone index Z when new data arrives is straightforward. We keep accumulating data blocks until a desired number is reached, which triggers an index compaction procedure as illustrated in Algorithm 4.

Algorithm 4: Updates and compaction of the index.
Input: Z, the current zone index with n levels; D, the set of new data blocks.
1  foreach B ∈ D do
2      Create an index node N{⟨{k}, {p(B)}⟩} by collecting search keys k from B;
3      Merge N with orphans in the leaf level L0;
4  for 1 ≤ i ≤ n do
5      Add new nodes in the level Li, starting from the existing orphans in Li;

We again perform compaction in a bottom-up fashion. We first construct the new leaf nodes by merging the new data blocks and any old orphans that are below the size threshold. Since more leaf nodes are available now, this may trigger adding more internal nodes in the next higher level. If so, we apply the Add operator starting from the old orphan nodes. This procedure is recursive, and we keep adding more internal nodes level by level until no more actions can be performed.
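A compact Python rendering of the two primitives, including the per-level hash reduction, may help fix ideas. This is a sketch over our own dictionary-based node representation; the production operators work on serialized index blocks:

class Node:
    # An index node N{⟨K, P⟩}: entries maps each search key to the
    # set of pointers (data-block URIs or child-node ids) for it.
    def __init__(self, entries, level=0):
        self.entries = {k: set(p) for k, p in entries.items()}
        self.level = level

def merge(n1, n2):
    # Algorithm 1 (Merge, same level): union pointer sets of common
    # keys; keys unique to either node carry over unchanged.
    out = {k: set(p) for k, p in n1.entries.items()}
    for k, p in n2.entries.items():
        out.setdefault(k, set()).update(p)
    return Node(out, n1.level)            # n1 and n2 are then deleted

def add(n1_id, n1, n2_id, n2, hash_fn):
    # Algorithm 2 (Add, grows the tree): union the (hashed) key sets
    # and point the new node at its two children; children remain.
    keys = {hash_fn(k) for k in n1.entries} | {hash_fn(k) for k in n2.entries}
    return Node({k: {n1_id, n2_id} for k in keys}, n1.level + 1)

# Leaves built from two data blocks; Add uses a coarse hash that
# folds the key space into 8 buckets (hash-based key space reduction).
n01 = Node({"adam": {"b1"}, "john": {"b1"}})
n02 = Node({"john": {"b2"}})
n11 = add("n01", n01, "n02", n02, hash_fn=lambda k: hash(k) % 8)
print(merge(n01, n02).entries)   # {'adam': {'b1'}, 'john': {'b1', 'b2'}}
print(n11.entries)               # hashed keys -> {'n01', 'n02'}

Property 2 holds by construction in both cases: every key in the combined node still reaches, through some pointer, a block that contains it.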
3.1.4 A Generic Indexing Model

Thus far, we have presented the index structure along the lines of our current implementation in Helios. However, the index model can be generalized to a form that has little dependency on our specific implementation. We now discuss this generic index model.

Index Structure. The generic index model still contains two components: (1) a partitioning scheme P that distributes data to different zones, and (2) a hierarchical index Z within each partition zone. The partitioning scheme P is not restricted to the hash-based or round-robin based schemes currently implemented in Helios, though. One can use various other partitioning strategies such as range-based ones. In fact, these strategies are more favorable in the presence of data skew. On the other hand, while the zone index Z remains hierarchical, like the one implemented in Helios, it does not necessarily need to be a tree. Indeed, the zone index can be generalized to a directed acyclic graph (DAG). This would enable more flexible merge policies for merging non-orphan index nodes.

Properties of Index Nodes. While index nodes are currently stored as HDFS files in Helios, this is not necessary in general. Indeed, an index node only needs to be an atomically addressable chunk (AAC) that conforms to Property 1. By "atomically addressable" we mean that an index node or data chunk is indivisible and has its own uniform resource identifier (URI), which is unique in a hierarchical storage system. For example, if we wish, we could even use a database table to store an index node — in this case, the table identifier would become the URI for this index node.

Primitives for Index Construction. In addition to the two primitives, Merge and Add, for index construction, one can introduce other primitives. The only requirement for an index construction primitive is the closure property (i.e., Property 2). As an example of another primitive, consider a Split operator that splits an index node N{⟨K, P⟩} into multiple ones N1{⟨K1, P⟩}, ..., Nl{⟨Kl, P⟩}, where K1, ..., Kl form a partition of K using hash partitioning. Split is useful in situations where the key space is large and index nodes tend to contain many distinct keys. In such cases we can apply Split to index nodes before performing Merge or Add, and design merge/add policies that tend to combine index nodes with overlapping key sets. Clearly, the Split operator follows the closure property.

Policies for Applying Primitives. For each of the index construction primitives, one can employ a policy that specifies when and how to execute the primitive. We have discussed two specific policies, both size-based, for the Merge and Add primitives. Indeed, due to the closure property followed by the primitives, these policies can be made generic interfaces that allow for various implementations. As another example of a merge policy, one can cluster index nodes based on the similarity between their key sets; one then simply merges index nodes that fall into the same cluster.

Comparison with Learned Indexes. The recent emergence of learned index structures has spawned extensive research [43]. In some ways, the Helios index is closer to the spirit of learned index structures than traditional tree-based indexes, though we do not employ any learning-based techniques at present. At the highest level, a learned index tries to learn a function that maps an index search key to the leaf nodes containing the search key. The Helios structure in its generic form performs exactly the same function, with the difference that the mapping is not learned from data but composed from the (typically, hash) partitioning functions used by different index levels. While learned indexes might learn this mapping more accurately on static data or on evolving data without drift of data distribution, the Helios index is perhaps more robust in a dynamic environment where the underlying data distribution can drift from time to time, which is typical in workloads faced by large-scale data ingestion systems.

3.2 Key Properties

Asynchrony-friendly. Building and maintaining indexes consists of atomic writes of both data and its associated index into the underlying storage systems. Maintaining indexes synchronously requires that two atomic writes (i.e., one for the data block and one for the index block) be in the same transaction. Applications can use Helios for high availability by replicating their index and data. However, replication requires distributed consensus. This requirement makes our write transaction relatively heavy and may lead to increased latency or back-pressure. Our initial customer was Cosmos [19], where ingestion can happen simultaneously from more than a quarter-million machines. Detecting and dealing with back-pressure in a synchronous ingestion model can be very expensive. To balance scale/user requirements with a reasonable end-to-end experience, our approach decouples index building from data ingestion, making them asynchronous. This asynchrony also allows us to utilize optimizations such as buffering index updates for ingested data, thereby amortizing the replication and consensus cost. Although not mandatory, this design decision to make the index only eventually consistent with the underlying data allows us to ensure that data ingestion latency will not be affected by index management operations, i.e., users will not experience any latency increase even with extra index management.

Load balancing-friendly. Since the Helios indexing model is recursive, any node can offload its computation to another node (sibling/parent/child), giving us great flexibility in how the model is implemented in practice. For instance, if the source machine (which could be an edge node) is overloaded, it can choose to trade off more network bandwidth for an opportunity to offload its 'parsing' or 'indexing' work to a parent node (inside the cloud).

3.3 Durability, Scalability & Consistency

Achieving durability through the progress log. The key to decoupling yet systematically orchestrating the index building and data management asynchronously is a layer that does the necessary book-keeping. For this, we introduce the progress log to track the progress of both data ingestion and index building. Since all writes to the underlying storage system go through Helios (either directly or indirectly, through another write log), a write transaction can now be modeled as follows. Helios first durably appends the data into the storage system. Once the write is acknowledged with the handle, we atomically append a log entry consisting of the handle into the progress log, and only acknowledge the user upon a successful log append. Any failure during the data append or the log append will abort the write transaction and require a retry from the caller. Compared to a traditional write transaction in a storage system without Helios, Helios only adds the cost of an extra log append operation to the write transaction, which causes minimal additional latency. The progress log also serves as the naming service for the data stored, i.e., there is a one-to-one mapping between the log entries and the corresponding data blocks written to storage.

Note that the progress log is different from a traditional Write-Ahead Log (WAL) — a traditional WAL logs the operations before they are executed, while our progress log only records successful operations afterwards, not failed ones. This design of the progress log avoids the need to include the appended data in the log (thus reducing its size), and simplifies the maintenance and cost of error handling (i.e., the caller just retries). As a consequence, the log is much more compact, and is blazingly fast to maintain, even in a large-scale distributed environment. Although this design may incur duplicated data caused by retries at failures, queries can use the progress log for a consistent view of the underlying data. We have observed that the amount of duplication incurred due to failures is hardly noticeable in practice. Not dealing with duplicates was a conscious decision driven by two considerations: (1) Helios strives to achieve "at least once" semantics, and (2) we have to deal with data streams that are out of our control (e.g., generated by an external process that does not offer any standard means to detect duplicates, e.g., as in the case of logging).
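The write transaction can be sketched as follows (our simplification; the production log is the replicated, Hekaton-backed service of Section 5.2). The key point is that the log records only handles of successful appends, so a retried partial failure can at worst duplicate a block, never lose one:

class MemStore:
    # Stand-in for the durable store: append returns a handle.
    def __init__(self):
        self.blocks = []
    def append(self, block):
        self.blocks.append(block)
        return len(self.blocks) - 1        # handle

class ProgressLog:
    # Records *successful* appends only (not a WAL); the ordered
    # entries double as the naming service for data blocks.
    def __init__(self):
        self.entries = []
    def append(self, handle):
        self.entries.append(handle)

def write_transaction(store, log, block, retries=3):
    for _ in range(retries):
        try:
            handle = store.append(block)   # 1. durably append the data
            log.append(handle)             # 2. atomically log the handle
            return handle                  # 3. only now ack the caller
        except IOError:
            continue   # retry; may duplicate data, never loses it
    raise IOError("write transaction aborted")

store, log = MemStore(), ProgressLog()
write_transaction(store, log, b"records 1..100")
assert log.entries == [0]   # one log entry per committed data block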
Achieving scalability through partitioning. Once a write transaction of a data block succeeds, index hydration (herewith, hydration) — the process of building and durably persisting indexes of the newly ingested data blocks and exposing them for query engines — happens asynchronously in the background. Specifically, the indexing layer maintains an offset into the progress log, to keep track of the handle to the latest data block whose index has been built. The indexing layer periodically triggers hydration on the newly inserted data blocks, that is, the data blocks after the index offset. Such a hydration process retrieves the data block's columns that were configured to be indexed and inserts the corresponding entries into the index. The frequency of triggering the hydration process can be tuned by users to trade off throughput and latency of index building. In general, a longer period can buffer more records and insert their indexes in a batch, improving throughput. However, the index will lag the data for a longer period, which in turn may affect the cost of querying.

Two issues in the data ingestion process are worth additional discussion. First, the progress log holds only the handles to the data. Thus, retrieving the data for building indexes requires reading data from storage. Such an indirection may incur extra (and expensive) I/O. Second, index building may require heavy ETL operations such as parsing and transformation on the data blocks, consuming significant computation resources on a set of centralized index servers. While these problems seem simple, they can cause the cost of the indexing solution to become unacceptable at scale. In our implementation, we tackle these problems through a combination of caching (e.g., caching data blocks on the ingestion servers) and edge-enabled computing (e.g., on-the-fly parsing and indexing).

Achieving desired consistency. Since both the index and data are continuously changing, we explored two read APIs (for use by a query engine) with different consistency guarantees:

• Read committed: Returns the latest view of the dataset, consisting of all records that have been committed at the time of the query. In this mode, the query engine will exploit the indexes, to the extent possible, to prune unnecessary data reads, but still trigger a linear scan on any newly inserted data.

• Read snapshot: Returns a possibly stale version of the dataset. In this mode, the query engine will use only the indexes to get the records that satisfy the specified predicates on a possibly stale version of the dataset.

4. QUERY PROCESSING

Due to the existence of orphans in internal levels, index access in Helios cannot follow a strict "move down" protocol as in classic tree-based indexes. We therefore describe a hybrid index scan operator that supports both "move down" and "move right" when accessing index nodes — we assume, conceptually, that orphans are always on the right of non-orphans in each level of the index.

Algorithm 5: A hybrid index scan.
Input: Z, zone index with n levels; k, the search key.
1
2  function Search(N{⟨K, P⟩}, k):
3      if N ∈ L0 then
4          foreach ⟨K, P⟩ s.t. k ∈ K do
5              foreach p(B) ∈ P do
6                  Scan the data block B for records matching k;
7      else
8          foreach ⟨K, P⟩ s.t. k ∈ K do
9              foreach p(N′) ∈ P do
10                 Search(N′, k);
11
12 main loop:
13 for n ≥ i ≥ 1 do
14     foreach N{⟨K, P⟩} ∈ Li s.t. N is an orphan do
15         Search(N, k);

Algorithm 5 illustrates the details. The main loop (lines 13 to 15) basically executes the "move right" operation by scanning the orphans level by level in a top-down manner. When scanning each index level, we invoke the Search function (lines 2 to 10) over each orphan node. For example, the red arrows in Figure 6 showcase this scan over orphans. The Search function executes the "move down" operation by recursively inquiring the child nodes pointed to by the current index node being searched, if the search key has been found within the current node (lines 8 to 10).

The index search algorithm naturally supports the two aforementioned consistency levels for queries: read committed and read snapshot. If a query is only interested in reading the latest snapshot, then it only needs to scan the index until the rightmost leaf node has been accessed. The watermark of that leaf node serves as the version number of the corresponding snapshot, due to the following property:

PROPERTY 3. Let Z be a zone index with n levels. Let wi be the watermark of the index level Li (0 ≤ i ≤ n). Then w0 ≤ w1 ≤ · · · ≤ wn.

The rationale is that, since index updates are asynchronous and bottom-up, a lower index level always contains fresher data/index information. Indeed, a query can further access any snapshot rather than just the latest one. Given the version number (i.e., timestamp) of the desired snapshot, one can simply follow the index search algorithm until an orphan node with a larger timestamp is reached.

On the other hand, if a query wants to fetch all committed data, then after finishing searching the index, we can switch to scanning the progress log for the latest data blocks that have not been indexed yet. This linear scan, however, is bounded — only the log tail between the watermark w0 of the leaf level and the latest committed data time t is interesting. Such a bounded property also offers opportunities for further performance tuning: if the time range [w0, t] becomes so large that it hurts overall query execution performance, then one should trigger the index compaction procedure more frequently to allow more fresh data blocks to be absorbed into the index.
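A self-contained sketch of the hybrid scan follows, with our own orphan bookkeeping and identity hashing; the production operator additionally consults the per-level watermarks of Property 3 for snapshot reads:

class Node:
    def __init__(self, entries, level=0):
        self.entries = entries    # key -> set of child ids / block URIs
        self.level = level

def search(node, k, nodes, hits):
    # "Move down" (lines 2-10): recurse into children whenever the
    # current node claims to contain the search key.
    for key, ptrs in node.entries.items():
        if key == k:
            if node.level == 0:
                hits.update(ptrs)              # data blocks to scan
            else:
                for child in ptrs:
                    search(nodes[child], k, nodes, hits)

def hybrid_scan(orphans_by_level, k, nodes):
    # "Move right" (lines 12-15): visit orphans level by level,
    # top-down; non-orphans are reachable from some orphan above.
    hits = set()
    for level in sorted(orphans_by_level, reverse=True):
        for node_id in orphans_by_level[level]:
            search(nodes[node_id], k, nodes, hits)
    return hits

nodes = {"n01": Node({"adam": {"b1"}, "john": {"b1"}}),
         "n02": Node({"john": {"b2"}}),
         "n03": Node({"adam": {"b3"}}),                # fresh orphan leaf
         "n11": Node({"adam": {"n01"}, "john": {"n01", "n02"}}, 1)}
print(hybrid_scan({1: ["n11"], 0: ["n03"]}, "adam", nodes))  # {'b1', 'b3'}

For read committed, the engine would additionally scan the bounded progress-log tail [w0, t] for blocks not yet absorbed into the index.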
Cost-based Index Access Path Selection. The hybrid index scan operator introduces an interesting performance trade-off. Indeed, one can stop exploring orphan nodes at any level and switch to scanning the remaining leaf nodes afterwards. For example, in Figure 6 one can skip N13 and instead directly scan the remaining leaf nodes N05 to N08. A natural question is then which strategy leads to faster overall access performance. If N13 does not contain the search key k, then one can just skip N07, since it must not contain k either. In this case, accessing N13 is better than skipping it. On the other hand, if N13 does contain k, then one has to also access N07 for an additional look-up. In this case, accessing N13 is worse than skipping it, as the access to N13 results in additional overhead. One can leverage a cost-based approach to address this access path selection problem, by maintaining statistics (e.g., bloom filters) that can provide hints as to whether an index node is worth accessing. We leave this interesting direction as future work. Note that, while one can in theory also skip internal nodes and jump to leaf nodes directly in classic tree-based indexes, it is seldom beneficial there, as each index level maintains a global order and only one of the child nodes needs further look-up.

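Purely as an illustration of the decision that this future-work direction would automate (with a plain Python set standing in for a bloom filter, and the hypothetical node names of Figure 6):

def worth_accessing(node_filter, k):
    # A bloom filter answers "definitely absent" or "maybe present".
    # "Definitely absent" prunes the node's whole subtree (e.g., N13
    # and the leaves it covers, such as N07) with no further I/O;
    # "maybe present" means descending is likely to pay off.
    return k in node_filter        # exact set as a filter stand-in

print(worth_accessing({"john"}, "adam"))   # False: prune N13 and N07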
5. PRACTICAL CONSIDERATIONS

In this section, we discuss some critical design questions and their associated trade-offs that helped towards operationalizing the ideas behind Helios and making the system scale in production.

5.1 How to Scale-out Index Generation?

Indexing is a computation-intensive step and may include decompression, parsing, and transformation of the raw data ingested, as well as any indexing prerequisites such as sorting and summarization. Collectively, Helios collects ≈ 12 PB/day of raw data across 15 data centers. Our initial approach towards data and index management focused on bringing in data from all the sources into an ingestion cluster and performing the required post-processing operations (e.g., parsing, indexing, etc.) there. However, this approach incurs significant costs (e.g., transmitting all compressed data to ingestion and then decompressing it, etc.), defeating the important purpose of reducing the capital and operational expenses of maintaining our production clusters.

The hierarchical/recursive indexing model gives us the freedom of distributing our computation across the agents and ingestion backend servers, sharing the same intuition as the current trend of edge computation. Following this idea, we extracted the core data processing and index building logic as a minimalistic data processing library called Quark, implemented in C++. Quark is shared by agents, ingestion backend servers, and multiple query engines. At a high level, Quark consists of data writers and readers.

• Writers offer a dataflow execution engine that allows our users to specify how data from the producers should be processed before it is stored in an appropriate (e.g., columnar) format. To this extent, Quark allows users to construct data and index blocks by leveraging one or more of its built-in components: parsers (that chunk incoming data into individual records, e.g., line parsers), extractors (that allow extraction of column-value pairs from a record, e.g., JSON, regex extractors), indexers (that allow users to summarize and/or hold pointers to their data, e.g., inverted index, sketches), partitioners (that partition the incoming data by a specific column, e.g., timestamp partitioner), serializers (that serialize the data and index), and compressors (that allow for columnar compression).

• Readers expose query processing operators such as SELECT and PROJECT with the ability to push down predicates. The engine allows for columnar data retrieval and vectorized execution.

The distribution of ingestion workload is a trade-off between COGS² and resource constraints on the data source machines — having an agent will reduce the size of the ingestion cluster needed, but will consume processing power from the source machine. As agents can be deployed to heterogeneous sources, including computation-intensive service machines or resource-constrained devices, carefully constraining the resource utilization of the computation on the agent is critical. In practice, we also allow users to impose strict restrictions on the agent's CPU and memory consumption.

²COGS ("cost of goods sold") is a standard economic term used widely in the technology industry to refer to the cost of procuring and maintaining systems. Our goal is to have lower COGS.

[Figure 7: Agent latency distribution for various stages in its data processing lifecycle.]

[Figure 8: (a) CPU consumption; (b) Memory consumption.]

Figure 7 shows the distribution of the time spent by the agent at various stages in the data processing lifecycle. The "Data Processing" stage captures the latency between the agent observing a change in the underlying file on the source machine and finishing its processing. Next, the "Serialization/Compression" stage captures the time to compress the processed data (e.g., columnar, indexed) and serialize it into a binary format. Finally, the "Successful Upload" and "Retry" stages capture the time it takes to transmit the binary blob to the ingestion backend during normal operation and in the presence of a failure, respectively. We observe that the important CPU cycles correspond to the first two stages — our goal has been to minimize these to the extent possible so we can satisfy user-provided resource constraints. As shown, our current agent deployments take up 10%-30% of the allowed batching time (2-5 minutes) for the multiple streams being ingested on each machine. Figure 8(a) and Figure 8(b) show the respective CPU and memory consumption distributions of our agents across all our deployments. We observe that the agent consumes 15%-65% of a single CPU core to do all the processing of tens of high-volume data streams from each customer machine and remains under an allocated memory budget of 1.6 GB.

(Fixed Time vs. Fixed Size) In the example index creation DDL shown in Figure 2, we demonstrated the "fixed chunk (time interval)" policy. It is possible to replace it by a "fixed (chunk) size" policy. However, there is a certain performance impact in doing so. If the chunk size threshold is too small, then it results in very frequent index updates that can hurt index accessibility. On the other hand, if the threshold is too large, then the index may significantly lag behind the progress log due to untimely index updates.
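The two policies can be sketched as a single agent-side buffer with two flush triggers (illustrative thresholds and names, not production defaults):

import time

class Chunker:
    # Buffer records and cut a data block on either a fixed-time or a
    # fixed-size trigger: the CHUNK EVERY 1 MINUTE policy of Figure 2
    # versus a fixed (chunk) size policy.
    def __init__(self, max_age_s=60, max_bytes=None):
        self.max_age_s, self.max_bytes = max_age_s, max_bytes
        self.buf, self.opened = [], time.time()

    def offer(self, record: bytes):
        self.buf.append(record)
        too_old = (self.max_age_s is not None
                   and time.time() - self.opened >= self.max_age_s)
        too_big = (self.max_bytes is not None
                   and sum(map(len, self.buf)) >= self.max_bytes)
        if too_old or too_big:
            block, self.buf = b"".join(self.buf), []
            self.opened = time.time()
            return block      # ship to ingestion; triggers index update
        return None           # keep buffering

# A small size threshold means frequent index updates (hurts index
# accessibility); a large one means the index lags the progress log.
chunker = Chunker(max_age_s=None, max_bytes=64)
blocks = [b for r in [b"x" * 20] * 10 if (b := chunker.offer(r))]
print(len(blocks))   # 2 blocks cut at the 64-byte threshold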

5.2 How to Provide Fault Tolerance?

The progress log (described in Section 3.3) plays a central role in coordinating the data and the index layers. It is the only strongly synchronized module in the Helios design, and it directly affects the write throughput that users experience, since writing to the progress log is part of the transaction that commits data into the data layer. For low or moderate ingestion rates, we found it reasonable to rely on an existing database service (Azure SQL, in our case). However, for high-ingestion scenarios, we quickly realized that relying on a heavy component such as a remote database would cause tremendous back-pressure on the entire system.

In our production environments, partly inspired by the need for a fast durable log [14], we implement our progress log using the RSL-HK library [61], which exposes viewstamped replication [57, 67], consensus [47, 48], checkpointing, and recovery mechanisms. The progress log is maintained as replicated in-memory Hekaton [36] tables, with an auto-increment SequenceId as the primary key. This design guarantees datacenter-scoped global uniqueness and monotonicity of the SequenceId. (In our production environment, each datacenter manages its local progress log, which is sufficient since all Cosmos jobs, and their related Helios log analysis queries, are confined to their local datacenters; in other words, there are no cross-datacenter Cosmos jobs and hence no Helios queries demanding a span across datacenters. Within each datacenter, the progress log is multi-partitioned, and the partitions are kept strongly consistent using our implementation based on RSL-HK.) Write operations into the log are transactions with ACID semantics, realized via the optimistic, lock-free techniques in Hekaton, which include writing the transaction log on the primary and replicating it to the replicas. The underlying replicated log is modeled as an infinite redundant stream, i.e., an append-only ordered collection of 4 KB blocks. As periodic checkpoints are created, the logs are truncated. The progress log is deployed in quorum-based rings, usually consisting of an odd number of servers (5, in our case) that are appropriately distributed across failure domains. All the metadata updates are routed to the primary node in each ring/partition.

Despite the fault tolerance offered by Paxos, there are limitations to using a single log: (1) communication latencies limit overall throughput, especially when replicas are spread over a wide area, and (2) progress slows down when a majority fail to acknowledge writes. Therefore, to improve availability and throughput, we use multiple replicated logs, each governing its own partition of the data set. This design allowed us to increase our write throughput from a few hundred writes per second to hundreds of thousands of writes per second. In our production environments, we have observed a consistent rate of 55,000 writes/sec and 90,000 reads/sec for metadata persisted to and retrieved from a single partition of the underlying metadata layer.
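The sketch below models one partition of the progress log described above: the partition hands out monotonically increasing sequence ids, and an append is acknowledged only once a majority quorum accepts it. All names and the one-line quorum logic are simplified stand-ins for the RSL-HK-based implementation, with replication reduced to a stub.

```cpp
// Simplified model of one progress-log partition (illustrative, not the RSL-HK API).
#include <atomic>
#include <cstdint>
#include <string>
#include <vector>

struct LogEntry {
  uint64_t sequence_id;             // monotone and unique within the partition
  std::string data_block_location;  // where the committed data block lives
};

class ProgressLogPartition {
 public:
  explicit ProgressLogPartition(int replicas) : replica_count_(replicas) {}

  // Appending is part of the commit transaction: it succeeds only once a
  // majority quorum (primary + replicas) has durably accepted the entry.
  bool Append(const std::string& block_location, uint64_t* out_seq) {
    LogEntry e{next_seq_.fetch_add(1), block_location};
    int acks = ReplicateToReplicas(e);                    // ship to replicas (stubbed)
    if (acks + 1 < replica_count_ / 2 + 1) return false;  // +1 counts the primary
    entries_.push_back(e);
    *out_seq = e.sequence_id;
    return true;
  }

  // Index servers scan forward from the last indexed sequence id to discover
  // the "lagging part" that the asynchronous index has not yet caught up with.
  std::vector<LogEntry> ReadFrom(uint64_t seq) const {
    std::vector<LogEntry> out;
    for (const auto& e : entries_)
      if (e.sequence_id >= seq) out.push_back(e);
    return out;
  }

 private:
  int ReplicateToReplicas(const LogEntry&) { return replica_count_ - 1; }  // stub: all ack
  int replica_count_;
  std::atomic<uint64_t> next_seq_{0};
  std::vector<LogEntry> entries_;
};
```

Deploying many such partitions, each with its own quorum ring, is what moves the aggregate write rate from a few hundred to hundreds of thousands of writes per second.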
5.3 How to Optimize Index Reads/Writes?

Our initial design of the global index contained a strongly synchronized and centralized component (i.e., all updates had to be applied on a set of central servers) and a primary focus on tuple-level strong consistency guarantees; index creation was done synchronously with data ingestion. While this design kept up with production loads during the first year, we started observing operational problems as we expanded and the number of inserts crossed trillions per day. Partitioning helped to some extent, but the overall solution was becoming increasingly expensive, as a single replicated partition was unable to keep up with the required write throughput. What is worse, reads were expensive, as they had to sift through a large search space consisting of trillions of keys.

To improve writes, we exploited the fact that Helios does not require strong consistency between the index and the data layer, and decoupled index maintenance from data ingestion. Although the index layer always lags the data layer, Helios can still leverage the index to the best extent possible: it can use the progress log to figure out the lagging part and linearly scan only that part. This approach removes the synchronization between the data layer and the index layer, speeding up writes and reducing back-pressure.
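A minimal, self-contained sketch of this read path follows; the in-memory structures are simplified stand-ins. The query is answered from the (possibly stale) index up to the highest indexed sequence id, and the remainder is recovered by linearly scanning only the blocks beyond that point.

```cpp
// Sketch of querying over a lagging index: index lookup plus a linear scan of
// the tail reported by the progress log (all structures are simplified).
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Block { uint64_t seq; std::vector<std::string> records; };

int main() {
  // Data layer: blocks committed through the progress log, ordered by sequence id.
  std::vector<Block> log = {{1, {"jobid=job123 ok"}},
                            {2, {"jobid=job456 fail"}},
                            {3, {"jobid=job123 retry"}}};

  // Index layer: built asynchronously, has only caught up through seq 2.
  uint64_t indexed_upto = 2;
  std::map<std::string, std::vector<uint64_t>> index = {{"job123", {1}}, {"job456", {2}}};

  const std::string key = "job123";
  std::vector<uint64_t> hits = index[key];  // 1) fast path: consult the index
  for (const auto& b : log)                 // 2) slow path: scan only the lagging tail
    if (b.seq > indexed_upto)
      for (const auto& r : b.records)
        if (r.find(key) != std::string::npos) hits.push_back(b.seq);

  for (uint64_t s : hits) std::cout << "match in block seq " << s << "\n";
}
```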
One side-effect of making everything asynchronous is that the index servers (both primaries and replicas) now need on-demand access to the underlying index blocks. A naive implementation would require multiple accesses to the underlying storage to retrieve these index blocks. To avoid this problem, we utilize a local cache spanning memory and SSD across the index servers. Therefore, if an index block was ever retrieved in the course of a hydration process of any server, that block becomes available to every other server, saving extra I/Os to the storage system.
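The two-tier structure of such an index-block cache can be sketched as follows; the class shape, the capacity, and the coarse LRU spill policy are hypothetical simplifications, with the SSD tier modeled as a second in-memory map.

```cpp
// Sketch of a two-tier (memory + SSD) index-block cache (illustrative only).
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>

class BlockCache {
 public:
  std::optional<std::string> Get(const std::string& block_id) {
    if (auto it = memory_.find(block_id); it != memory_.end()) return it->second;
    if (auto it = ssd_.find(block_id); it != ssd_.end()) {
      std::string bytes = it->second;  // copy before Put may touch the SSD tier
      Put(block_id, bytes);            // promote the hot block back into memory
      return bytes;
    }
    return std::nullopt;  // miss: caller hydrates from remote storage, then calls Put
  }

  void Put(const std::string& block_id, const std::string& bytes) {
    memory_[block_id] = bytes;
    lru_.push_back(block_id);
    while (memory_.size() > kMemoryCapacity) {  // spill the coldest block to SSD
      std::string victim = lru_.front();
      lru_.pop_front();
      auto it = memory_.find(victim);
      if (it == memory_.end()) continue;  // stale LRU entry; block already spilled
      ssd_[victim] = it->second;
      memory_.erase(it);
    }
  }

 private:
  static constexpr std::size_t kMemoryCapacity = 1024;   // blocks kept in memory
  std::list<std::string> lru_;                           // coarse LRU order
  std::unordered_map<std::string, std::string> memory_;  // fast tier
  std::unordered_map<std::string, std::string> ssd_;     // stand-in for the SSD tier
};
```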

To improve reads, we resort to building multiple layers within the index. Upon accumulation of enough data at the leaf layers, the index triggers either a Merge or an Add (described in Section 3.1.3) as part of index maintenance. The higher layers in the index are small enough to be held in main memory and support fast queries on the index. In our experiments, we observed that all the necessary compactions (L0: Base → L1: Base Compacted → L2: 2nd Compaction) for 24 hours worth of data take 2 hours, and that the final compaction (L2: 2nd Compaction → L3: Hashed) for the same volume takes less than 30 minutes.

Figure 9: Index size variation across levels.

Figure 9 presents the sizes of the different index levels currently seen in our production environment.
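The level-promotion logic can be sketched as follows, with level names mirroring the L0-L3 progression above and an arbitrary size threshold standing in for the actual policy; the Merge operation is reduced to its simplest form.

```cpp
// Sketch of policy-driven level promotion in a multi-level index (hypothetical
// thresholds; Merge/Add are stand-ins for the operations in Section 3.1.3).
#include <cstddef>
#include <iostream>
#include <vector>

enum Level { kBase = 0, kBaseCompacted, kSecondCompaction, kHashed };

struct IndexBlock { Level level; std::size_t bytes; };

// Promote the blocks at `from` to the next level once enough data accumulates.
void MaybeCompact(std::vector<IndexBlock>& blocks, Level from, std::size_t threshold) {
  std::size_t accumulated = 0;
  for (const auto& b : blocks)
    if (b.level == from) accumulated += b.bytes;
  if (accumulated < threshold) return;  // policy not triggered yet

  // Merge every block at `from` into a single block at the next level
  // (an Add-style policy would append a new block instead of merging).
  std::vector<IndexBlock> result;
  for (const auto& b : blocks)
    if (b.level != from) result.push_back(b);
  result.push_back({static_cast<Level>(from + 1), accumulated});
  blocks = std::move(result);
}

int main() {
  std::vector<IndexBlock> blocks = {{kBase, 400}, {kBase, 300}, {kBase, 350}};
  MaybeCompact(blocks, kBase, 1000);  // 1050 accumulated >= 1000: promote to L1
  for (const auto& b : blocks)
    std::cout << "level=" << b.level << " bytes=" << b.bytes << "\n";
}
```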

(Read Committed vs. Read Snapshot) While an experimental evaluation comparing read-committed and read-snapshot semantics is attractive, performing such an evaluation in our production environment is difficult, as customers choose either read-committed or read-snapshot, but not both. In general, read-snapshot is more efficient than read-committed, and we observe that customers sometimes prefer it. Meanwhile, read-committed allows for reading more recent data and is favorable for applications with more stringent requirements on data freshness.

6. REAL-WORLD USE CASES OF HELIOS

Over the last few years, we have witnessed a large variety of applications built by internal customers using the Helios blueprint. The core technology behind Helios allows for an adjustable cost model and has been used as a substrate for building several verticals (domain-specific applications that cater to different market segments), such as interactive log search and analytics, indexing-as-a-service, and IoT analytics, in various teams across Microsoft. In this section, we present two representative use cases of Helios within Microsoft.

Monitoring/Debugging Microsoft's Big Data Systems. Cosmos (a.k.a. Azure Data Lake) is the largest data-parallel cluster within Microsoft. In such large systems, problems become more of a norm than an exception. As a consequence, debugging becomes very difficult due to the complexity of the system and the scale of the log information being collected; query engines have to be in place both for quickly pinpointing problems and for more sophisticated root cause analysis. The computational costs of scanning, filtering, and joining the log data in its raw form are prohibitive. The Helios Log Analytics Service (HLAS) is regularly used internally by 100s of personas (e.g., incident managers, service reliability engineers, and developers) for debugging Cosmos/ADL; it collects 10s of PBs of log data per day (service footprint) and provides a first point of entry for debugging the Cosmos and Azure Data Lake services. We have naturally observed and tuned HLAS for a wide variety of workloads, which we summarize below:
• Needle-in-a-haystack Workloads: For persons on the critical path of operations (e.g., directly responsible individuals [55], site reliability engineers [39]), the primary goal is searching for specific lines (the needle) in a large stream of logs (the haystack). These queries often contain multiple attributes (e.g., timestamps, application identifiers such as JobId or UserId) and touch only a portion of the data (typically, less than 10 GBs).
• Impact and Drill-down Analysis: An important first question to answer when an alert is triggered is whether there was an impact, a term that is highly domain-dependent. We have observed domain experts express complex impact calculations through sophisticated HLAS scripts. These are highly expressive workloads, often retrieving multiple TBs of data, performing complex JOINs, and outputting domain-specific metrics.
• Performance Monitoring and Reporting: HLAS is also used for sending out daily reports to service owners (through recurring jobs) and for enabling smart alerts (i.e., every alert carries a link to a query script that can retrieve the relevant logs).

Figure 10: Reduction in Query I/O due to indexes. (CDF of queries vs. percentage of data selected, shown across all partitions and within a partition.)

Figure 10 shows the CDF of the percentage of data selected (or, more specifically, index selectivity) across all our query workloads for queries of two types: (1) user queries that contain predicates that are not part of the partition keys of the original data (e.g., JobId == 'job123'), i.e., partition pruning is not possible; and (2) user queries containing predicates on both partition keys and indexed columns (e.g., Filename == 'cosmosErrorLog' ∩ JobId == 'job123'), i.e., partition pruning is possible. We observe that for the former case, the index allows the underlying query engine to prune more than 60% of a partition at the 95th percentile. For queries where only the indexed column is specified in the predicates, the index is even more effective: it helps prune more than 99.99% of the data, allowing us to provide queries that execute in less than a few seconds over multiple terabytes of data. We would like to note that we calculated the percentage of data selected assuming a one-month retention, i.e., we evaluate what portion of one month's data the index helped prune. As the retention of the data increases, the index may help prune a larger portion of the underlying data.
Indexing Privacy Attributes for Achieving GDPR Compliance. In an effort to protect the personal data and privacy of EU citizens for transactions that occur within EU member states, the European Parliament adopted the General Data Protection Regulation (GDPR) in April 2016, a protection directive expected to set a new standard for consumer rights regarding their data. The GDPR directive requires that companies holding EU citizen data provide a reasonable level of protection for personal data (e.g., identity information, browsing habits, biometric data), including erasing all personal data upon request (a.k.a. the right to be forgotten). Within Microsoft [54], this is implemented as a "Forget Me" button in the user's personal dashboard. Finding the list of data streams that contain the user's information requires a provenance graph (as the number of streams is in the 10s of billions) and an index (as each stream can span multiple petabytes) to avoid expensive linear scans. As part of processing incoming customer data, there are 1000s of daily indexing jobs that process 100s of TBs. Each indexing job builds an exact multi-level local index ( → Stream Region) and an approximate global index ( → Streams). An entity termed the "Delete Processor" then collects a batch of "Forget Me" requests, utilizes the indexes to resolve all the streams and records that contain data to be deleted, and then issues delete requests to the underlying store.
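A hedged sketch of the Delete Processor's batch flow appears below, under the assumption that the global index resolves candidate streams for a key and the local index resolves exact regions within a stream; all names and types are illustrative, not the production schema.

```cpp
// Sketch of a GDPR "Delete Processor" batch flow (illustrative names and types).
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using UserKey = std::string;

int main() {
  // Approximate global index: key -> candidate streams (may contain false positives).
  std::map<UserKey, std::set<std::string>> global_index = {
      {"user42", {"streamA", "streamB"}}};
  // Exact local indexes: (stream, key) -> regions within the stream to delete.
  std::map<std::pair<std::string, UserKey>, std::vector<int>> local_index = {
      {{"streamA", "user42"}, {3, 17}}, {{"streamB", "user42"}, {}}};

  std::vector<UserKey> forget_me_batch = {"user42"};  // collected "Forget Me" requests
  for (const auto& user : forget_me_batch) {
    for (const auto& stream : global_index[user]) {       // 1) resolve candidate streams
      const auto& regions = local_index[{stream, user}];  // 2) exact regions (filters
                                                          //    out global-index false hits)
      for (int region : regions)                          // 3) issue deletes to the store
        std::cout << "delete " << stream << " region " << region
                  << " for " << user << "\n";
    }
  }
}
```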
7. DISCUSSION & LESSONS LEARNED

The technology behind Helios has evolved over the last five years from feedback spanning hundreds of personas, including developers, site reliability engineers, customer support personnel, and data scientists. In this section, we focus on scenarios and expectations that are leading us to explore new directions.

Less is More for Massive Streams. The immense data volumes and limited human attention necessitate techniques that help identify the most relevant data and trends, so that non-expert users can issue queries. However, due to limited attention span, data access is still governed by reactive patterns (e.g., root cause analyses). In fact, we have observed that in our own case, less than 5% of the collected raw data is ever actually accessed. Of course, this does not reduce the demand for collecting raw data. There is, however, growing interest from our users in techniques that summarize data streams. One way of approaching the problem of summarizing massive data streams is to abstract a new class of streaming "interpretation" operators that allow users to aggregate and highlight commonalities of interest, and to package them in a way that non-expert users can utilize. While there has been extensive research in sketching and streaming data structures [23, 25, 26, 29, 30], adapting these to high volumes and deriving useful explanations is still an active research area [12].

Unified Query Experience and Optimizations. Today, it is customary for users to first figure out how to collect data into a big data system. Subsequently, a number of telemetry pipelines are written that process this data and produce results for reporting or dashboards. A significant drawback of this approach is the extra I/O (data needs to be written to disk first) and the delay between getting the data and computing the result (in our experience, a delay of 24 hours is not uncommon). We are observing significant demand from users for avoiding batch telemetry pipelines altogether. While moving all these pipelines into existing standard streaming engines is possible, it may be too costly to do all the processing in the cloud. The ingestion engine behind the Helios Log Analytics Service evolved from a simple data collection layer into a layer performing full-fledged indexing on high-volume streams. It is not unreasonable to imagine running batch queries on the same architecture [44] instead of having a dedicated batch query engine. In Helios, this translates to devising efficient techniques for splitting computation between end devices, the edge, and the cloud, which solves the problem of maintaining code that needs to produce the same result in two complex distributed systems. Based on our production experience, we posit that a single-engine model can (1) enable optimizations such as automatically converting recurring batch queries into streaming queries, dynamically mapping operators/computations to various layers (e.g., end device, edge, cloud), automated A/B testing for figuring out the best query execution plans, joint query optimization in multi-tenant scenarios, and automatic cost-based view materialization, and (2) significantly reduce the user and operational burden of having to learn and maintain multiple complex stacks.

IoT Scenarios. In IoT analytics scenarios, it is not unusual to have billions of devices that can sense, communicate, compute, and potentially actuate. Data streams coming from these devices will challenge traditional approaches to data management and contribute to the emerging paradigm of big data. Given the volume of data, the blueprint of Helios is a natural fit: agents can be deployed to customer-owned IoT sensors and edge devices to summarize data on-the-fly and make it available for querying with low latency. This also opens up new challenges in making the agent operate in resource-constrained (e.g., power-limited) environments.

Augment, Not Automate. The role of a data engineer is complex, error-prone, and involves a lot of experimentation with data. When setting up big data pipelines, many human and compute years are lost in fine-tuning them for target workloads, e.g., the common case of incorrectly partitioning the dataset without considering the incoming queries. Traditional solutions to these problems incorporate recommendation (e.g., for indexes [24], views [5], and partitions [6]) as a first-class user experience. While it is interesting to integrate index recommendation into Helios, this is perhaps insufficient: our mailing lists would still be filled with questions on why a certain query is slow. We believe the space is ripe for disruption. While there is already significant research happening in fully automating database concepts [59], we would like to see more research in the area of "explainable recommendations," which provides users with not only recommendation results but also explanations that clarify why such strategies have been recommended.
8. RELATED WORK

Helios is a blueprint that builds on a long history of techniques from distributed systems (e.g., [4, 15, 16, 19, 21, 38, 51, 52, 61, 66, 69]), specialized for processing and indexing massive data streams with reduced capital and operational costs. The query execution techniques within Helios are similar to those described in the shared-nothing literature [34, 35]. The data storage and layout within Helios share properties with other well-described systems including BigTable [22], HBase [7], Cassandra [46], Megastore [13], Yahoo! PNUTS [28], and Dynamo [33]. Rather than striving to become a "one-size-fits-all" system, Helios adopts a philosophy of "made-to-measure": users have total (and easy) control over splitting their streaming indexing job between end devices, edge, and cloud. This view enables users to choose among a wide range of cost models. For instance, users can choose to pay less (by allowing the agent to do pre-processing and indexing on-premise) or pay more (by letting the cloud do all the indexing).

To avoid concurrent actions across data centers, some systems, such as Amazon's Dynamo [33], resort to weaker consistency semantics like eventual consistency, where state can temporarily diverge. Others, such as Yahoo! PNUTS [28], avoid state divergence by requiring all operations that update state to be funneled through a primary site, thus incurring increased latency. Our global index offers timeline/snapshot consistency [13, 28]. The design ideas behind the global index are similar to Deuteronomy [49] and Aurora [68]. The key idea behind scaling the global index was to leverage a replicated log to store the physical location of the delta index, rather than the index itself.
The emergence of edge computing has raised new challenges for big data systems [32, 56]. In typical application scenarios of edge computing, such as smart cities, operational monitoring, and the Internet of Things, continuous data streams must be processed under stringent latency constraints. In recent years, a number of distributed streaming systems have been built via either open-source or industry effort (e.g., Storm [65], Spark Streaming [10], Flink [18], MillWheel [8], Dataflow [9], Quill [20]). These systems, however, adopt a cloud-based, centralized architecture that does not include any "edge computing" component: they typically assume an external, independent data ingestion pipeline that directs edge streams to cloud storage endpoints such as Kafka [45] or Event Hubs [2]. From this point of view, Helios is an effort towards building a new genre of big data systems that combine the cloud and the edge into a single, holistic data processing platform by pushing significant computation down to the Helios agents running on the edge. Moreover, unlike Helios, existing distributed streaming systems (e.g., Storm, Spark Streaming) do not support indexing, as far as we know.

Tree-structured indexes are ubiquitous in database management systems for accelerating query processing. Typical examples include those that target general relational queries, such as B-trees [27] and LSM-trees [58], and those that target special classes of queries, such as R-trees [40] and quad-trees [62] for spatial data. Due to this ubiquity, there have even been proposals of generalized search trees (GiST) for database systems [41], with their own concurrency and recovery models [42], which have been implemented in PostgreSQL. The design and implementation of the zone indexes in Helios are mostly inspired by these previous works on tree-structured indexes. However, as Helios operates in hyper-scale data ingestion scenarios, we have to loosen the requirement of a global sort order within each index level, as typically enforced by tree-structured indexes. We draw inspiration from LSM-trees by employing policy-based mechanisms to merge newly ingested data into Helios indexes periodically and asynchronously. We have also made the operations over Helios indexes and the corresponding policies as general as possible, in the spirit of GiST. In recent years, there has also been work on index structures that utilize modern hardware. For example, the Bw-tree [50] is an index structure mainly designed for in-memory transaction processing. It cannot be easily adapted for the big data scenarios that Helios tackles, where query workloads are analytical and I/O (instead of memory and CPU) is the key performance factor to be optimized. Moreover, in the big data context, indexes need to be partitioned and distributed across multiple machines, which is also not supported by the Bw-tree to the best of our knowledge.

9. CONCLUSION

We have described Helios, a distributed system for flexible ingestion, indexing, and aggregation of large streams of real-time data at Microsoft. We described the Helios index, the data structure backing our production system. The hierarchical and recursive nature of the Helios index allows each layer to be (1) computed at different stages of the data collection life cycle, (2) held at a different tier in the storage layer, and (3) utilized independently at query time. This flexibility allows for a full spectrum of cost and performance trade-offs that users can choose from, while operating reliably at scale. Helios builds and maintains indexes on thousands of columns over petabytes of data per day. Helios clusters have been in production for the last five years, collecting close to a quadrillion log lines per day from hundreds of thousands of machines spread across tens of datacenters. Helios serves several internal use cases at Microsoft and has been deployed for mission-critical debugging/diagnostic needs. The system has also served as a reference blueprint for other large-scale systems within Microsoft. Some of the core query optimization techniques from Helios have been open-sourced as part of Hyperspace [3].

10. REFERENCES
[1] Azure SQL Data Warehouse. https://docs.microsoft.com/en-us/azure/sql-data-warehouse/.
[2] Event Hubs. https://azure.microsoft.com/en-us/services/event-hubs/.
[3] Hyperspace. https://github.com/microsoft/hyperspace.
[4] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A new model and architecture for data stream management. VLDB, 2003.
[5] S. Agrawal, S. Chaudhuri, and V. R. Narasayya. Automated selection of materialized views and indexes in SQL databases. In VLDB, pages 496-505, 2000.
[6] S. Agrawal, V. R. Narasayya, and B. Yang. Integrating vertical and horizontal partitioning into automated physical database design. In SIGMOD, pages 359-370, 2004.
[7] A. S. Aiyer, M. Bautin, G. J. Chen, P. Damania, P. Khemani, K. Muthukkaruppan, K. Ranganathan, N. Spiegelberg, L. Tang, and M. Vaidya. Storage infrastructure behind Facebook Messages: Using HBase at scale. IEEE Data Eng. Bull., 2012.
[8] T. Akidau, A. Balikov, K. Bekiroglu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: Fault-tolerant stream processing at internet scale. PVLDB, 6(11):1033-1044, 2013.
[9] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. PVLDB, 8(12):1792-1803, 2015.
[10] M. Armbrust, T. Das, J. Torres, B. Yavuz, S. Zhu, R. Xin, A. Ghodsi, I. Stoica, and M. Zaharia. Structured Streaming: A declarative API for real-time applications in Apache Spark. In SIGMOD, pages 601-613, 2018.
[11] M. Armbrust et al. Spark SQL: Relational data processing in Spark. In SIGMOD, pages 1383-1394, 2015.
[12] P. Bailis, E. Gan, K. Rong, and S. Suri. MacroBase, a fast data analysis engine. In ACM SIGMOD, 2017.
[13] J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In CIDR, 2011.
[14] P. A. Bernstein, C. W. Reid, and S. Das. Hyder - a transactional record manager for shared flash. In CIDR, volume 11, pages 9-20, 2011.
[15] T. Bingmann, M. Axtmann, E. Jöbstl, S. Lamm, H. C. Nguyen, A. Noe, S. Schlag, M. Stumpp, T. Sturm, and P. Sanders. Thrill: High-performance algorithmic distributed batch data processing with C++. In IEEE Big Data, 2016.
[16] S. Bykov, A. Geller, G. Kliot, J. R. Larus, R. Pandya, and J. Thelin. Orleans: Cloud computing for everyone. In Procs. of SOCC. ACM, 2011.
[17] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci, et al. Windows Azure Storage: A highly available cloud storage service with strong consistency. In SOSP, 2011.
[18] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream and batch processing in a single engine. IEEE Data Eng. Bull., 38(4):28-38, 2015.
[19] R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265-1276, 2008.
[20] B. Chandramouli, R. C. Fernandez, J. Goldstein, A. Eldawy, and A. Quamar. Quill: Efficient, transferable, and rich analytics at scale. PVLDB, 9(14):1623-1634, 2016.
[21] B. Chandramouli, J. Goldstein, M. Barnett, R. DeLine, J. C. Platt, J. F. Terwilliger, and J. Wernsing. Trill: A high-performance incremental query processor for diverse analytics. PVLDB, 8(4):401-412, 2014.
[22] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. TOCS, 2008.
[23] M. Chao. A general purpose unequal probability sampling plan. Biometrika, 69(3):653-656, 1982.
[24] S. Chaudhuri and V. R. Narasayya. An efficient cost-driven index selection tool for Microsoft SQL Server. In VLDB, pages 146-155, 1997.
[25] M. Chen, A. X. Zheng, J. Lloyd, M. I. Jordan, and E. Brewer. Failure diagnosis using decision trees. In Procs. of Autonomic Computing. IEEE, 2004.
[26] J. Cheng, Y. Ke, and W. Ng. A survey on algorithms for mining frequent itemsets over data streams. KIS, 2008.
[27] D. Comer. The ubiquitous B-tree. ACM Comput. Surv., 11(2):121-137, 1979.
[28] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. PVLDB, 1(2):1277-1288, 2008.
[29] G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 2012.
[30] G. Cormode, F. Korn, and S. Tirthapura. Exponentially decayed aggregates on data streams. In ICDE. IEEE, 2008.
[31] International Data Corporation. What will we do when the world's data hits 163 zettabytes in 2025? https://goo.gl/iwsPww, 2017.
[32] M. D. de Assunção, A. D. S. Veith, and R. Buyya. Distributed data stream processing and edge computing: A survey on resource elasticity and future directions. J. Network and Computer Applications, 103:1-17, 2018.
[33] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. SIGOPS, 2007.
[34] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen. The Gamma database machine project. TKDE, 1990.
[35] D. J. DeWitt, A. Halverson, R. Nehme, S. Shankar, J. Aguilar-Saborit, A. Avanes, M. Flasza, and J. Gramling. Split query processing in Polybase. In Procs. of SIGMOD. ACM, 2013.
[36] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server's memory-optimized OLTP engine. In Procs. of SIGMOD. ACM, 2013.
[37] L. Fife. How much does 1 hour of downtime cost the average business? https://goo.gl/fqqvTW, 2017.
[38] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In SIGOPS. ACM, 2003.
[39] Google. Site reliability engineering. https://goo.gl/YwqcQL, 2017.
[40] A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, pages 47-57, 1984.
[41] J. M. Hellerstein, J. F. Naughton, and A. Pfeffer. Generalized search trees for database systems. In VLDB, pages 562-573, 1995.
[42] M. Kornacker, C. Mohan, and J. M. Hellerstein. Concurrency and recovery in generalized search trees. In SIGMOD, pages 62-72, 1997.
[43] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In SIGMOD, pages 489-504, 2018.
[44] J. Kreps. Questioning the lambda architecture. https://goo.gl/5Es6N9, 2014.
[45] J. Kreps, N. Narkhede, J. Rao, et al. Kafka: A distributed messaging system for log processing. In Proceedings of NetDB, pages 1-7, 2011.
[46] A. Lakshman and P. Malik. Cassandra: A decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35-40, 2010.
[47] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2):133-169, 1998.
[48] L. Lamport et al. Paxos made simple. ACM SIGACT News, 32(4):18-25, 2001.
[49] J. J. Levandoski, D. B. Lomet, M. F. Mokbel, and K. Zhao. Deuteronomy: Transaction support for cloud data. In CIDR, 2011.
[50] J. J. Levandoski, D. B. Lomet, and S. Sengupta. The Bw-tree: A B-tree for new hardware platforms. In ICDE, pages 302-313, 2013.
[51] L. Mai, L. Rupprecht, A. Alim, P. Costa, M. Migliavacca, P. Pietzuch, and A. L. Wolf. NetAgg: Using middleboxes for application-specific on-path aggregation in data centres. In Procs. of CoNEXT. ACM, 2014.
[52] L. Mai, K. Zeng, R. Potharaju, L. Xu, S. Suh, S. Venkataraman, P. Costa, T. Kim, S. Muthukrishnan, V. Kuppa, S. Dhulipalla, and S. Rao. Chi: A scalable and programmable control plane for distributed stream processing systems. PVLDB, 11(10):1303-1316, 2018.
[53] R. C. Merkle. A digital signature based on a conventional encryption function. In CRYPTO, pages 369-378, 1987.
[54] Microsoft. GDPR compliance. https://goo.gl/2KkwMv, 2017.
[55] Microsoft. Rotating DevOps role improves engineering service quality. https://goo.gl/x63caG, 2017.
[56] M. Mohammadi, A. I. Al-Fuqaha, S. Sorour, and M. Guizani. Deep learning for IoT big data and streaming analytics: A survey. IEEE Communications Surveys and Tutorials, 20(4):2923-2960, 2018.
[57] B. M. Oki and B. H. Liskov. Viewstamped replication: A new primary copy method to support highly-available distributed systems. In Procs. of SPDC. ACM, 1988.
[58] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 1996.
[59] A. Pavlo et al. Self-driving database management systems. In CIDR, 2017.
[60] R. Ramakrishnan and J. Gehrke. Database Management Systems (3rd ed.). McGraw-Hill, 2003.
[61] R. Ramakrishnan, B. Sridharan, J. R. Douceur, P. Kasturi, B. Krishnamachari-Sampath, K. Krishnamoorthy, P. Li, M. Manu, S. Michaylov, R. Ramos, et al. Azure Data Lake Store: A hyperscale distributed file service for big data analytics. In Procs. of ICMD, pages 51-63. ACM, 2017.
[62] H. Samet. Hierarchical spatial data structures. In SSD, pages 193-212, 1989.
[63] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Readings in Artificial Intelligence and Databases, pages 511-522. Elsevier, 1988.
[64] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In MSST, 2010.
[65] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. V. Ryaboy. Storm@Twitter. In SIGMOD, pages 147-156, 2014.
[66] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, et al. Storm@Twitter. In Procs. of SIGMOD. ACM, 2014.
[67] R. Van Renesse and F. B. Schneider. Chain replication for supporting high throughput and availability. In OSDI, volume 4, pages 91-104, 2004.
[68] A. Verbitski, A. Gupta, D. Saha, M. Brahmadesam, K. Gupta, R. Mittal, S. Krishnamurthy, S. Maurice, T. Kharatishvili, and X. Bao. Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In Procs. of ICMD. ACM, 2017.
[69] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-tolerant streaming computation at scale. In SOSP. ACM, 2013.