Azure Data Lake Store: a Hyperscale Distributed File Service for Big Data Analytics
Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics Raghu Ramakrishnan*, Baskar Sridharan*, John R. Douceur*, Pavan Kasturi*, Balaji Krishnamachari-Sampath*, Karthick Krishnamoorthy*, Peng Li*, Mitica Manu*, Spiro Michaylov*, Rogério Ramos*, Neil Sharman*, Zee Xu*, Youssef Barakat*, Chris Douglas*, Richard Draves*, Shrikant S Naidu**, Shankar Shastry**, Atul Sikaria*, Simon Sun*, Ramarathnam Venkatesan* {raghu, baskars, johndo, pkasturi, balak, karthick, pengli, miticam, spirom, rogerr, neilsha, zeexu, youssefb, cdoug, richdr, shrikan, shanksh, asikaria, sisun, venkie}@microsoft.com ABSTRACT open-source Big Data system, and the underlying file system HDFS Azure Data Lake Store (ADLS) is a fully-managed, elastic, has become a de-facto standard [28]. Indeed, HDInsight is a scalable, and secure file system that supports Hadoop distributed Microsoft Azure service for creating and using Hadoop clusters. file system (HDFS) and Cosmos semantics. It is specifically ADLS is the successor to Cosmos, and we are in the midst of designed and optimized for a broad spectrum of Big Data analytics migrating Cosmos data and workloads to it. It unifies the Cosmos that depend on a very high degree of parallel reads and writes, as and Hadoop ecosystems as an HDFS compatible service that well as collocation of compute and data for high bandwidth and supports both Cosmos and Hadoop workloads with full fidelity. low-latency access. It brings together key components and features of Microsoft’s Cosmos file system—long used internally at It is important to note that the current form of the external ADL Microsoft as the warehouse for data and analytics—and HDFS, and service may not reflect all the features discussed here since our goal is a unified file storage solution for analytics on Azure.
[Show full text]