Differentiate Big Data Vs Data Warehouse Use Cases for a Cloud Solution

Total Page:16

File Type:pdf, Size:1020Kb

Differentiate Big Data Vs Data Warehouse Use Cases for a Cloud Solution Differentiate Big Data vs Data Warehouse use cases for a cloud solution James Serra Big Data Evangelist Microsoft [email protected] Blog: JamesSerra.com About Me ▪ Microsoft, Big Data Evangelist ▪ In IT for 30 years, worked on many BI and DW projects ▪ Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer ▪ Been perm employee, contractor, consultant, business owner ▪ Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference ▪ Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions ▪ Blog at JamesSerra.com ▪ Former SQL Server MVP ▪ Author of book “Reporting with Microsoft SQL Server 2012” I tried to understand the Microsoft data platform on my own… And felt like I was body slammed by Randy Savage: Let’s prevent that from happening… Agenda • Data Lake Driven Analytics • Compute technologies • Architecture Patterns Enterprise Data Maturity Stages STAGE 4: Transformative STAGE 3: Predictive Data transforms STAGE 2: business to drive Informative Data capture is desired outcomes. STAGE 1: Any data, any Reactive comprehensive and Structured data is scalable and leads source, anywhere at managed and business decisions scale Structured data is analyzed centrally based on advanced transacted and and informs the analytics locally managed. business Data used reactively Laggards Leaders (rear-view (Real-time mirror) intelligence) Benefits of the cloud for Big Data Agility • Unlimited elastic scale • Pay for what you need Innovation • Quick “Time to market” • Fail fast Risk • Availability • Reliability • Security Total cost of ownership calculator: https://www.tco.microsoft.com/ Need to collect any data Harness the growing and changing nature of data Structured Semi-structured Streaming “ ” Challenge is combining transactional data stored in relational databases with less structured data Big Data = All Data. Requires a data lake Get the right information to the right people at the right time in the right format Two Approaches to getting value out of data: Top-Down + Bottoms-Up How can we make it happen? Prescriptive What will Analytics happen? Theory Predictive Theory Analytics Why did Hypothesis Hypothesis it happen? Diagnostic Pattern What Observation Analytics happened? Observation Confirmation Descriptive Analytics Data Warehousing Uses A Top-Down Approach Understand Gather Implement Data Warehouse Corporate Requirements Reporting & Strategy Reporting & Analytics Design Analytics Business Development Requirements Dimension Modelling Physical Design ETL ETL Design Development Technical Requirements Data sources Setup Infrastructure Install and Tune The “data lake” Uses A Bottoms-Up Approach Ingest all data Store all data Do analysis regardless of requirements in native format without Using analytic engines schema definition like Hadoop Batch queries Devices Interactive queries Real-time analytics Machine Learning Data warehouse Exactly what is a data lake? A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed. • Inexpensively store unlimited data • Collect all data “just in case” • Store data with no modeling – “Schema on read” • Complements Enterprise data warehouse (EDW) • Frees up expensive EDW resources • Quick user access to data • ETL Hadoop tools • Easily scalable • Place to backup data to • Place to move older data (archive) Needs data governance so your data lake does not turn into a data swamp! Objectives ✓ Plan the structure based on optimal data retrieval ✓ Avoid a chaotic, unorganized data swamp Common ways to organize the data: Time Partitioning Data Retention Policy Probability of Data Access Year/Month/Day/Hour/Minute Temporary data Recent/current data Permanent data Historical data Applicable period (ex: project lifetime) etc… Subject Area etc… Confidential Classification Security Boundaries Business Impact / Criticality Public information Department High (HBI) Internal use only Business unit Medium (MBI) Supplier/partner confidential etc… Low (LBI) Personally identifiable information (PII) etc… Sensitive – financial Sensitive – intellectual property Downstream App/Purpose Owner / Steward / SME etc… Data Lake + Data Warehouse Better Together What happened? What will happen? Descriptive Predictive Analytics Analytics Why did it happen? How can we make it happen? Diagnostic Prescriptive Data sources Analytics Analytics Data Lake Data Warehouse Schema-on-read Schema-on-write Physical collection of uncurated data Data of common meaning System of Insight: Unknown data to do System of Record: Well-understood data to do experimentation / data discovery operational reporting Any type of data Limited set of data types (ie. relational) Skills are limited Skills mostly available All workloads – batch, interactive, streaming, Optimized for interactive querying machine learning Complementary to DW Can be sourced from Data Lake Data Warehouse Serving, Security & Compliance • Business people • Low latency • Complex joins • Interactive ad-hoc query • High number of users • Additional security • Large support for tools • Dashboards • Easily create reports (Self-service BI) • Know questions A data lake is just a glorified file folder with data files in it – how many end- users can accurately create reports from it? What is Azure Databricks? A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure Best of Databricks Best of Microsoft Designed in collaboration with the founders of Apache Spark One-click set up; streamlined workflows Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage) Enterprise-grade Azure security (Active Directory integration, compliance, enterprise -grade SLAs) Fully-managed Hadoop and Spark Azure for the cloud HDInsight 100% Open Source Hortonworks data platform Hadoop and Spark as a Service on Azure Clusters up and running in minutes Managed, monitored and supported by Microsoft with the industry’s best SLA Familiar BI tools for analysis, or open source notebooks for interactive data science 63% lower TCO than deploy your own Hadoop on-premises* *IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight” Hortonworks Data Platform (HDP) 3.0 (under the covers of HDInsight 4.0 – public preview) Simply put, Hortonworks ties all the open source products together (20) Have a mature Hadoop ecosystem already established that you want to migrate Want to be 100% open source Want to use Hadoop tools and make them available 24/7 at a lower cost Want to use certain tools like Kafka/Storm/HBase/R Server/LLAP/Pig/Beam/Flink KNOWING THE VARIOUS BIG DATA SOLUTIONS CONTROL EASE OF USE Azure Databricks Azure HDInsight Azure Marketplace HDP | CDH | MapR Reduced Administration BIG DATA BIG DATA ANALYTICS Any Hadoop technology, Workload optimized, Frictionless & Optimized any distribution managed clusters Spark clusters IaaS Clusters Managed Clusters Azure Data Lake Analytics Azure Data Lake Store Gen2 STORAGE Azure Storage BIG DATA Azure SQL Data Warehouse A relational data warehouse-as-a-service, fully managed by Microsoft. Industries first elastic cloud data warehouse with enterprise-grade capabilities. Support your smallest to your largest data storage needs while handling queries up to 100x faster. SMP vs MPP Azure Analysis Services Microsoft Products vs Hadoop/OSS Products Microsoft Product Hadoop/Open Source Software Product Office365/Excel OpenOffice/Calc Cosmos DB MongoDB, MarkLogic, HBase, Cassandra SQL Database SQLite, MySQL, PostgreSQL, MariaDB, Apache Ignite Azure Data Lake Analytics/YARN None Azure VM/IaaS OpenStack Blob Storage HDFS, Ceph Azure HBase Apache HBase (Azure HBase is a service wrapped around Apache HBase), Apache Trafodion Event Hub Apache Kafka Azure Stream Analytics Apache Storm, Apache Spark Streaming, Apache Flink, Apache Beam, Twitter Heron Power BI Apache Zeppelin, Apache Jupyter, Airbnb Caravel, Kibana HDInsight Hortonworks (pay), Cloudera (pay), MapR (pay) Azure ML (Machine Learning) Apache Mahout, Apache Spark MLib, Apache PredictionIO Microsoft R Open R SQL Data Warehouse/Interactive queries Apache Hive LLAP, Presto, Apache Spark SQL, Apache Drill, Apache Impala IoT Hub Apache NiFi Azure Data Factory Apache Falcon, Airbnb Airflow, Apache Oozie, Apache Azkaban Azure Data Lake Storage/WebHDFS HDFS Ozone Azure Analysis Services/SSAS Apache Kylin, Apache Druid, AtScale (pay) SQL Server Reporting Services None Hadoop Indexes Jethro Data (pay) Azure Data Catalog Apache Atlas PolyBase Apache Drill Azure Search Apache Solr, Apache ElasticSearch (Azure Search build on ES) SQL Server Integration Services (SSIS) Talend Open Studio, Pentaho Data Integration Apache Ambari (manage Hadoop clusters), Apache Ranger (data security such as row/column-level security), Apache Knox (secure entry point for Others Hadoop clusters), Apache Flume (collecting log data) Note: Many of the Hadoop/OSS products are available in Azure Custom apps BI Analytics SQL Server SQL master instance Compute pool Compute pool Compute pool Directly read from SQL Compute SQL Compute SQL Compute SQL Compute SQL Compute … HDFS Node Node Node Node Node Data mart Storage pool SQL Data SQL Data SQL SQL SQL Node
Recommended publications
  • Digital Commerce Transformation and Modern Customer Experience with Microsoft Azure and Billing Solutions with SAP Customer Experience
    The Microsoft Journey Digital Commerce Transformation and Modern Customer Experience with Microsoft Azure and Billing Solutions with SAP Customer Experience Matt Forrest (Principal Software Engineering Manager) & Cassandra Wong (Principal Program Manager) Nishant Vats (Technical Quality Manager) Session ID: ASUG83722 May 7 – 9, 2019 About the Speakers Matt Forrest Cassandra Wong Nishant Vats • Principal Software • Principal Program • Technical Quality Engineering Manager Manager, SAP Manager • Core Services • Core Services Engineering, Engineering, Microsoft Microsoft Agenda • Microsoft – history & background with SAP • Microsoft Online Commerce • Business Case for BRIM at Microsoft • Architecture & Process Monitoring Key Outcomes/Objectives 1. Understanding of E2E business process 2. Technical architecture leveraging BRIM on Azure 3. Understand how we’re leveraging Azure tools to get the best out of BRIM About Microsoft Over the years, major forces and innovations in our industry required Microsoft to transform Deliver Reliant & Agile Enable Modern Provide Real Time ERP Platform Experiences Processes Highly Up to Monitored Up to 17TB compressed 9M Dialog 300K Batch 300M Transaction database Steps/Day Jobs/Month steps/Month Named User SAP Surround Internal Users Non-SAPGUI (Mostly Indirect Accounts 110K 8K 96% users Strategy Access to SAP) Raw Seconds user SQL/Win 0.4 response time 99.998% Uptime Transaction Compression System growth Servers 15-30% Incident Ticket volume ever Storage 2x in past 2 years 2x ≈600 (100% virtual) 250TB Reduction
    [Show full text]
  • Azure Stream Analytics Reference Data Manual Upload
    Azure Stream Analytics Reference Data Manual Upload Squabbiest Patty encaged maestoso, he grubbed his Jamestown very chock-a-block. Glossographical and caressing Micheil phagocytose transcendentalizingjealously and sightsee his his ferula. Akhmatova how and precisely. Uniramous and interpenetrative Stevy always reran supremely and You could change the spring boot applications bound to optimize costs have to ﬕnd the jdbc connection string, azure stream analytics picks the specified Azure Stream Analytics HDInsight with Spark Streaming Apache Spark in. Oct 17 2019 Open Visual Studio and select to join a new Azure Functions project You. Upload UP Squared Sensor Data to Microsoft Azure Blob. Empower police data users with self-service building to data lakes using Presto Hive Spark talk about who best SQL engine for batch streaming interactive data processing more getting Ready Security Common Easy-to-Use UI Big as in the robust Single Platform. Monitoring and scouring technologies to elude and transfer data on users of illegal. The Internet of Things IoT Backend reference architecture demonstrates. Stream Analytics Query Language Reference Azure Stream Analytics offers a. Azure Log on Policy. If you disperse a run target system a predefined table may then edit custom table manually. Learn how you we utilize the Azure Data Lake swamp Stream Analytics Tools extension. Learn how to read and interpret data to Azure Synapse Analytics formerly SQL. Spreadsheets you an use the odbc load command to import the road see D odbc Currently. Azure Cosmos DB real-time data movement using Change. High end Big Data processing batch streaming machine learning graph Upload a single TSV file containing the details of project to 500 individual.
    [Show full text]
  • Extended Version
    Sina Sheikholeslami C u rriculum V it a e ( Last U pdated N ovember 2 0 18) Website: http://sinash.ir Present Address : https://www.kth.se/profile/sinash EIT Digital Stockholm CLC , https://linkedin.com/in/sinasheikholeslami Isafjordsgatan 26, Email: si [email protected] 164 40 Kista (Stockholm), [email protected] Sweden [email protected] Educational Background: • M.Sc. Student of Data Science, Eindhoven University of Technology & KTH Royal Institute of Technology, Under EIT-Digital Master School. 2017-Present. • B.Sc. in Computer Software Engineering, Department of Computer Engineering and Information Technology, Amirkabir University of Technology (Tehran Polytechnic). 2011-2016. • Mirza Koochak Khan Pre-College in Mathematics and Physics, Rasht, National Organization for Development of Exceptional Talents (NODET). Overall GPA: 19.61/20. 2010-2011. • Mirza Koochak Khan Highschool in Mathematics and Physics, Rasht, National Organization for Development of Exceptional Talents (NODET). Overall GPA: 19.17/20, Final Year's GPA: 19.66/20. 2007-2010. Research Fields of Interest: • Distributed Deep Learning, Hyperparameter Optimization, AutoML, Data Intensive Computing Bachelor's Thesis: • “SDMiner: A Tool for Mining Data Streams on Top of Apache Spark”, Under supervision of Dr. Amir H. Payberah (Defended on June 29th 2016). Computer Skills: • Programming Languages & Markups: o F luent in Java, Python, Scala, JavaS cript, C/C++, A ndroid Pr ogram Develop ment o Familia r wit h R, SAS, SQL , Nod e.js, An gula rJS, HTM L, JSP •
    [Show full text]
  • Mapreduce Service
    MapReduce Service Troubleshooting Issue 01 Date 2021-03-03 HUAWEI TECHNOLOGIES CO., LTD. Copyright © Huawei Technologies Co., Ltd. 2021. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd. Trademarks and Permissions and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders. Notice The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied. The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied. Issue 01 (2021-03-03) Copyright © Huawei Technologies Co., Ltd. i MapReduce Service Troubleshooting Contents Contents 1 Account Passwords.................................................................................................................. 1 1.1 Resetting
    [Show full text]
  • Poweredge R640 Apache Hadoop
    A Principled Technologies report: Hands-on testing. Real-world results. The science behind the report: Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC PowerEdge R640 servers This document describes what we tested, how we tested, and what we found. To learn how these facts translate into real-world benefits, read the report Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC PowerEdge R640 servers. We concluded our hands-on testing on October 27, 2019. During testing, we determined the appropriate hardware and software configurations and applied updates as they became available. The results in this report reflect configurations that we finalized on October 15, 2019 or earlier. Unavoidably, these configurations may not represent the latest versions available when this report appears. Our results The table below presents the throughput each solution delivered when running the HiBench workloads. Dell EMC™ PowerEdge™ R640 Dell EMC PowerEdge R630 Percentage more throughput solution solution Latent Dirichlet Allocation 4.13 1.94 112% (MB/sec) Random Forest (MB/sec) 100.66 94.43 6% WordCount (GB/sec) 5.10 3.45 47% The table below presents the minutes each solution needed to complete the HiBench workloads. Dell EMC PowerEdge R640 Dell EMC PowerEdge R630 Percentage less time solution solution Latent Dirichlet Allocation 17.11 36.25 52% Random Forest 5.55 5.92 6% WordCount 4.95 7.32 32% Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC PowerEdge R640 servers November 2019 System configuration information The table below presents detailed information on the systems we tested.
    [Show full text]
  • HDP 3.1.4 Release Notes Date of Publish: 2019-08-26
    Release Notes 3 HDP 3.1.4 Release Notes Date of Publish: 2019-08-26 https://docs.hortonworks.com Release Notes | Contents | ii Contents HDP 3.1.4 Release Notes..........................................................................................4 Component Versions.................................................................................................4 Descriptions of New Features..................................................................................5 Deprecation Notices.................................................................................................. 6 Terminology.......................................................................................................................................................... 6 Removed Components and Product Capabilities.................................................................................................6 Testing Unsupported Features................................................................................ 6 Descriptions of the Latest Technical Preview Features.......................................................................................7 Upgrading to HDP 3.1.4...........................................................................................7 Behavioral Changes.................................................................................................. 7 Apache Patch Information.....................................................................................11 Accumulo...........................................................................................................................................................
    [Show full text]
  • Desarrollo De Una Solución Business Intelligence Mediante Un Paradigma De Data Lake
    Desarrollo de una solución Business Intelligence mediante un paradigma de Data Lake José María Tagarro Martí Grado de Ingeniería Informática Consultor: Humberto Andrés Sanz Profesor: Atanasi Daradoumis Haralabus 13 de enero de 2019 Esta obra está sujeta a una licencia de Reconocimiento-NoComercial-SinObraDerivada 3.0 España de Creative Commons FICHA DEL TRABAJO FINAL Desarrollo de una solución Business Intelligence mediante un paradigma de Data Título del trabajo: Lake Nombre del autor: José María Tagarro Martí Nombre del consultor: Humberto Andrés Sanz Fecha de entrega (mm/aaaa): 01/2019 Área del Trabajo Final: Business Intelligence Titulación: Grado de Ingeniería Informática Resumen del Trabajo (máximo 250 palabras): Este trabajo implementa una solución de Business Intelligence siguiendo un paradigma de Data Lake sobre la plataforma de Big Data Apache Hadoop con el objetivo de ilustrar sus capacidades tecnológicas para este fin. Los almacenes de datos tradicionales necesitan transformar los datos entrantes antes de ser guardados para que adopten un modelo preestablecido, en un proceso conocido como ETL (Extraer, Transformar, Cargar, por sus siglas en inglés). Sin embargo, el paradigma Data Lake propone el almacenamiento de los datos generados en la organización en su propio formato de origen, de manera que con posterioridad puedan ser transformados y consumidos mediante diferentes tecnologías ad hoc para las distintas necesidades de negocio. Como conclusión, se indican las ventajas e inconvenientes de desplegar una plataforma unificada tanto para análisis Big Data como para las tareas de Business Intelligence, así como la necesidad de emplear soluciones basadas en código y estándares abiertos. Abstract (in English, 250 words or less): This paper implements a Business Intelligence solution following the Data Lake paradigm on Hadoop’s Big Data platform with the aim of showcasing the technology for this purpose.
    [Show full text]
  • TR-4744: Secure Hadoop Using Apache Ranger with Netapp In
    Technical Report Secure Hadoop using Apache Ranger with NetApp In-Place Analytics Module Deployment Guide Karthikeyan Nagalingam, NetApp February 2019 | TR-4744 Abstract This document introduces the NetApp® In-Place Analytics Module for Apache Hadoop and Spark with Ranger. The topics covered in this report include the Ranger configuration, underlying architecture, integration with Hadoop, and benefits of Ranger with NetApp In-Place Analytics Module using Hadoop with NetApp ONTAP® data management software. TABLE OF CONTENTS 1 Introduction ........................................................................................................................................... 4 1.1 Overview .........................................................................................................................................................4 1.2 Deployment Options .......................................................................................................................................5 1.3 NetApp In-Place Analytics Module 3.0.1 Features ..........................................................................................5 2 Ranger ................................................................................................................................................... 6 2.1 Components Validated with Ranger ................................................................................................................6 3 NetApp In-Place Analytics Module Design with Ranger..................................................................
    [Show full text]
  • Hortonworks Data Platform
    Hortonworks Data Platform Apache Ambari Installation for IBM Power Systems (November 15, 2018) docs.cloudera.com Hortonworks Data Platform November 15, 2018 Hortonworks Data Platform: Apache Ambari Installation for IBM Power Systems Copyright © 2012-2018 Hortonworks, Inc. Some rights reserved. The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing and analyzing large volumes of data. It is designed to deal with data from many sources and formats in a very quick, easy and cost-effective manner. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process and installation and configuration tools have also been included. Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training and partner-enablement services. All of our technology is, and will remain free and open source. Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. Feel free to Contact Us directly to discuss your specific needs. Except where otherwise noted, this document is licensed under Creative Commons Attribution ShareAlike 4.0 License.
    [Show full text]
  • Real-Time Data Analytics with Apache Druid Correa Bosco Hilary Department of Information Technology, (Msc
    IJARSCT ISSN (Online) 2581-9429 International Journal of Advanced Research in Science, Communication and Technology (IJARSCT) Volume 5, Issue 2, May 2021 Impact Factor: 4.819 Real-Time Data Analytics with Apache Druid Correa Bosco Hilary Department of Information Technology, (MSc. IT Part 1) Sir Sitaram and Lady Shantabai Patkar College of Arts and Science, Mumbai, India Abstract: The shift towards real-time data flow has a major impact on the way applications are designed and on the work of data engineers. Dealing with real-time data ingestion brings a paradigm shift and an added layer of challenges compared to traditional integration and processing methods. There are real benefits to leveraging real-time data, but it requires specialized considerations in setting up the ingestion, processing, storing, and serving of that data. It brings about specific operational needs and a change in the way data engineers work. These should be considered while embarking on a real- time journey. In this paper we are going to see real time data analytics with apache druid. Apache Druid (incubating) performant analytics data store for event-driven data .Druid’s core design combines ideas from OLAP/analytic databases, time series databases, and search systems to create a unified system for operational analytics. Keywords: Distributed, Real-time, Fault-tolerant, Highly Available, Open Source, Analytics, Column- oriented, Olap, Apache Druid I. INTRODUCTION Streaming data integration is the foundation for streaming analytics. Specific use cases such as IoT devices log, contextual marketing triggers, Dynamic pricing all rely on using a data feed or real-time data. If you cannot source the data in real-time, there is very little value to be gained in attempting to tackle these use cases.
    [Show full text]
  • Modern Technologies of Bigdata Analytics: Case Study on Hadoop Platform Dharminder Yadav1, Umesh Chandra2
    International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: [email protected] Volume 6, Issue 4, July- August 2017 ISSN 2278-6856 Modern Technologies of BigData Analytics: Case study on Hadoop Platform Dharminder Yadav1, Umesh Chandra2 1Research Scholar, Computer Science Department, Glocal University, Saharanpur, UP, India 2PhD, Assistant Professor, Computer Science Department, Glocal University, Saharanpur, UP, India Abstract exchange, banking, on-line and on-site procuring [2]. Data is growing in the worldwide by daily activities, by using the Enormous Information as an examination subject from a hand-held devices, the Internet, and social media sites.This few focuses course of events, paper main discusses about data processing by using various geographic yield, disciplinary output, types of distributed tool of Hadoop.This present study cover most of the tools used papers, topical and theoretical advancement. The Big Data in Hadoop that help in parallel processing and MapReduce. challenges define in 6V's that are variety, velocity, volume, The day since BigData term introduced to database world , Hadoop act like a savior for most of the large, small value, veracity, and volatility [23]. organization. Researchers will definitely found a way through Hadoop to work huge data concept and most of the researchers are being done in the field of BigData analytics and data mining with the help of Hadoop. Keywords— Big Data, Hadoop, HDFS (Hadoop Distributed File System), NOSQL 1.INTRODUCTION Big Data provide storage and data processing facilities to Cloud computing [26]. Big data comes around 2005 but now it is used everywhere in daily life, which alludes to an expansive scope of informational collections practically difficult to manage, handle and prepare utilizing accessible Volume: Data is growing exponentially by daily activities regular apparatuses and information administration which we handle.
    [Show full text]
  • Building File System Semantics for an Exabyte Scale Object Storage System
    Building File System Semantics for an Exabyte Scale Object Storage System Shane Mainali Raji Easwaran Microsoft 2019 Storage Developer Conference. © Microsoft. All Rights Reserved. 1 Agenda . Analytics Workloads (Access patterns & challenges) . Azure Data Lake Storage overview . Under the hood . Q&A 2019 Storage Developer Conference. © Microsoft. All Rights Reserved. 2 Analytics Workloads Access Patterns and Challenges Analytics Workload Pattern Cosmos DB INGEST EXPLORE PREP & TRAIN MODEL & SERVE Sensors and IoT (unstructured) Real-time Apps Logs (unstructured) Media (unstructured) SQL Data Warehouse Azure Databricks Azure SQL Azure Data Factory Azure Databricks Data Warehouse Azure Data Explorer Azure Analysis Files (unstructured) Services Business/custom apps STORE (structured) Azure Data Lake Storage Gen2 Power BI 2019 Storage Developer Conference. © Microsoft. All Rights Reserved. 4 Challenges - Containers are mounted as filesystems on Analytics Engines like Hadoop and Databricks - Client-side file system emulation impacts performance, semantics, and correctness - Directory operations are expensive - Coarse grained Access Control - Throughput is critical for Big Data 2019 Storage Developer Conference. © Microsoft. All Rights Reserved. 5 Storage for Analytics - Goals . Address shortcomings of client-side design . First-class hierarchical namespace . Interoperability with Object Storage (Blobs) . Object-level ACLs (POSIX) . Platform for future filesystem-based protocols (e.g. NFS) 2019 Storage Developer Conference. © Microsoft. All Rights
    [Show full text]