Real-time Data Infrastructure at Uber
Yupeng Fu ([email protected]), Uber, Inc.
Chinmay Soman ([email protected]), Uber, Inc.

ABSTRACT

Uber's business is highly real-time in nature. PBs of data is continuously being collected from end users such as Uber drivers, riders, restaurants, eaters and so on every day. There is a lot of valuable information to be processed and many decisions must be made in seconds for a variety of use cases such as customer incentives, fraud detection, and machine learning model prediction. In addition, there is an increasing need to expose this ability to different user categories, including engineers, data scientists, executives and operations personnel, which adds to the complexity.

In this paper, we present the overall architecture of the real-time data infrastructure and identify three scaling challenges that we need to continuously address for each component in the architecture. At Uber, we heavily rely on open source technologies for the key areas of the infrastructure. On top of that open-source software, we add significant improvements and customizations to make the open-source solutions fit Uber's environment and bridge the gaps to meet Uber's unique scale and requirements.

We then highlight several important use cases and show their real-time solutions and tradeoffs. Finally, we reflect on the lessons we learned as we built, operated and scaled these systems.

CCS CONCEPTS

• Information systems → Stream management; • Computer systems organization → Real-time system architecture.

KEYWORDS

Real-time Infrastructure; Streaming Processing

ACM Reference Format:
Yupeng Fu and Chinmay Soman. 2021. Real-time Data Infrastructure at Uber. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21), June 20–25, 2021, Virtual Event, China. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3448016.3457552

1 INTRODUCTION

A lot of real-time data is generated within Uber's data centers. This data originates in different sources such as end user applications (driver/rider/eater) or the backend microservices. Some of this data consists of application or system logs continuously emitted as part of day-to-day operation. A lot of services also emit special events for tracking things such as trip updates, driver status change, order cancellation and so on. Some of it is also derived from the OnLine Transactional Processing (OLTP) database changelog used internally by such microservices. As of October 2020, trillions of messages and petabytes of such data were generated per day across all regions.

Real-time data processing plays a critical role in Uber's technology stack and it empowers a wide range of use cases. At a high level, real-time data processing needs within Uber consist of three broad areas: 1) a messaging platform that allows communication between asynchronous producers and subscribers, 2) stream processing that allows applying computational logic on top of such streams of messages, and 3) OnLine Analytical Processing (OLAP) that enables analytical queries over all this data in near real time. Each area has to deal with three fundamental scaling challenges within Uber:

• Scaling data: The total incoming real-time data volume has been growing exponentially at a rapid rate year over year, produced by several thousands of microservices. In addition, Uber deploys its infrastructure in several geographical regions for high availability, which adds a multiplication factor in terms of handling aggregate data. Each real-time processing system has to handle this data volume increase while maintaining SLAs around data freshness, end-to-end latency and availability.

• Scaling use cases: As Uber's business grows, new use cases emerge from various business verticals and groups. Different parts of the organization have varying requirements for the real-time data systems, which are often competing in nature. For instance, dynamic pricing [31] for a given Uber product (such as rides or eats) is a highly complex real-time workflow involving multi-stage stream processing pipelines that run various machine learning algorithms along with a fast key-value store. This system is designed to favor freshness and availability over data consistency, and it's implemented entirely by engineers. On the other hand, monitoring real-time business metrics around orders and sales requires a SQL-like interface used by data scientists, with more emphasis given to data completeness.

• Scaling users: The diverse users interacting with the real-time data system fall on a big spectrum of technical skills, from operations personnel who have no engineering background to advanced users capable of orchestrating complex real-time computational data pipelines. As the personnel of Uber grows, the platform teams also face increasing challenges from user-imposed complexities, such as safe client-side version upgrades for a large number of applications and managing an increasing number of user requests.

In short, the biggest challenge is to build a unified platform with standard abstractions that can work for all such varied use cases and users at scale instead of creating custom solutions. A key decision made by us to overcome such challenges is to adopt open source solutions in building this unified platform. Open source software adoption has many advantages such as development velocity, cost effectiveness, as well as the power of the crowd. Given the scale and rapid development cycles in Uber, we had to pick technologies that were mature enough to scale with Uber's data as well as extensible enough for us to integrate into our unified real-time data stack.

Figure 1 depicts the high-level flow of data inside Uber's infrastructure. Various kinds of analytical data are continuously collected from Uber's data centers across multiple regions. These streams of raw data form the source of truth for all analytics at Uber. Most of these streams are incrementally archived in batch processing systems and ingested into the data warehouse. This is then made available for machine learning and other data science use cases. The Real Time Data Infra component continuously processes such data streams for powering a variety of mission-critical use cases such as dynamic pricing (Surge), intelligent alerting, operational dashboards and so on. This paper focuses on the real-time data ecosystem.

Figure 1: The high-level data flow at Uber infrastructure

The paper is organized as follows. In Section 2, we list the requirements derived from the use cases which are used to guide the design decisions for various real-time use cases. In Section 3, we provide an overview of the high-level abstractions of the real-time data infrastructure at Uber.

The requirements derived from these use cases are often competing with those of other use cases. The different requirements generally include the following aspects:

• Consistency: Mission-critical applications such as financial dashboards require data to be consistent across all regions. This includes zero data loss in the inter-region and intra-region dispersal and processing mechanisms, de-duplication, as well as the ability to certify data quality.

• Availability: The real-time data infrastructure stack must be highly available with a 99.99 percentile guarantee. Loss of availability has a direct impact on Uber's business and may result in significant financial losses. For instance, dynamic pricing leverages the real-time data infrastructure component for calculating demand and supply ratios per geo-fence, which in turn is used to influence the price of a trip or UberEats delivery.

• Data Freshness: Most of the use cases require seconds-level freshness. In other words, a given event or log record must be available for processing or querying seconds after it has been produced. This is a critical requirement to ensure the ability to respond to certain events such as security incidents, demand-supply skews, business metric alerts and so on.

• Query latency: Some use cases need the ability to execute queries on the raw data stream and require the p99 query latency to be under 1 second. For instance, site-facing or external analytical tools such as UberEats Restaurant Manager [53] will execute several analytical queries for each page load. Each such query must be very fast to provide a good experience for the restaurant owner.

• Scalability: The raw data streams constitute petabytes of data volume collected per day across all regions. This data is constantly growing based on organic growth of our user base, new lines of business deployed by Uber, as well as new real-time analytics use cases that arise over time. The ability to scale with this ever-growing data set in a seamless manner, without requiring users to re-architect the processing pipelines, is a fundamental requirement of the real-time data infrastructure stack.

• Cost: Uber is a low-margin business. We need to ensure the cost of data processing and serving is low and ensure high operational efficiency. This influences a variety of design decisions.
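The dynamic pricing example above hinges on one core stream computation: maintaining demand and supply counts per geo-fence and deriving a ratio from them. The following is a minimal, hypothetical Python sketch of that idea, not Uber's implementation; the event schema and field names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sketch: compute a demand/supply ratio per geo-fence from a
# stream of rider and driver events, the kind of signal that feeds dynamic pricing.
def demand_supply_ratios(events):
    demand = defaultdict(int)   # ride requests per geo-fence
    supply = defaultdict(int)   # available drivers per geo-fence
    for event in events:        # each event: {"type": ..., "geo_fence": ...}
        if event["type"] == "ride_requested":
            demand[event["geo_fence"]] += 1
        elif event["type"] == "driver_available":
            supply[event["geo_fence"]] += 1
    return {g: demand[g] / max(supply[g], 1) for g in demand}

stream = [
    {"type": "ride_requested", "geo_fence": "sf_downtown"},
    {"type": "driver_available", "geo_fence": "sf_downtown"},
    {"type": "ride_requested", "geo_fence": "sf_downtown"},
]
print(demand_supply_ratios(stream))  # {'sf_downtown': 2.0}
```

In production this logic would run inside a stream processor over windows of events rather than over a finite list, but the shape of the computation is the same.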
Recommended publications
  • Beyond Relational Databases
EXPERT ANALYSIS BY MARCOS ALBE, SUPPORT ENGINEER, PERCONA

Beyond Relational Databases: A Focus on Redis, MongoDB, and ClickHouse

Many of us use and love relational databases… until we try and use them for purposes which aren't their strong point. Queues, caches, catalogs, unstructured data, counters, and many other use cases can be solved with relational databases, but are better served by alternative options. In this expert analysis, we examine the goals, pros and cons, and the good and bad use cases of the most popular alternatives on the market, and look into some modern open source implementations.

Developers frequently choose the backend store for the applications they produce. Amidst dozens of options, buzzwords, industry preferences, and vendor offers, it's not always easy to make the right choice… even with a map!
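To make the "queues and counters" point concrete, here is a minimal sketch using the redis-py client; it assumes a Redis server running on localhost, and the key names are hypothetical:

```python
import redis  # assumes the redis-py package and a local Redis server

r = redis.Redis(host="localhost", port=6379)

# A simple work queue: producers push jobs, workers pop them.
r.lpush("jobs", "encode:video-123")
job = r.rpop("jobs")            # b'encode:video-123'

# A counter, which Redis increments atomically.
r.incr("page_views:home")
print(job, r.get("page_views:home"))
```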
  • Advanced Model Deployments with TensorFlow Serving (presentation)
Most models don't get deployed. Hi, I'm Hannes.

An inefficient model deployment:

    import json
    from flask import Flask, request
    from keras.models import load_model
    from utils import preprocess

    model = load_model('model.h5')
    app = Flask(__name__)

    @app.route('/classify', methods=['POST'])
    def classify():
        review = request.form["review"]
        preprocessed_review = preprocess(review)
        prediction = model.predict_classes([preprocessed_review])[0]
        return json.dumps({"score": int(prediction)})

Why Flask is insufficient:
● No consistent APIs
● No consistent payloads
● No model versioning
● No mini-batching support
● Inefficient for large models

TensorFlow Serving: production-ready model serving.
● Part of the TensorFlow Extended ecosystem
● Used internally at Google
● Highly scalable model serving solution
● Works well for large models up to 2GB
TensorFlow 2.0 ready! (with small exceptions)

Deploy your models in 90s... Export your model:

    import tensorflow as tf

    tf.saved_model.save(
        model,
        export_dir="/tmp/saved_model",
        signatures=None
    )

TensorFlow 2.0 export:
● Consistent model export
● Using the Protobuf format
● Export of graphs and estimators possible

    $ tree saved_models/
    saved_models/
    └── 1555875926
        ├── assets

● Exported model as Protobuf (saved_model.pb)
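Once a SavedModel is served by TensorFlow Serving, it can be queried over the server's standard REST predict endpoint (/v1/models/<name>:predict, port 8501 by default). The sketch below is illustrative and not part of the original slides; the model name and input vector are hypothetical:

```python
import json
import requests  # assumes the requests package and a running TensorFlow Serving instance

# TensorFlow Serving's REST predict endpoint: /v1/models/<name>:predict
url = "http://localhost:8501/v1/models/sentiment:predict"
payload = {"instances": [[0.1, 0.4, 0.3]]}  # hypothetical preprocessed review vector

response = requests.post(url, data=json.dumps(payload))
print(response.json())  # e.g. {"predictions": [[0.87]]}
```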
  • Smart Anomaly Detection and Prediction for Assembly Process Maintenance in Compliance with Industry 4.0
Sensors article: Smart Anomaly Detection and Prediction for Assembly Process Maintenance in Compliance with Industry 4.0. Pavol Tanuska*, Lukas Spendla, Michal Kebisek, Rastislav Duris and Maximilian Stremy, Faculty of Materials Science and Technology in Trnava, Slovak University of Technology in Bratislava, 917 24 Trnava, Slovakia. *Correspondence: [email protected]; Tel.: +421-918-646-061.

Abstract: One of the big problems of today's manufacturing companies is the risk of unexpected assembly line cessation. Although planned and well-performed maintenance will significantly reduce many of these risks, there are still anomalies that cannot be resolved within standard maintenance approaches. In our paper, we aim to solve the problem of accidental carrier bearing damage on an assembly conveyor. Sometimes the bearing of one of the carrier wheels seizes, causing the conveyor, and of course the whole assembly process, to halt. Applying standard approaches in this case does not bring any visible improvement. Therefore, it is necessary to propose and implement a unique approach that incorporates Industrial Internet of Things (IIoT) devices, neural networks, and sound analysis for the purpose of predicting anomalies. This proposal uses the mentioned approaches in such a way that the gradual integration eliminates the disadvantages of individual approaches while highlighting and preserving the benefits of our solution. As a result, we have created and deployed a smart system that is able to detect and predict arising anomalies and achieve a significant reduction in unexpected production cessation.
  • An Evaluation of TensorFlow as a Programming Framework for HPC Applications
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS, STOCKHOLM, SWEDEN 2018. An Evaluation of TensorFlow as a Programming Framework for HPC Applications. WEI DER CHIEN, Master in Computer Science. Date: August 28, 2018. Supervisor: Stefano Markidis. Examiner: Erwin Laure. Swedish title: En undersökning av TensorFlow som ett utvecklingsramverk för högpresterande datorsystem. KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science.

Abstract: In recent years deep learning, a branch of machine learning, has gained increasing popularity due to its extensive applications and performance. At the core of these applications is dense matrix-matrix multiplication. Graphics Processing Units (GPUs) are commonly used in the training process due to their massively parallel computation capabilities. In addition, specialized low-precision accelerators have emerged to specifically address Tensor operations. Software frameworks, such as TensorFlow, have also emerged to increase the expressiveness of neural network model development. In TensorFlow, computation problems are expressed as computation graphs where nodes of a graph denote operations and edges denote data movement between operations. With an increasing number of heterogeneous accelerators which might co-exist on the same cluster system, it becomes increasingly difficult for users to program efficient and scalable applications. TensorFlow provides a high level of abstraction, and it is possible to place operations of a computation graph on a device easily through a high-level API. In this work, the usability of TensorFlow as a programming framework for HPC applications is reviewed.
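To illustrate the device-placement API referred to above, here is a small TensorFlow 2.x sketch; the device string assumes a GPU is visible and is otherwise just an example:

```python
import tensorflow as tf

# Place a dense matrix-matrix multiplication explicitly on a device.
# Change "/GPU:0" to "/CPU:0" if no GPU is available on the machine.
with tf.device("/GPU:0"):
    a = tf.random.normal([1024, 1024])
    b = tf.random.normal([1024, 1024])
    c = tf.matmul(a, b)

print(c.device, c.shape)
```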
  • The Functionality of the SQL SELECT Clause
The Functionality of the SQL SELECT Clause. Capabilities of the SELECT statement: data retrieval from a database is done through appropriate and efficient use of SQL, drawing on three concepts from relational theory. The SELECT statement is probably the most important SQL command; it is used to return results from our databases. The SELECT clause specifies the columns of the final result table, and aggregate functions such as AVG, the DISTINCT keyword, date strings and TIMESTAMP data types can all appear in a query. The SQL standard requires that a query reference only columns in the GROUP BY clause or columns used in aggregate functions; MySQL, however, relaxes this rule. The CREATE TABLE AS statement provides a superset of the functionality offered by the SELECT INTO statement, and SELECT can be combined with nested subqueries and CASE/ELSE logic.
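A small, self-contained illustration of these clauses (SELECT with an aggregate, GROUP BY, and DISTINCT) using Python's built-in sqlite3 module and a hypothetical books table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, genre TEXT, price REAL)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [("A", "sql", 30.0), ("B", "sql", 50.0), ("C", "nosql", 40.0)],
)

# Every non-aggregated column in the SELECT list also appears in GROUP BY,
# as the SQL standard requires.
for genre, avg_price in conn.execute(
    "SELECT genre, AVG(price) FROM books GROUP BY genre"
):
    print(genre, avg_price)

print(conn.execute("SELECT DISTINCT genre FROM books").fetchall())
```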
  • See the Schema of a Table in MySQL
See the Schema of a Table in MySQL. DESCRIBE essentially displays the structure of a MySQL table: its fields, data types and keys. A schema in SQL is a collection of database objects linked with a particular database, and you can create a table with the same schema as another table or query result. The INFORMATION_SCHEMA.TABLES view contains a row for each table or view in a dataset; its columns include table_schema, which stores the schema or database of the table or view, and table_name. You can therefore list the tables and views of a database by querying the INFORMATION_SCHEMA database, and the data dictionary also exposes foreign keys and views. To retrieve the rows themselves, refer to "Fetch rows from MySQL table in Python". Maintaining tables in this way is one of the preventative MySQL database maintenance tasks.
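As an illustration, the INFORMATION_SCHEMA.TABLES view can be queried from Python with the mysql-connector-python package; the connection details and database name below are hypothetical:

```python
import mysql.connector  # assumes the mysql-connector-python package is installed

# Hypothetical connection parameters, for illustration only.
conn = mysql.connector.connect(host="localhost", user="app", password="secret")
cur = conn.cursor()

# List every table and view in a given schema via INFORMATION_SCHEMA.
cur.execute(
    "SELECT table_name, table_type "
    "FROM information_schema.tables "
    "WHERE table_schema = %s",
    ("my_database",),
)
for table_name, table_type in cur:
    print(table_name, table_type)

cur.close()
conn.close()
```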
  • Using SQLancer to Test ClickHouse and Other Database Systems
Using SQLancer to test ClickHouse and other database systems. Manuel Rigger (@RiggerManuel), Ilya Yatsishin (qoega).

Plan:
▎ What is ClickHouse and why do we need good testing?
▎ How do we test ClickHouse and what problems do we have to solve?
▎ What is SQLancer and what are the ideas behind it?
▎ How to add support for yet another DBMS to SQLancer?

ClickHouse is an open source analytical DBMS for big data with a SQL interface:
• Blazingly fast
• Scalable
• Fault tolerant
The project started in 2013, was open sourced in 2016, and reached 15K GitHub stars by 2021 (https://github.com/ClickHouse/clickhouse).

Why do we need good CI? In 2020 there were 361 contributors, 4081 merged pull requests, 261 new features and 11 releases, with a core team of fewer than 15 people, and the number of pull requests keeps growing. All GitHub data is available in ClickHouse (https://gh.clickhouse.tech/explorer/, https://gh-api.clickhouse.tech/play).

How is ClickHouse tested?
• Style, unit, functional, integration, performance and stress tests
• Flaky checks for new or changed tests
• All tests are run with sanitizers (address, memory, thread, undefined behavior)
• Thread fuzzing (switch running threads randomly)
• Static analysis (clang-tidy, PVS-Studio)
• Coverage
• Compatibility with OS versions
• Fuzzing

100K tests is not enough. Fuzzing:
▎ libFuzzer to test generic data inputs – formats, schema etc.
▎ Thread Fuzzer – randomly switch threads to trigger races and dead locks (https://presentations.clickhouse.tech/cpp_siberia_2021/)
▎ AST Fuzzer – mutate queries on the AST level: use SQL queries from all tests as input, mix them, apply high-level mutations, change query settings (https://clickhouse.tech/blog/en/2021/fuzzing-clickhouse/)

Test development steps: common tests, instrumentation, fuzzing; the next step is to find bugs earlier.
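The AST-fuzzing idea (take queries from existing tests, mix them, apply high-level mutations, change query settings) can be sketched as a toy at the string level. SQLancer itself operates on parsed ASTs and is written in Java; everything below is a simplified illustration only, with invented seed queries:

```python
import random

# Seed queries, as if taken from an existing test suite (hypothetical).
seed_queries = [
    "SELECT count() FROM hits WHERE region = 'eu'",
    "SELECT user_id, sum(spend) FROM orders GROUP BY user_id",
]

def mutate(query):
    # High-level mutations: append random settings, wrap in a subquery, add LIMIT.
    choice = random.choice(["settings", "subquery", "limit"])
    if choice == "settings":
        return query + " SETTINGS max_threads = " + str(random.randint(1, 16))
    if choice == "subquery":
        return "SELECT * FROM (" + query + ")"
    return query + " LIMIT " + str(random.randint(0, 100))

for _ in range(3):
    print(mutate(random.choice(seed_queries)))
```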
  • MapReduce Service
MapReduce Service: Troubleshooting. Issue 01, Date 2021-03-03. HUAWEI TECHNOLOGIES CO., LTD.

Copyright © Huawei Technologies Co., Ltd. 2021. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions: Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice: The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied. The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.

Contents: 1 Account Passwords; 1.1 Resetting…
  • From XML to Flat Buffers: Markup in the Twenty-Teens
Elliotte Rusty Harold, [email protected], August 2018. From XML to Flat Buffers: Markup in the Twenty-Teens.

The contenders: XML, JSON, YAML, EXI, Protobufs, Flat Buffers.

[Slide: "What Uses What", a matrix of technologies (App Engine Standard Java, App Engine Flex, Kubernetes, Eclipse, Maven, Ant, Google "APIs", Publishing) against the formats each one uses (XML, JSON, YAML, EXI, Protobuf, Flat Buffers), drawn from technology, tools, and systems the author uses frequently; there are many others.]

XML:
● Very well defined standard
● By far the most general format: mixed content, attributes and elements
● By far the best tool support, and nothing else is close: XSLT, XPath, many schema languages (W3C XSD, RELAX NG)

More reasons to choose XML:
● Most composable for mixing and matching markup; e.g. MathML+SVG in HTML
● Does not require a schema
● Streaming support: very large documents
● Better for interchange amongst unrelated parties
● The deeper your needs, the more likely you'll end up here

Why not XML?
● Relatively complex for simple tasks
● Limited to no support for non-string programming types: numbers, booleans, dates, money, etc.; lists, maps, sets. You can encode all these, but APIs don't necessarily recognize or support them
● Lots of sharp edges to surprise the non-expert: 9/10 are namespace related, attribute value normalization, white space
● Some security issues if you're not careful (billion laughs)

JSON:
● Simple for object serialization and program data. If your data is a few basic types (int, string, boolean, float) and data structures (list, map), this works well
● More or less standard (7-8 of them, in fact)
● Consumption libraries for essentially all significant languages

Why not JSON?
● It is surprising how fast needs grow past a few basic types and data structures.
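The JSON points above are easy to demonstrate: basic types and structures serialize out of the box, but needs grow past them quickly (dates, for example, require a custom encoder). A minimal sketch:

```python
import json
from datetime import date

# Basic types and structures serialize directly.
record = {"id": 42, "name": "reader", "active": True, "scores": [1.5, 2.0]}
print(json.dumps(record))

# A date, however, is not serializable by default.
try:
    json.dumps({"published": date(2018, 8, 1)})
except TypeError as err:
    print("needs a custom encoder:", err)
```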
  • Trifacta Data Preparation for Amazon Redshift and S3 Must Be Deployed Into an Existing Virtual Private Cloud (VPC)
Install Guide for Data Preparation for Amazon Redshift and S3. Version: 7.1. Doc Build Date: 05/26/2020. Copyright © Trifacta Inc. 2020, All Rights Reserved. CONFIDENTIAL.

These materials (the "Documentation") are the confidential and proprietary information of Trifacta Inc. and may not be reproduced, modified, or distributed without the prior written permission of Trifacta Inc. EXCEPT AS OTHERWISE PROVIDED IN AN EXPRESS WRITTEN AGREEMENT, TRIFACTA INC. PROVIDES THIS DOCUMENTATION AS-IS AND WITHOUT WARRANTY AND TRIFACTA INC. DISCLAIMS ALL EXPRESS AND IMPLIED WARRANTIES TO THE EXTENT PERMITTED, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT AND FITNESS FOR A PARTICULAR PURPOSE AND UNDER NO CIRCUMSTANCES WILL TRIFACTA INC. BE LIABLE FOR ANY AMOUNT GREATER THAN ONE HUNDRED DOLLARS ($100) BASED ON ANY USE OF THE DOCUMENTATION. For third-party license information, please select About Trifacta from the Help menu.

Contents: 1. Quick Start (1.1 Install from AWS Marketplace; 1.2 Upgrade for AWS Marketplace); 2. Configure (2.1 Configure for AWS; 2.1.1 Configure for EC2 Role-Based Authentication; 2.1.2 Enable S3 Access; 2.1.2.1 Create Redshift Connections); 3. Contact Support; 4. Legal (4.1 Third-Party License Information).

Quick Start: Install from AWS Marketplace. Contents: Product Limitations, Internet access, Install, Desktop Requirements, Pre-requisites, Install Steps - CloudFormation template, SSH Access, Troubleshooting SELinux, Upgrade, Documentation, Related Topics. This guide steps through the requirements and process for installing Trifacta® Data Preparation for Amazon Redshift and S3 through the AWS Marketplace.
  • HDP 3.1.4 Release Notes Date of Publish: 2019-08-26
Release Notes: HDP 3.1.4. Date of Publish: 2019-08-26. https://docs.hortonworks.com

Contents: HDP 3.1.4 Release Notes; Component Versions; Descriptions of New Features; Deprecation Notices (Terminology; Removed Components and Product Capabilities); Testing Unsupported Features (Descriptions of the Latest Technical Preview Features); Upgrading to HDP 3.1.4; Behavioral Changes; Apache Patch Information (Accumulo…)
  • Development of a Business Intelligence Solution Using a Data Lake Paradigm (Desarrollo de una solución Business Intelligence mediante un paradigma de Data Lake)
Development of a Business Intelligence solution using a Data Lake paradigm. José María Tagarro Martí, Bachelor's Degree in Computer Engineering. Advisor: Humberto Andrés Sanz. Professor: Atanasi Daradoumis Haralabus. 13 January 2019. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Spain license.

Final project details. Title: Development of a Business Intelligence solution using a Data Lake paradigm. Author: José María Tagarro Martí. Advisor: Humberto Andrés Sanz. Submission date (mm/yyyy): 01/2019. Final project area: Business Intelligence. Degree: Bachelor's Degree in Computer Engineering.

Abstract (max. 250 words): This work implements a Business Intelligence solution following a Data Lake paradigm on the Apache Hadoop Big Data platform, with the aim of illustrating its technological capabilities for this purpose. Traditional data warehouses need to transform incoming data before storing it so that it conforms to a predefined model, in a process known as ETL (Extract, Transform, Load). The Data Lake paradigm, by contrast, proposes storing the data generated in the organization in its original source format, so that it can later be transformed and consumed with different ad hoc technologies according to the various business needs. As a conclusion, the work discusses the advantages and drawbacks of deploying a unified platform for both Big Data analytics and Business Intelligence tasks, as well as the need to rely on solutions based on open code and open standards.

Abstract (in English, 250 words or less): This paper implements a Business Intelligence solution following the Data Lake paradigm on Hadoop's Big Data platform with the aim of showcasing the technology for this purpose.