
ETL vs ELT vs eETL (entity-based ETL).


Table of contents

Leveraging analytic insights for business outcomes

Traditional ETL

The emergence of cloud technology and the rise of ELT

ETL vs. ELT

A new approach: entity-based ETL (eETL)

ETL vs. ELT vs. eETL

Bottom line – which is best for you?

About K2View


Leveraging analytic insights for business outcomes

According to Gartner, by 2022, only 20% of analytic insights will lead to business outcomes. Not having the right data is one of the key reasons for projects to fail. To start with, collecting, creating, and/or purchasing data is not easy. And even if you can get access to the data, you must still answer some serious questions, such as:

• Can the data be processed quickly and cost-effectively?
• Can the data be cleansed and masked?
• Can the data be secured and protected?
• Can the data be used, in ethical and legal terms?

In this whitepaper, we will review the common methods used for creating and preparing data for analytical purposes, and discuss the pros and cons of each approach.

Traditional ETL

The traditional approach to data integration, known as extract-transform-load (ETL), has been predominant since the 1970s. At its core, ETL is a standard process where data is collected from various sources (extracted), converted into a desired format (transformed), and then stored in its new destination (loaded). It is the industry standard among established organizations, and the acronym ETL is often used colloquially to describe integration activities in general.

The workflow that data engineers and analysts must perform to produce an ETL pipeline looks like this:

• Extract: The first step is to extract the data from its source systems. Data teams must decide how, and how often, to access each data source – whether via recurrent batch processes, real-time streaming, or triggers from specific events or actions.

• Transform: This step is about cleansing, formatting, and normalizing the data for storage in the target data lake or warehouse. The resultant data is used by reporting and analytics tools.

• Load: This step is about delivering the data into a data store, where applications and reporting tools can access it. The data store can be as simple as an unstructured text file, or as complex as a highly structured data warehouse, depending on the business requirements, applications, and user profiles involved.

ETL workflow

• Extract: Raw data is read and collected from various sources, including message queues, databases, flat files, spreadsheets, data streams, and event streams.
• Transform: Business rules are applied to clean the data, enrich it, anonymize it if necessary, and format it for analysis.
• Load: The transformed data is loaded into a big data store, such as a data warehouse, data lake, or non-relational database.
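To make the three steps concrete, here is a minimal sketch of a traditional ETL pipeline in Python. It assumes a customers.csv source file with customer_id, country, and email columns, and uses a local SQLite table to stand in for the data warehouse; all file, table, and column names are illustrative.

    # Minimal traditional ETL sketch: transformation happens in the pipeline,
    # before anything reaches the target store.
    import csv
    import hashlib
    import sqlite3

    def extract(path):
        # Extract: read raw rows from a flat-file source.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: cleanse, normalize, and anonymize before loading.
        out = []
        for row in rows:
            if not row.get("email"):  # cleanse: drop incomplete rows
                continue
            out.append({
                "customer_id": row["customer_id"].strip(),
                "country": row["country"].strip().upper(),  # normalize
                # anonymize: store a one-way hash instead of the raw email
                "email_hash": hashlib.sha256(row["email"].encode()).hexdigest(),
            })
        return out

    def load(rows, db_path="warehouse.db"):
        # Load: deliver the transformed rows into the target data store.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS customers "
                    "(customer_id TEXT, country TEXT, email_hash TEXT)")
        con.executemany(
            "INSERT INTO customers VALUES (:customer_id, :country, :email_hash)",
            rows)
        con.commit()
        con.close()

    load(transform(extract("customers.csv")))

Note that only transformed, anonymized rows ever reach the warehouse – which is also why adjusting the transformation later means re-engineering the pipeline itself.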


ETL steps and timeline

Traditional ETL has the following disadvantages:

• Smaller extractions: Heavy processing of data transformations (e.g., I/O- and CPU-intensive processing of high-volume data) often means having to compromise on smaller data extractions.

• Complexity: Traditional ETL is comprised of custom-coded programs and scripts, based on the specific needs of specific transformations. This means that the data engineering team must develop highly specialized, and often non-transferrable, skill sets to manage its code base.

• Cost and time consumption: Once set up, adjusting the ETL process can be both costly and time consuming, often requiring lengthy re-engineering cycles by highly skilled data engineers.

• Rigidity: Traditional ETL limits the agility of data scientists, who receive only the data that was transformed and prepared by the data engineers – as opposed to the entire pool of raw data – to work with.

• Legacy technology: Traditional ETL was primarily designed for periodic, batch migrations, was performed on premises, and does not support continuous data streaming. It is also extremely limited when it comes to real-time data processing, ingestion, or integration.

ETL has evolved quite a bit since the 1970s and 1980s, when the process was sequential, data was more static, systems were monolithic, and reporting was needed on a weekly or monthly basis.

The emergence of cloud technology and the rise of ELT

ELT stands for Extract-Load-Transform. Unlike traditional ETL, ELT extracts and loads the data into the target first, and then runs transformations – often proprietary scripts executed on the target data store. The target is most commonly a data lake or big data store, such as Teradata, Spark, or Hadoop.

ELT workflow

• Extract: Raw data is read and collected from various sources, including message queues, databases, flat files, spreadsheets, data streams, and event streams.
• Load: The extracted data is loaded into a data store, whether a data lake, a warehouse, or a non-relational database.
• Transform: Data transformations are performed in the data lake or warehouse, primarily using scripts.
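For contrast, here is the same flow as a minimal ELT sketch: the raw rows are landed in the target first, and the transformation runs afterwards as SQL inside the store. SQLite again stands in for a cloud warehouse, and the file, table, and column names are illustrative.

    # Minimal ELT sketch: load raw data first, transform inside the store.
    import csv
    import sqlite3

    con = sqlite3.connect("lake.db")

    # Extract + Load: land the raw data, untouched, in a staging table.
    con.execute("CREATE TABLE IF NOT EXISTS raw_customers "
                "(customer_id TEXT, country TEXT, email TEXT)")
    with open("customers.csv", newline="") as f:
        rows = [(r["customer_id"], r["country"], r["email"])
                for r in csv.DictReader(f)]
    con.executemany("INSERT INTO raw_customers VALUES (?, ?, ?)", rows)

    # Transform: run inside the data store, only when the data is needed.
    con.execute("""
        CREATE TABLE IF NOT EXISTS customers AS
        SELECT customer_id,
               UPPER(TRIM(country)) AS country
        FROM raw_customers
        WHERE email IS NOT NULL AND email <> ''
    """)
    con.commit()
    con.close()

Note that the raw, unmasked emails remain in the staging table – the compliance risk discussed below.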


ELT steps and timeline

ELT offers several advantages:

• Fast extraction and loading: Data is delivered into the target system immediately, with very little processing in flight.

• Lower upfront development costs: ELT tools are good at moving source data into target systems with minimal user intervention, since user-defined transformations are not required.

• Low maintenance: ELT was designed for use in the cloud, so things like schema changes can be fully automated.

• Greater flexibility: Data analysts no longer have to determine what insights and data types they need in advance, but can perform transformations on the data as needed in the data warehouse or lake.

• Greater trust: All the data, in its raw format, is available for exploration and analysis. No data is lost, or mis-transformed, along the way.

So, ELT has challenges of its own, including:

• Costly and time consuming: Data scientists need to match, clean, and transform the data before applying analytics.

• Compliance risks: With raw data being loaded into the data store, ELT, by nature, doesn't anonymize or mask the data, so compliance with privacy laws may be compromised.

• Bandwidth costs and risks: The movement of massive amounts of data from on-premise to cloud environments consumes high network bandwidth.

• Big data store requirement: ELT tools require a modern data staging technology, such as a data lake, where the data is loaded. Data teams then transform the data into a data warehouse, where it can be sliced and diced for reporting and analysis.

• Limited connectivity: ELT tools lack connectors to legacy and on-premise systems, although this is becoming less of an issue as ELT products mature and legacy systems are retired.

While the ability to transform data in the data store answers ELT's volume and scale limitations, it does not address the issue of data preparation, which is still very costly and time-consuming.

Data scientists, who are scarce, high-value company resources, need to match, clean, and transform the data – accounting for 40% of their time – before even engaging in any analytics.


ETL vs. ELT

The following comparison summarizes the main differences between ETL and ELT:

Process
• ETL: Data is extracted in bulk from sources, transformed, then loaded into a DWH/lake; typically batch.
• ELT: Raw data is extracted and loaded directly into a DWH/lake, where it is transformed; typically batch.

Primary use
• ETL: Smaller sets of structured data that require complex data transformation; offline, analytical workloads.
• ELT: Massive sets of structured and unstructured data; offline, analytical workloads.

Flexibility
• ETL: Rigid, requiring data pipelines to be scripted, tested, and deployed; difficult to adapt, costly to maintain.
• ELT: Data scientists and analysts have access to all the raw data; data is prepared for analytics when needed, using self-service tools.

Time to insights
• ETL: Slow – data engineers spend a lot of time building data pipelines.
• ELT: Slow – data scientists and analysts spend a lot of time preparing the data for analytics.

Compliance
• ETL: Anonymizes confidential and sensitive information before loading it to the target data store.
• ELT: With raw data loaded directly into the big data stores, there are greater chances of accidental data exposure and breaches.

Technology
• ETL: Mature and stable, used for 20+ years; supported by many tools.
• ELT: Comparatively new, with fewer data connectors and less advanced transformation capabilities; supported by fewer professionals and tools.

Bandwidth and computation costs
• ETL: Can be costly due to lengthy, high-scale, and complex data processing; high bandwidth costs for large data loads; can impact source systems when extracting large data sets.
• ELT: Can be very costly due to cloud-native data transformations; typically requires a staging area; high bandwidth costs for large data loads.


A new approach: entity-based ETL (eETL)

Entity-based ETL (eETL) is a new approach that addresses the limitations of both traditional ETL and ELT, delivering trusted, clean, and complete data that you can immediately use to generate insights.

eETL steps and timeline


An eETL tool pipelines data into a target data store by business entities rather than by database tables. Effectively, it applies the ETL (Extract-Transform-Load) process to a business entity.

At the foundation of the eETL approach is a logical abstraction layer that captures all the attributes of any given business entity (such as a customer, product, or order) from all source systems. Accordingly, data is collected, processed, and delivered per business entity (instance) as a complete, clean, and connected data asset.

In the extract phase, the data for a particular entity is collected from all source systems. In the transform phase, the data is cleansed, enriched, anonymized, and transformed – as an entity – according to predefined rules. In the load phase, the entity data is safely delivered to any big data store.

eETL executes concurrently on thousands of business entity instances at a time, to support enterprise-scale, sub-second response times. As opposed to batch processes, this is done continuously, by capturing data changes in real time from all source systems, and streaming them through the business entity layer to the destination data store.

Collecting, processing, and pipelining data by business entity, continuously, ensures fresh, trusted data by design. You wind up with insights you can trust, because you start with data you can trust.
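The sketch below illustrates the entity-based idea in Python (a simplified illustration, not K2View's implementation): one customer entity is assembled from several source systems, transformed and anonymized as a unit, and delivered whole. The source dictionaries and field names are invented for the example.

    # Minimal eETL sketch: extract, transform, and load one business entity
    # (a customer) rather than whole database tables.
    import hashlib
    import json

    # Stand-ins for three source systems, keyed by customer_id.
    CRM = {"42": {"name": "Ada", "email": "ada@example.com"}}
    BILLING = {"42": {"balance": 17.5}}
    ORDERS = {"42": [{"order_id": "A1", "total": 9.9}]}

    def extract_entity(customer_id):
        # Extract: collect every attribute of one customer from all sources.
        return {
            "customer_id": customer_id,
            "profile": dict(CRM.get(customer_id, {})),
            "billing": BILLING.get(customer_id, {}),
            "orders": ORDERS.get(customer_id, []),
        }

    def transform_entity(entity):
        # Transform: cleanse, enrich, and anonymize the entity as one unit.
        email = entity["profile"].pop("email", None)
        if email:
            entity["profile"]["email_hash"] = \
                hashlib.sha256(email.encode()).hexdigest()
        entity["order_count"] = len(entity["orders"])  # enrich
        return entity

    def load_entity(entity):
        # Load: deliver the complete, connected entity to the target store
        # (printed here; in practice a warehouse, lake, or message bus).
        print(json.dumps(entity))

    # Each entity instance moves through the pipeline independently, so
    # thousands of instances can be processed concurrently.
    load_entity(transform_entity(extract_entity("42")))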


The eETL process

Entity-based pipelining resolves the scale and processing drawbacks of ETL, since all phases (Extract-Transform-Load) are performed on small amounts of data, rather than on massive database tables joined together.

Entity-based ETL technology supports real-time data movement through messaging, streaming, and CDC methods for data integration and delivery – and matches the right integrated data to the right business entity in flight, as sketched below.

eETL represents the best of both (ETL and ELT) worlds, because the data in the target store is:

• Extracted, transformed, and loaded – from all sources, to any data store, at any scale – via any data integration method: messaging, CDC, streaming, virtualization, JDBC, and APIs
• Always clean, fresh, and analytics-ready
• Built, packaged, and reused by data engineers, for invocation by business analysts and data scientists
• Continuously enriched and connected, to support complex queries
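This change-driven flow can be pictured with a rough Python sketch, under assumed event and handler names: each change captured from a source system is matched to its business entity in flight, and the refreshed entity is delivered onward.

    # Sketch of continuous, entity-based change processing. The event shape
    # and function names are illustrative assumptions.
    def deliver(entity):
        # Stream the refreshed entity to the destination data store.
        print("delivered:", entity)

    def on_change_event(event, entities):
        # Match the change to the right business entity and apply it in flight.
        entity = entities.setdefault(event["customer_id"],
                                     {"customer_id": event["customer_id"]})
        entity.setdefault(event["source"], {})[event["field"]] = event["value"]
        deliver(entity)

    entities = {}
    stream = [  # in production, an unbounded CDC or message stream
        {"source": "crm", "customer_id": "42", "field": "tier", "value": "gold"},
        {"source": "billing", "customer_id": "42", "field": "balance", "value": 12.0},
    ]
    for event in stream:
        on_change_event(event, entities)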

ETL vs. ELT vs. eETL

The comparison below summarizes the ETL, ELT, and eETL approaches to data pipelining:

Process
• ETL: Data is extracted in bulk from sources, transformed, then loaded into a DWH/lake; typically batch.
• ELT: Raw data is extracted and loaded directly into a DWH/lake, where it is transformed; typically batch.
• eETL: ETL is multi-threaded by business entities; data is clean, fresh, and complete by design; batch or real time.

Primary use
• ETL: Smaller sets of structured data that require complex data transformation; offline, analytical workloads.
• ELT: Massive sets of structured and unstructured data; offline, analytical workloads.
• eETL: Massive amounts of structured and unstructured data, with low impact on sources and destinations; complex data transformation performed in real time at the entity level, leveraging a 360-degree view of the entity; operational and analytical workloads.

Flexibility
• ETL: Rigid, requiring data pipelines to be scripted, tested, and deployed; difficult to adapt, costly to maintain.
• ELT: Data scientists and analysts have access to all the raw data; data is prepared for analytics when needed, using self-service tools.
• eETL: Highly flexible, easy to set up and adapt; data engineers define the entity data flows; data scientists decide on the scope, timing, and destination of the data.

Time to insights
• ETL: Slow – data engineers spend a lot of time building data pipelines.
• ELT: Slow – data scientists and analysts spend a lot of time preparing the data for analytics.
• eETL: Quick – data preparation is done instantly and continuously, in real time.

Compliance
• ETL: Anonymizes confidential and sensitive information before loading it to the target data store.
• ELT: With raw data loaded directly into the big data stores, there are greater chances of accidental data exposure and breaches.
• eETL: Data is anonymized, in full compliance with privacy regulations (GDPR, CCPA), before being loaded to the target data store.

Technology
• ETL: Mature and stable, used for 20+ years; supported by many tools.
• ELT: Comparatively new, with fewer data connectors and less advanced transformation capabilities; supported by fewer professionals and tools.
• eETL: Mature and stable, used for 12+ years; proven at very large enterprises, at massive scale.

Bandwidth and computation costs
• ETL: Can be costly due to lengthy, high-scale, and complex data processing; high bandwidth costs for large data loads; can impact source systems when extracting large data sets.
• ELT: Can be very costly due to cloud-native data transformations; typically requires a staging area; high bandwidth costs for large data loads.
• eETL: Low computing costs, since transformation is done per digital entity, on commodity hardware; no data staging; bandwidth costs are reduced by 90%.


Bottom line – which is best for you?

As described above, both ETL and ELT methods have their advantages and disadvantages.

Applying an entity-based approach to data preparation and pipelining enables you to overcome the limitations of both ETL and ELT, and to deliver clean, ready-to-use data at high scale, without compromising on flexibility or privacy compliance requirements.

About K2View

K2View provides an operational data fabric dedicated to making every customer experience personalized and profitable.

The K2View platform continually ingests all customer data from all systems, enriches it with real-time insights, and transforms it into a patented Micro-Database™ – one for every customer. To maximize performance, scale, and security, every micro-DB is compressed and individually encrypted. It is then delivered in milliseconds to fuel quick, effective, and pleasing customer interactions.

Copyright 2021 K2View. All rights reserved. Digital Entity and Micro-Database are trademarks of K2View. Content subject to change without notice.
