Data Lake

Outline

• Definition
• Architecture for a data lake
• Software components

History

• The term “data lake” was coined in 2010 by James Dixon
• He contrasted the approach to managing data on Hadoop with data marts/warehouses:
  – a data mart is a store of bottled water: cleansed, packaged and structured for easy consumption
  – the data lake is a large body of water in a more natural state
    • the contents of the data lake stream in from a source to fill the lake
    • users of the lake can come to examine, dive in, or take samples

Needs

• Enterprise Data Warehouses (EDW) are rigid, very well structured data models
• New sources
  – Volume
  – Variety
  – Velocity (often less relevant, but not always)
• Licence costs
• Change of paradigm

Comparison

From top-down to bottom-up

• https://www.slideshare.net/jamserra/big-data-architectures-and-the-data-lake

Data Lake

• To realize a bottom-up approach, there is no need to define the schema of the data before loading it
• Data is modelled according to the task at hand (schema-on-read); a minimal sketch is given below
• In an EDW, by contrast, the task is strongly tied to how the data is modelled up front
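As an illustration of schema-on-read, raw files can sit untouched in the lake and a schema is declared only when a task needs one, for instance as a Hive external table. This is a minimal sketch, assuming comma-separated click events are already stored under the hypothetical HDFS path /data/lake/raw/clickstream and a HiveServer2 instance is reachable on localhost; the column names are made up for the example:

# apply the schema at read time; the files themselves are never transformed
beeline -u jdbc:hive2://localhost:10000 -e "
  CREATE EXTERNAL TABLE clicks (user_id STRING, url STRING, ts STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/lake/raw/clickstream';"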

When write?

Hadoop-based architecture

Data Lake Architecture

Data lake component architecture

A different perspective

Storage

• What is the best solution for
  – any type, any size, any rate?
• A file system!
  – a distributed file system → the Hadoop Distributed File System (HDFS)
• And a NoSQL engine?
  – any type? (no)
  – any size? (yes)
  – any rate? (maybe)

Data storage

Data Ingestion
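Before looking at dedicated ingestion tools (publish/subscribe systems, Flume, Sqoop), note that the simplest way to fill an HDFS-based lake is to copy files in as they are, whatever their type or size. A minimal sketch, assuming the hypothetical lake layout /data/lake/raw/<source>:

# create a landing area per source and load raw files of any format
hdfs dfs -mkdir -p /data/lake/raw/clickstream /data/lake/raw/sensors
hdfs dfs -put access-2017-01-01.log /data/lake/raw/clickstream/
hdfs dfs -put readings.avro /data/lake/raw/sensors/
hdfs dfs -ls /data/lake/raw/clickstream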

Pub/sub

• A producer publishes a message to a topic: publish(topic, msg)
• A consumer subscribes to one or more topics and receives the messages published to them

[Figure: a publish/subscribe system with producers, consumers and topics 1–3]

Typical example

[Figure: frontend and service processes publishing to a cluster of brokers]

Kafka

[Figure: Kafka as a central hub feeding real-time monitoring, security systems, news feeds, Hadoop and the DWH through connectors]
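As a concrete publish/subscribe example, Kafka ships with console clients that can play the producer and consumer roles. A minimal sketch, assuming a recent Kafka release with a broker on the hypothetical address localhost:9092 (older releases use --broker-list / --zookeeper instead of --bootstrap-server for some tools):

# create a topic with three partitions
kafka-topics.sh --create --topic clickstream --partitions 3 --replication-factor 1 \
  --bootstrap-server localhost:9092

# producer: each line typed on stdin is published as a message to the topic
kafka-console-producer.sh --topic clickstream --bootstrap-server localhost:9092

# consumer (in another shell): subscribes to the topic and prints incoming messages
kafka-console-consumer.sh --topic clickstream --from-beginning --bootstrap-server localhost:9092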

• Apache Flume (http://flume.apache.org/) is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It originated at Cloudera and is written primarily in Java, with the following components:
• Event – a singular unit of data transported by Flume (typically a single log entry)
• Source – the entity through which data enters Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, e.g. log files and syslog.
• Sink – the entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink, which writes events to HDFS.
• Channel – the conduit between the Source and the Sink. Sources ingest events into the channel and sinks drain the channel.
• Agent – any physical Java virtual machine running Flume. It is a collection of sources, sinks and channels.
• Client – produces and transmits the Event to the Source operating within the Agent

Flume

Flume is designed for high-volume ingestion of event-based data into Hadoop, e.g. collecting log files from a bank of web servers and then moving the log events from those files into HDFS (clickstream).

Flume Example

Category    Components
Source      Avro, Exec, HTTP, JMS, Netcat, Sequence generator, Spooling directory, Syslog, Thrift, Twitter
Sink        Avro, Elasticsearch, File roll, HBase, HDFS, IRC, Logger, Morphline (Solr), Null, Thrift
Channel     File, JDBC, Memory

Flume configuration

agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /var/spooldir

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://hostname:8020/user/flume/logs
agent1.sinks.sink1.hdfs.fileType = DataStream

agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 100

flume-ng agent --name agent1 --conf $FLUME_HOME/conf
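To actually start this agent, recent Flume 1.x releases also need the properties file itself passed via --conf-file; the file name agent1.properties below is a hypothetical choice:

flume-ng agent --name agent1 --conf $FLUME_HOME/conf \
  --conf-file $FLUME_HOME/conf/agent1.properties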

• Apache Sqoop (http://sqoop.apache.org/) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
• Sqoop graduated from the Incubator in March 2012 and is now a Top-Level Apache project.
• Sqoop supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run multiple times to import the updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase.
  – Exports can be used to put data from Hadoop into a relational database.
• Microsoft uses a Sqoop-based connector to help transfer data from Microsoft SQL Server databases to Hadoop.
• Couchbase, Inc. also provides a Couchbase-Hadoop connector by means of Sqoop.
  – Couchbase Server, originally Membase, is an open-source, distributed NoSQL document-oriented database optimized for interactive applications; related to but different from CouchDB.

Sqoop – SQL to Hadoop

- Open-source tool to extract data from structured data stores into Hadoop
- Architecture

Sqoop – contd.

- Sqoop schedules MapReduce jobs to carry out imports and exports
- Sqoop always requires the connector and the JDBC driver
- Sqoop needs the JDBC driver for the specific database server; it should be copied to /usr/lib/sqoop/lib
- The command line has the following structure:

  sqoop TOOL PROPERTY_ARGS SQOOP_ARGS

TOOL - the operation you want to perform, e.g. import, export, etc.
PROPERTY_ARGS - a set of parameters entered as Java properties, in the format -Dname=value
SQOOP_ARGS - all the other Sqoop parameters

Sqoop – How to run Sqoop

Example:

sqoop import \
  --connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
  --username hadoop_tutorial \
  -P \
  --num-mappers 1 \
  --target-dir visitcount_rfidlog \
  --table VISITCOUNT.RFIDLOG

Sqoop – how to parallelize

- Import a whole table (sequential by default):
  --table table_name
- Import the result of a free-form query (it must contain the $CONDITIONS placeholder):
  --query 'select * from table_name where $CONDITIONS'
- Parallel import, splitting the table on its primary key:
  --table table_name --split-by primary_key --num-mappers n
- Parallel import with an explicit boundary query for the split range:
  --table table_name --split-by primary_key --boundary-query 'select range from dual' --num-mappers n

A complete parallel import is sketched below.
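For instance, the RFID log import shown earlier could be spread over four mappers; the split column EVENT_ID is a hypothetical primary key of that table:

sqoop import \
  --connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
  --username hadoop_tutorial \
  -P \
  --table VISITCOUNT.RFIDLOG \
  --split-by EVENT_ID \
  --num-mappers 4 \
  --target-dir visitcount_rfidlog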

Data Processing

Processing

• Three stages to extract information from the data stored in the lake:
  – data preparation (see data integration)
  – data analytics (or machine learning)
  – result provisioning for consumption

• How do you perform them?
• General-purpose processing frameworks
  – Hadoop MapReduce, Spark, Flink or Storm
  – (micro-)batch (MapReduce, Spark) or stream-based (Flink, Storm); a submission sketch is given below
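Whichever framework is chosen, jobs are typically submitted to the same cluster that stores the lake. A minimal sketch with Spark on YARN, where etl_clicks.py is a hypothetical preparation job that reads raw data from the lake and writes a curated result back to HDFS:

# run a batch (or micro-batch) job next to the data
spark-submit --master yarn --deploy-mode cluster \
  etl_clicks.py hdfs:///data/lake/raw/clickstream hdfs:///data/lake/curated/clicks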

Data flow

• To execute a set of data manipulation activities
• Apache NiFi

Basic building blocks

FlowFile

Processor

• Routing
• Transformation
  – split, aggregate, enrich, convert, …
• Mediation
  – push/pull

Connection

• Queuing
• Expiration
• Prioritization
• Swapping

Workflow

• To orchestrate several units of work
• Oozie, Falcon (a submission sketch is given below)
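With Oozie, for example, the units of work are described in a workflow definition stored in HDFS and the whole workflow is submitted from the command line. A minimal sketch, assuming an Oozie server at the hypothetical URL http://localhost:11000/oozie and a job.properties file that points to the workflow definition:

# submit and start the workflow described by job.properties
oozie job -oozie http://localhost:11000/oozie -config job.properties -run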

Data Governance

Data lineage

Metadata management

• Apache Atlas
• Knowledge graph

In summary

Architectural pattern

Lambda architecture

• For real-time data there is the need to capture insights as they arrive
  – e.g. data generated by sensors or social media
• This means working with data streams from which results have to be generated with minimal delay
• Buffering the data on storage and applying batch-processing techniques to the buffered chunks is fine for non-real-time needs
• Often both needs have to be served at the same time

Lambda Architecture

Limitations of the lambda architecture

• The main limitation of the lambda architecture is that the application logic must be implemented twice
• Often with different tools
  – e.g. Storm or Flink for the speed layer, Spark for the batch layer

The Kappa architecture

References

• http://www.oreilly.com/data/free/files/architecting-data-lakes.pdf
• Christian Mathis, "Data Lakes", Datenbank-Spektrum, 2017 (see on Moodle)