Evolution of the Logging Service: Hands-on Hadoop Proof of Concept for CALS-2.0

Chris Roderick, Marcin Sobieszek, Piotr Sowinski, Nikolay Tsvetkov, Jakub Wozniak (courtesy of IT-DB)

Agenda

• Intro to CALS System • Hadoop Ecosystem / IT Support At CERN • Data Formats • Proof Of Concept – Data Ingestion – Data Extraction – Performance Results • Challenges / Benefits / Conclusions • Questions?

(Short) Introduction to CALS

CERN Accelerator Logging Service

• Started in 2001 • Critical service for running the LHC (and others) • Mandate – Stores acquisition device/property data • Mainly NUMERIC timeseries, not text logs… – Information for acc. performance improvement – Decision support system for management – Avoids duplicate logging efforts

CALS Architecture

[Diagram: CALS architecture – a Providers layer (Middleware, WinCC (PVSS)), Logging Processes feeding the Persistence layer (MDB and LDB Oracle databases, SCADAR, QPSR), and an Extraction layer of Extraction Servers used by Timber and other applications.]

CALS In Numbers
• MDB – 260,000 variables (signals)
• LDB – 1,482,000 variables (signals)
• Number of data points (dp) – 5,000,000,000 dp/day – 1.6E12 dp/year
• 6,000,000 extraction requests per day
• Storage – MDB -> ~40 TB total, 3 months of data (~700 GB/day) – LDB -> ~430 TB total (~570 GB/day)

Storage Evolution

[Chart: daily logged data volume in GB/day from 2008 to 2016, growing towards ~700 GB/day; the system was originally designed for 1 TB/year.]

Current Issues – CALS & CO
• Dramatic increase of data load (in/out) – Frequency increase to 10 Hz (more in the future) – Very big vector data (~2E06 data points) – Refusal to filter the data (QPS -> all of it needed) – Injectors data (request for 20k new variables), analog signals
• Have to extract all data first to analyze it -> API limited
• Transfer issues between MDB/LDB – frequent custom manual transfers
• Complex setup – 5 DB schemas (x DEV/TEST/INT)
• Limited monitoring / configuration
• Support -> 20+ issues / week (1.5 persons)
• Emerging custom logging systems (MDs, Collimators, BLMs, RF Transverse Damping / Instabilities observation boxes) – CO is involved -> maintenance problems in the future?
• Rising Python “pressure”

CALS-2.0 Motivations: A Changing Landscape Brings New Challenges

• Make analysis of bigger data sets over longer time windows possible – Provide analytical functionalities – Increase bandwidth & processing power (16-machine cluster => 20x) – Use the right tools for the job (BigData toolset) – Very difficult with the current Oracle setup, as it is stretched to its very limits
• Limit external logging efforts
• Allow better API integration with the outside community (Python)
• Limit the extensive Oracle expertise required currently
• Limit costs (around 30%) with commodity hardware
• Renovate the (aging) system to meet evolving requirements – Improve configuration / maintenance / system monitoring

CALS-2.0 Proof Of Concept

• Team of roughly 2.5-3 people – Help from IT-DB-SAS (many thanks!) • Work from early February 2016 until early May 2016 (~3 months) • Mandate – Explore and learn Hadoop technology – Implement PoC for CALS-2.0 (store, extract) – Present results to community

Hadoop

Hadoop
• Open-source framework for large-scale data processing – Distributed storage and processing – Shared-nothing architecture – scales horizontally – Optimized for high throughput on sequential data access

[Diagram: shared-nothing cluster – each node has its own CPU, memory and disks, connected over an interconnect network.]

Hadoop

• Good for: – Parallel processing of large amounts of data – Performing analytics at a big scale – Dealing with diverse data: structured, semi-structured, unstructured
• But not optimal for: – Random reads and real-time access – Small datasets – Updates / appends

Hadoop Service in IT
• Set up and run the infrastructure
• Provide consultancy
• Build the community

• As per Oracle support

• Joint work – IT-DB and IT-ST

Hadoop Clusters in IT (Oct 2015)

• lxhadoop (22 nodes) – general purpose cluster (mainly used by ATLAS) – stable software setup – recent hardware • analytix (56 nodes) – for analysis of monitoring data – varied hardware specifications – the biggest in terms of number of nodes • hadalytic (14 nodes, 224 cores, 768 GB of RAM) – general purpose cluster with additional services – recent hardware


Hadoop Ecosystem

[Diagram: Hadoop ecosystem components]
• HDFS – Hadoop Distributed File System
• YARN – cluster resource manager
• MapReduce / Spark – large-scale data processing
• Hive – SQL; Pig – scripting; Impala – SQL
• HBase – columnar NoSQL store
• Sqoop – data exchange with RDBMS
• Flume – log data collector
• ZooKeeper – coordination

Data Formats

File Formats – Apache Avro

• Row oriented • Compact, fast, binary serialization format • Rich data structures – scalars, arrays, maps, structs, rows, etc • Can use compression

File Formats – Apache Parquet

• Google “Dremel” white-paper • Columnar storage • Very efficient compression algorithms – Delta encodings – Binary (bit) packing – Dictionary • Pushdowns – Projection pushdown – Predicate pushdown • Very efficient reads (avoid reading unwanted data)
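As an illustration of how the pushdowns look from the reader's side, a minimal Spark sketch (assuming an existing SQLContext as in the later examples; the path and column names are invented):

// Hypothetical Parquet read with column projection and a predicate, both of
// which can be pushed down so unwanted data is never decoded.
DataFrame df = sqlContext.read().parquet("/cals/parquet/device_data");
DataFrame slice = df
    .select(df.col("stamp"), df.col("value"))        // projection pushdown: only these columns are read
    .filter(df.col("stamp").geq(1451606400000L));    // predicate pushdown: non-matching row groups are skipped
slice.show();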

Parquet
[Diagrams: columnar storage layout; projection and predicate pushdowns.]

Parquet vs. Avro – Our Experience
• Entropy in the data -> better compression
• Avro file – 6 GB
• Parquet file – 1.36 GB (same data)
• More than 4x better storage ratio on disk
• Reads from Parquet are 4x-5x faster – (Spark, with column projection)

• Our PoC scenario: – Data comes as Avro – “Data compactor” converts to Parquet – Compaction creates big files from many small ones – Parquet files are read by Impala / Spark
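A rough sketch of what such a compaction step can look like in Spark (not the actual PoC code; assumes the spark-avro package is available and uses invented paths):

// Read one day of small Avro files and rewrite them as a few large Parquet files.
DataFrame day = sqlContext.read()
    .format("com.databricks.spark.avro")
    .load("/cals/staging/MyClass/1/Acquisition/2016-05-01");
day.coalesce(4)                               // merge many small files into a few big ones
   .write()
   .mode(SaveMode.Overwrite)
   .parquet("/cals/data/MyClass/1/Acquisition/2016-05-01");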

PoC (Lambda) Architecture

[Diagram: PoC (Lambda) architecture – Logging Processes feed an ingestion layer (Kafka / Flume / Gobblin, push vs. pull still open) which fills a speed layer (HBase) and, via a Compactor, a batch layer (HDFS queried with Impala/Spark); a schema/partition provider is backed by the CCDB; data ingestion, data extraction and storage choices are marked as open questions.]

Example shown on the slide – writing an Avro record to HDFS:

//1 - create HDFS file system
Configuration conf = new Configuration();
FileSystem fileSystem = FileSystem.get(conf);

//2 - create Avro schema
Schema schema = createAvroSchema();

//3 - create Avro file writer
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
FSDataOutputStream outputStream = fileSystem.create(new Path("/user/cals/data.tmp"));
dataFileWriter.create(schema, outputStream);

//4 - create record
GenericRecord record = createRecord(…, schema);

//5 - write record
dataFileWriter.append(record);

Data Ingestion

Data Ingestion Objectives

• Acquire data from different data sources and store it in persistent storage
• Data latency < ~30 s (from being published to being available to users)
• No data losses allowed
• Data must be kept in the ingestion layer (for some time) in case of storage-layer unavailability (i.e. maintenance)
• Provide data transformation features – Enhancing by adding context info, e.g. beam mode – Filtering out – Distinct until changed, sampling, etc.
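To make the last point concrete, a plain-Java sketch (not the PoC implementation) of a “distinct until changed” filter:

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Forwards a record only when its value differs from the previous value
// seen for the same device/property key.
class DistinctUntilChangedFilter {
    private final Map<String, Object> lastSeen = new HashMap<>();

    boolean accept(String devicePropertyKey, Object value) {
        Object previous = lastSeen.put(devicePropertyKey, value);
        return !Objects.equals(previous, value);
    }
}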

Ingestion Architecture

[Diagram: ingestion architecture – Logging Processes publish to Kafka in 100 ms batches; Gobblin pulls every ~30 s and writes to HBase (speed layer) and HDFS; the Compactor runs every ~7 min on HDFS (batch layer); a schema/partition provider is backed by the CCDB; the storage format is still an open question.]

Data Collection Overview
1. Acquire data from JAPC
2. Convert to Avro – get the schema for the given device and property
3. Serialize and send asynchronously to Kafka (see the sketch after the diagram below)

[Diagram: Acquire -> Transform -> Publish pipeline, sending to Kafka in 100 ms batches; the schema provider is backed by the CCDB.]
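Sketch of steps 2–3 above (createAvroSchema and fillRecord are hypothetical helpers standing in for the schema provider and the JAPC conversion code):

Schema schema = createAvroSchema(deviceClass, property);            // from the CCDB-backed schema provider
GenericRecord record = new GenericData.Record(schema);
fillRecord(record, acquiredValue);                                  // copy the JAPC fields into the Avro record

ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
encoder.flush();
byte[] payload = out.toByteArray();                                 // handed to the Kafka producer (async)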

APV -> Avro Conversion

[Diagram: example APV data and the corresponding Avro record and Avro schema.]

Apache Kafka on 2 slides
• A distributed, partitioned, replicated message broker
• High throughput, low latency, scalable, centralized, awesome

• Originally developed at LinkedIn in 2011
• Graduated from the Apache Incubator in 2012

Apache Kafka on 2 slides
• Producers, Brokers, Topics, Consumers
• Topics are broken into (replicated) partitions
• Messages are assigned a sequential ID called the offset
• Messages are retained with a configurable SLA

• Messages are stored on the file system
• Optimized OS operations: page cache, sendfile(), zero copy
• Replication of partitions as the default design approach
• Guarantees fault tolerance

CALS-2.0 Kafka Setup

• Two Kafka brokers organized in one cluster, using the broadly available Hadoop ZooKeeper
• Each broker holding 7 topics with 10 partitions
• One topic maps to one Logging Process
• Each topic created with replication factor = 2
• Each device/property is always stored in the same partition
• Producers send messages in async mode with 100 ms batches
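A rough illustration of that producer setup with the standard Java client (property values and the topic name are examples, not the exact PoC configuration):

Properties props = new Properties();
props.put("bootstrap.servers", "kafka-broker-1:9092,kafka-broker-2:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
props.put("linger.ms", "100");     // batch messages asynchronously for up to ~100 ms
props.put("acks", "1");

KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
String key = device + "/" + property;                                     // same key -> always the same partition
producer.send(new ProducerRecord<>("logging-process-1", key, payload));   // asynchronous send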

Ingestion Architecture

[Diagram: refined ingestion architecture – Logging Processes publish to Kafka in 100 ms batches; Gobblin pulls every ~1 min and writes to HBase (speed layer) and HDFS; the Compactor runs every ~7 min on HDFS (batch layer); schema/partition provider; the storage format is still an open question.]

LinkedIn Gobblin
• Universal data ingestion / ETL framework
• Composed of: – Source – intelligent task-partition assignments – Work Unit – Extractor – pulls data from the source – Converter – filtering, projection, type conversion, etc. – Quality Checker – schema compatibility, unique keys – Writer – one per task/work unit – Data Publisher – publishes to final directories

[Diagram: Gobblin pipeline – the Source creates Work Units; each task runs Extractor -> Converter -> Quality Checker -> Writer, and the Data Publisher moves the output to its final directories.]

CALS-2.0 Gobblin Setup
• 2 standalone instances, each running 7 jobs pulling data from a dedicated topic
• Each job composed of 10 tasks (one per partition)
• Custom Extractor to convert Kafka records to Avro
• Custom Writers to write data to HBase and HDFS
• Custom Data Partitioners for: – HDFS class/version/property/yyyy-mm-dd directories – HBase class_version_property tables


CALS-2.0 Data Partitioning
• Data stored as records {f1,f2,…} partitioned by class/version/property/yyyy-mm-dd
• Pros – Convenient for gathering data statistics, e.g. space used per client/system – Convenient to move/backup/restore on demand – More efficient for scanning (less data to process)
• Cons – We need the history of changes for a given device to know the proper location of its data over time
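A hypothetical helper, just to make the directory scheme concrete (class and property names are invented):

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class HdfsPartitioner {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // One directory per class / version / property / day.
    static String partitionPath(String deviceClass, int classVersion, String property, LocalDate day) {
        return String.format("/cals/data/%s/%d/%s/%s", deviceClass, classVersion, property, day.format(DAY));
    }
}
// e.g. partitionPath("MyClass", 1, "Acquisition", LocalDate.of(2016, 5, 1))
//      -> "/cals/data/MyClass/1/Acquisition/2016-05-01"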

Data Ingestion Summary
• System ran for ~2 months without any problems
• Data load between 7k and 10k events/second
• Big chunks of data handled smoothly – Big records with ~100k fields – Big records at 1 GB/min
• No data losses observed

Data Extraction

Impala

What Is Cloudera Impala?

• MPP query engine running on Hadoop
• Low-latency SQL queries on data in HDFS (Parquet/Avro) and Apache HBase
• For big-data processing and analytics directly via SQL or business-intelligence tools
• Supports the most common SQL-92 features of HiveQL
• Supports Hadoop security (Kerberos, Sentry)
• Provides JDBC/ODBC drivers

Extraction API

• Client API – a set of Java methods used by external applications for data retrieval

Extraction API Implementation Details
• Getting the HDFS and HBase table names for a signal
• Creating references to the tables on the fly (including all the partitions)
• Handling of recent and long-term data (union of different data sources)
• Basic filtering implementation
• Use of analytic and aggregation functions
• Effective use of partitioning – query optimization
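From a client's point of view the Impala path boils down to standard JDBC; a sketch (the URL, table and column names are invented, and an Impala-capable JDBC driver is assumed to be on the classpath):

// Extraction-style query over JDBC; relies on day-based partition pruning.
try (Connection conn = DriverManager.getConnection("jdbc:impala://impala-host:21050/cals");
     Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery(
         "SELECT stamp, value FROM myclass_1_acquisition " +
         "WHERE day BETWEEN '2016-05-01' AND '2016-05-02' " +   // prunes partitions on the day column
         "ORDER BY stamp")) {
    while (rs.next()) {
        System.out.println(rs.getTimestamp("stamp") + " " + rs.getDouble("value"));
    }
}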

Cache Performance
• Test performed: 7.1 GB of data distributed over 14 machines
• HDFS cache management compared with the OS buffer cache
• Query execution time between 26 and 29 seconds
• Conclusion – CPU-bound (decoding), time for I/O is irrelevant – the cache does not help in our case

[Chart: operation times in seconds – real execution time, CPU time, decoding (all threads) and I/O (all threads) – for three setups: both caches, OS buffer cache only, HDFS cache only.]

What Is Still Missing In Impala

• Support for the array type – The specific implementation results in a Cartesian product – Can only be used in tables with the Parquet file format (HBase issue – arrays not implemented) – Introduces some overhead (at most a 2x slowdown compared to tables without any complex types) – Cloudera contacted, development foreseen, no dates
• Outer joins not implemented (work on one node only)

What Is Spark?

Distributed data processing framework
– Easy to use (compared to its predecessor, MapReduce)
– In-memory and fast (10-100x faster than MapReduce)
– General purpose • Unified platform for different types of data processing jobs • Supports several APIs (Scala, Python, Java, R)
– Scalable • Increases data processing capacity by extending the cluster
– Fault tolerant • Automatically handles failure of a node in the cluster

High-level Architecture

• A Spark application involves five key entities: – Driver program – Cluster manager – Workers – Executors – Tasks
[Diagram: the Driver talks to the Cluster Manager, which allocates Executors on Worker Nodes; each Executor runs Tasks.]
• Cluster managers: – Standalone – YARN – Mesos

RDD
• RDD (Resilient Distributed Dataset) – Main Spark abstraction – A collection of rows partitioned across the nodes of the cluster that can be operated on in parallel
• There are two ways of creating an RDD: – Parallelizing a collection – Reading from an external source (HDFS, HBase, Amazon S3, etc.)
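Both ways, in the Java API (a minimal sketch assuming an existing JavaSparkContext named sc; the path is illustrative):

// 1) Parallelizing a collection
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
// 2) Reading from an external source (here a text file on HDFS)
JavaRDD<String> lines = sc.textFile("hdfs:///cals/demo/sample.txt");
long nonEmpty = lines.filter(line -> !line.isEmpty()).count();   // a transformation followed by an action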

RDD
• Transformations and Actions
[Diagram: transformations and actions executed as tasks on the nodes of the cluster.]

Spark SQL and DataFrame API

• Higher-level abstraction for processing structured data
• Makes Spark easier to use by providing a SQL interface
• Increases the performance of Spark applications

[Diagram: Spark stack – Spark SQL (DataFrames, SQL/HiveQL), Spark Streaming and MLlib built on the RDD API and Spark Core, reading from various data sources.]

Spark in CALS-2.0

• Data Compaction
• Data Extraction Server – Interactive API for the clients – Implemented representative CALS methods (as with Impala) – Support for complex types
• Spark SQL

DataFrame df = sqlContext.read().load("examples/src/main/resources/users.parquet");
df.registerTempTable("people");
DataFrame names = sqlContext.sql("select name from people where age > 21");
Row[] result = names.collect();

• DataFrames

DataFrame df = sqlContext.read().load("examples/src/main/resources/users.parquet");
DataFrame res = df.filter(df.col("age").gt(21)).select(df.col("name"));
List<Row> resultList = res.collectAsList();

Data analysis in CALS-2.0

• Spark opens many doors for data analysis over Logging data – Could be used for direct access to the HDFS data – Supports Python, R and Scala – Possible replacement of the Logging API – Provides access through open-source data analysis Notebooks

• Spark Notebooks – Web interface with built-in Spark integration – Data visualisation (tables, charts, etc.) – Dynamic input forms and data widgets – Support collaborative work and publishing results online (Jupyter Notebook, Apache Zeppelin Notebook)

Initial Extraction Performance Tests

Performance Tests Setup

• Real data distribution – KB / day -> 35% of data logged – MB / day -> 60% of data logged – GB / day -> 5% of data logged

• Generated data (GB / day) – 10 devices – 1.2 s updates -> 263 million rows / year – 1 day = 1.36 GB, 0.5 TB of total data / year

• Shared hadalytic cluster • Impala 2.3 (max ~224 cores, adaptive) • Spark 1.6.1 (120 cores, fixed) • No in-depth optimization done so far

Real Data Scan (MB/day) (Lower Is Better, Log Scale)

[Chart: query time in seconds (log scale) for Oracle, Impala and Spark vs. number of records scanned, from 42 to 3,000,000.]

1 Year Scan On 0.5 TB Demo Data (Lower Is Better, Log Scale)

[Chart: query time in seconds (log scale) for Oracle, Impala and Spark vs. number of records, from 23,000 to 2,650,000.]

Real Data – Arrays Scan (Lower Is Better, Log Scale)

[Chart: query time in seconds (log scale) for Oracle and Spark vs. number of array records, from 893 to 601,016.]

Performance Results

• Hadoop outperforms Oracle for BigData queries
• Improvement needed for SmallData queries – HBase?
• Spark seems a little bit better than Impala
• Arrays / structs are not a problem (with Spark)
• Promising, but further analysis is required

Challenges / Benefits / Planning / Conclusions

Challenges
• Evolving Hadoop ecosystem – Embrace it (whilst keeping the data storage formats) – Architecture of interchangeable micro-modules – Expose little of it to users
• Spark as the user analytics API in the long term – Open source, major players in the game (IBM, Cloudera, Databricks) – $300M from IBM on Spark development – http://fortune.com/2015/09/09/cloudera-spark-mapreduce/
• No experience with backup technology – IT commitment for Hadoop in general
• CALS-2.0 Extraction API change (impact on users) – Keep the old API backwards compatible (internally using the new API) – Keep Timber – Allow a smooth transition period (keep both APIs, etc.)
• More maintenance during the transition period – Partial feature freeze of the old system necessary

Benefits
• Open the door for BigData queries, tools & techniques
• More generic & accommodating architecture… – …that just logs records {f1,f2,…} – Avoids many scattered “logging” systems – Keeps together & shares all data • Acc. devices, industrial controls, tracing, post-mortem, MDs, alarms – Allows client-oriented data governance (policy per client)
• Spark analytics – Python & Notebooks/Dashboards, Machine Learning, R
• Horizontal scaling to cover ever-growing needs
• Cost & maintenance reduction
• Strengthened collaboration between groups
• “Attractive” technology -> easier to recruit for

CALS-2.0 Planning (for 4 people)
• 2016 – Establish “partners” (TE-MPE, ICS) and grounds for collaboration – Full architecture (design, Impala/Spark) – Find “sponsors” and implement their use cases (LHC BI BLMs, Collimators, ABP Studies, MDs, WinCC, Tracing, etc.) – First semi-operational results end of 2016 / early 2017
• 2017-2018 – APIs, migration, configuration, monitoring, further sponsors
• March 2018 (CALS-2.0 bullet-proof) – Deadline for the DB HW storage replacement decision
• LS2 (2019-2020) – March 2019: end of maintenance of the current DB HW storage – Full decommissioning of the old CALS – CALS-2.0 in production (and good for the next 15 years!)

IT-DB Future SLA & Infrastructure Requirements
• Critical system => needs 24/7 support
• Stability during the accelerator run (upgrades during technical stops only)
• Dedicated Hadoop cluster required (~16 machines, TBD)
• Storage of 1 PB is a good starting point – prepare for 1 TB / day – total volume/retention TBD…
• Backup is essential

Conclusions

• CALS has been a successful offering from BE-CO for many years
• As the accelerators have evolved and matured, logging and analysis needs have also evolved
• The current CALS is reaching hard limits

• Technically possible to replace the old system while satisfying the known requirements better
• Productive collaboration with IT-DB
• Positive response from BE-CO management
• Planned to be ready by LS2
