Hadoop and Spark services at CERN

CERN IT-DB-SAS Hadoop, Kafka and Spark Service, Aug 14th, 2019

Jose Carlos Luna Duran

(shortened version for MAPF discussion by Alberto Di Meglio)

Streaming and Analytics Service (SAS) at CERN IT

• Set up and run the infrastructure
• Support the user community
• Provide consultancy
• Train on the technologies

• Facilitate use
• Package libraries and configuration
• Docker clients
• Notebook service
• https://hadoop-user-guide.web.cern.ch

CERN Hadoop Service - Timeline

 2013  2014  2015  2016  2017  2018  2019

(Timeline graphic, 2013-2019. Recoverable milestones: start of the Hadoop pilot; second, third and fourth clusters installed; Hadoop 2 production with YARN and HDFS roll-out; production readiness, high availability and backups; first secured cluster for LHC mission-critical systems; LHC-critical projects committing to use the service; custom CERN distribution, XRD connector and monitoring; adoption of SQL-based solutions (Hive, Impala); adoption of the Jupyter Notebooks service (SWAN); start of the Hadoop training series; Kafka pilot; RDBMS-based projects, central IT Monitoring, IT Security and CASTOR projects moving to Hadoop; first ATLAS projects; start of the CMS Big Data project and R&D with other Big Data projects; Spark cluster on Kubernetes deployed; Hadoop 3 ready.)

Hadoop and Spark for big data analytics

• Distributed systems for data processing
• Storage and multiple data processing interfaces
• Can operate at scale by design (shared nothing)
• Typically on clusters of commodity-type servers/cloud
• Many solutions target data analytics and data warehousing
• Can do much more: stream processing, machine learning
• Already well established in the industry and open source

Scale-out data processing

Hadoop service in numbers

• 6 clusters
  • 4 production (bare-metal)
  • 2 QA clusters (VMs)
• 65 physical servers
• 40+ virtual machines
• 20+ TB of memory
• CPU: 1500+ physical cores
• 18+ PB of storage, HDDs and SSDs
• Data growth: 4 TB per day

Analytix (General Purpose) Cluster

• Bare-metal
• 50+ servers
• 13+ TB of memory
• CPU: 800+ physical cores
• 10+ PB of storage, HDDs and SSDs

Overview of Available Components used

Kafka: streaming and ingestion
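As a hedged illustration of Kafka's streaming-and-ingestion role, a minimal producer sketch using the kafka-python client; the broker address and topic below are hypothetical, not the service's actual configuration:

from kafka import KafkaProducer
import json

# Connect to a hypothetical broker and serialize events as JSON.
producer = KafkaProducer(
    bootstrap_servers="kafka-broker.example.cern.ch:9092",  # hypothetical
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Send one monitoring-style event to a hypothetical topic.
producer.send("example-topic", {"source": "demo", "value": 42})
producer.flush()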

Hadoop and Spark production deployment

• Software distribution
  • Cloudera (since 2013)
  • vanilla Apache (since 2017)
• Rolling change deployment
  • no service downtime
  • transparent in most cases
• Installation and configuration
  • CentOS 7.4
  • custom Puppet module
• Host monitoring and alerting
  • via the CERN IT Monitoring infrastructure
• Security
  • Kerberos authentication
  • fine-grained authorization integrated with LDAP/e-groups (since 2018)
• Service-level monitoring (since 2017)
  • metrics integrated with Elastic + Grafana
  • custom scripts for availability and alerting
• HDFS backups (since 2016)
  • daily incremental snapshots
  • sent to tape (CASTOR)
• High availability (since 2017)
  • automatic master failover
  • for HDFS, YARN and HBase
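From the user's side, the security and storage pieces above combine as: authenticate with Kerberos, then read directly from the secured HDFS. A minimal sketch, assuming a hypothetical keytab, principal and HDFS path rather than the service's actual layout:

import subprocess
from pyspark.sql import SparkSession

# Obtain a Kerberos ticket from a keytab (hypothetical principal and keytab path).
subprocess.run(["kinit", "-kt", "/path/to/user.keytab", "user@CERN.CH"], check=True)

# Start a Spark session and read a dataset from the secured HDFS (hypothetical path).
spark = SparkSession.builder.appName("secured-hdfs-read").getOrCreate()
df = spark.read.parquet("hdfs:///project/example/dataset.parquet")
print(df.count())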

SWAN – Jupyter Notebooks On Demand

• Service for Web-based Analysis (SWAN)
• Developed at CERN, initially for physics analysis by EP-SFT

• An interactive platform that combines code, equations, text and visualizations
• Ideal for exploration, reproducibility, collaboration

• Fully integrated with Spark and Hadoop at CERN (2018)
  • Python on Spark (PySpark) at scale
• Modern, powerful and scalable platform for data analysis
• Web-based: no need to install any software
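As an illustration of what "PySpark at scale" looks like from a notebook, a minimal sketch; it assumes the notebook is already attached to a cluster so that a Spark session named spark is available, and it uses toy data rather than a real CERN dataset:

from pyspark.sql import functions as F

# Toy DataFrame standing in for a real dataset on HDFS.
df = spark.createDataFrame(
    [("mu", 25.3), ("mu", 41.7), ("e", 18.2)],
    ["particle", "pt"],
)

# A simple aggregation; the work is distributed over the cluster's executors.
df.groupBy("particle").agg(
    F.count("*").alias("n"),
    F.avg("pt").alias("mean_pt"),
).show()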

Analytix Platform Outlook

Integrating with existing infrastructure:
• Software: HEP software
• Data: HDFS, experiments storage, personal storage

Future work

• Spark on Kubernetes (continuation), for workloads that do not profit from data locality (see the sketch after this list)
  • Ephemeral cluster
  • Bring your own cluster

• SQL on top of Big Data
  • Scale-out

• Service web portal for users (in progress)
  • Integrate the multiple components of the service
  • Single place for monitoring and requesting service resources
  • HDFS quota, CPUs, memory, Kafka topics, etc.

• Explore the big data technology landscape further
  • Presto, Phoenix, Apache Kudu, Drill, etc.
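As a hedged sketch of the "Spark on Kubernetes" direction: pointing a Spark session at a Kubernetes cluster instead of YARN. The API server URL, namespace, container image and executor count below are hypothetical, not the service's actual configuration:

from pyspark.sql import SparkSession

# Hypothetical Kubernetes API server, namespace and container image.
spark = (
    SparkSession.builder
    .appName("spark-on-k8s-sketch")
    .master("k8s://https://kubernetes.example.cern.ch:6443")
    .config("spark.kubernetes.namespace", "analytics")
    .config("spark.kubernetes.container.image", "registry.example.cern.ch/spark:3.x")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# Executors run as pods; the job itself is ordinary Spark code.
print(spark.range(1_000_000).count())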

Next Gen. CERN Accelerator Logging

• A control system with: streaming, online system, API for data extraction
• Critical system for running the LHC: 700 TB today, growing 200 TB/year
• Challenge: service level for critical production

Credit: Jakub Wozniak, BE-CO-DS

New CERN IT Monitoring infrastructure

• Critical for CC operations and WLCG
• Pipeline stages: Data Sources → Transport → Data Storage & Search → Data Access

(Architecture diagram: data sources such as FTS, Rucio, XRootD, jobs, Lemon metrics, syslog and application logs are collected through AMQ and Flume into a Kafka cluster for buffering; from there, data flows via Kafka sinks to HDFS for batch processing, to Elasticsearch and others (InfluxDB) for search and dashboards, and through processing jobs for data enrichment and aggregation; access is via CLI and API.)

• Data now 200 GB/day, 200M events/day
• At scale: 500 GB/day
• Proved to be effective on several occasions

Credits: Alberto Aimar, IT-CM-MM
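To illustrate the Kafka-to-processing leg of this pipeline, a minimal Spark Structured Streaming sketch that counts incoming monitoring events; the broker address and topic are hypothetical, and the real enrichment and aggregation jobs are considerably richer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("monit-stream-sketch").getOrCreate()

# Read events from a hypothetical Kafka topic (requires the
# spark-sql-kafka package on the classpath).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker.example.cern.ch:9092")
    .option("subscribe", "monitoring-events")
    .load()
)

# Simple streaming aggregation: count events per Kafka partition as they arrive.
counts = events.groupBy("partition").count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()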

The ATLAS EventIndex

• Catalogue of all collisions in the ATLAS detector
• Over 120 billion records, 150 TB of data
• Current ingestion rates: 5 kHz, 60 TB/year

(Architecture diagram: event metadata is extracted by Grid jobs on the WLCG and shipped to CERN, where it is stored in MapFiles and HBase tables on Hadoop, with an object store and an RDBMS alongside; data enrichment, analytics and web UIs for event extraction sit on top.)
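Purely as an illustration of an HBase-backed event catalogue (not the EventIndex's actual schema, table or column names), a minimal lookup sketch with the happybase Python client, assuming a row key that encodes run and event numbers:

import happybase

# Hypothetical HBase endpoint and table/column names.
connection = happybase.Connection("hbase-gateway.example.cern.ch")
table = connection.table("eventindex_demo")

# Look up one event record by a run/event row key.
row_key = b"run00358031:evt0000123456"
record = table.row(row_key)
print(record.get(b"meta:guid"), record.get(b"meta:lumiblock"))

# Scan all events of one run via a row-key prefix.
for key, data in table.scan(row_prefix=b"run00358031:"):
    pass  # process each event's metadata here

connection.close()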

CMS Data Reduction Facility

• R&D: CMS Big Data project, CERN openlab, Intel
  • Reduce time to physics for PB-sized datasets
  • Explore a possible new way to do HEP analysis
  • Improve computing resource utilization
  • Enable physicists to use tools and methods from "Big Data" and open source communities
• CMS Data Reduction Facility:
  • Goal: produce reduced data n-tuples for analysis in a more agile way than current methods
  • Currently testing: scaling up with larger datasets; the first prototype was successful but used only 1 TB

Data Processing: CMS Open Data Example

Let's calculate the invariant mass of a di-muon system!

• Transform a collection of muons to an invariant mass for each Row (Event).
• Aggregate (histogram) over the entire dataset.

# read in the data (ROOT files via the spark-root reader;
# run in a PySpark session where sqlContext is available)
df = sqlContext.read \
    .format("org.dianahep.sparkroot.experimental") \
    .load("hdfs:/path/to/files/*.root")

# count the number of rows
df.count()

# select only muons
muons = df.select(
    "patMuons_slimmedMuons__RECO_.patMuons_slimmedMuons__RECO_obj.m_state"
).toDF("muons")

# map each event to an invariant mass (toInvMass defined elsewhere on the original slide)
inv_masses = muons.rdd.map(toInvMass)

# use histogrammar to perform the aggregation
import histogrammar

empty = histogrammar.Bin(200, 0, 200, lambda row: row.mass)
h_inv_masses = inv_masses.aggregate(
    empty, histogrammar.increment, histogrammar.combine
)

Credits: V. Khristenko, J. Pivarski, diana-hep and CMS Big Data project

Machine Learning

(Pipeline diagram, using data and models from physics: input labeled data → feature engineering → hyperparameter optimization (random/grid search) → distributed training of DL models at scale → output: particle-selector model. AUC = 0.9867.)
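A minimal sketch of such a pipeline written with Spark ML: feature assembly, a classifier, and grid-search hyperparameter optimization with cross-validation. The feature names and toy data are invented, and the work shown on the slide uses BigDL for the distributed deep-learning stage rather than the simple classifier used here:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ml-pipeline-sketch").getOrCreate()

# Toy labeled data standing in for real physics features.
rows = [(0.5, 1.2, 0.0), (0.4, 1.0, 0.0), (0.6, 0.9, 0.0),
        (0.3, 1.4, 0.0), (0.7, 1.1, 0.0), (0.5, 0.8, 0.0),
        (2.3, 0.2, 1.0), (2.1, 0.4, 1.0), (2.6, 0.3, 1.0),
        (2.4, 0.1, 1.0), (2.2, 0.5, 1.0), (2.7, 0.2, 1.0)]
df = spark.createDataFrame(rows, ["feat1", "feat2", "label"])

# Feature engineering + model as one pipeline.
assembler = VectorAssembler(inputCols=["feat1", "feat2"], outputCol="features")
clf = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, clf])

# Grid-search hyperparameter optimization with cross-validation (AUC metric).
grid = ParamGridBuilder().addGrid(clf.regParam, [0.01, 0.1]).build()
evaluator = BinaryClassificationEvaluator(labelCol="label")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=2)

model = cv.fit(df)
print("best AUC on the toy data:", max(model.avgMetrics))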

Machine Learning Pipelines with Apache Spark and Intel BigDL:

https://indico.cern.ch/event/755842/contributions/3293911/attachments/1784423/2904604/posterMatteo.pdf

Medical Applications Resources

• Proposal in 2018 to inject resources into SAS dedicated to MA projects
• The proposal was discussed and approved
• 80K CHF were allocated in 2018
• Agreement to "hook" the funds onto standard hardware procurement in IT to exploit special pricing conditions

Medical Applications Resources

• Initial purchase of CPU and disks for Analytix (Jan 2019):
  • 4 nodes with standard CPUs (same as for physics analysis)
  • 512 GB RAM per node = 2 TB RAM
  • 1 TB SSD per node = 4 TB storage
• Pooled as part of the existing cluster, with the possibility of "spilling over" for limited-time activities

Medical Applications Resources

• How to access:
  • A "project request" mechanism is being set up
  • For now, just contact me or the IT-DB-SAS lead

Medical Applications Resources

• Planning for additional resources:
  • Increasing requests for GPUs/FPGAs for deep learning training
  • If there is interest from MA and funds can be allocated in 2019, a purchase can be attached to the September procurement round and be available at the beginning of 2020
