Hadoop and Spark services at CERN
CERN IT-DB-SAS Hadoop, Kafka and Spark Service Aug 14th, 2019
Jose Carlos Luna Duran
(shortened version for MAPF discussion by Alberto Di Meglio)
Streaming and Analytics Service (SAS) at CERN IT
• Set up and run the infrastructure
• Support the user community
• Provide consultancy
• Train on the technologies
• Facilitate use: packaged libraries and configuration, Docker clients, notebook service
• https://hadoop-user-guide.web.cern.ch

CERN Hadoop Service – Timeline
• 2013: start of the Hadoop pilot service; first projects by ATLAS and CASTOR
• 2014: adoption of Hadoop 2 (YARN) and Apache Spark; second cluster installed; start of the Hadoop training series
• 2015: adoption of SQL-based solutions (Hive, Impala); third cluster installed; Central IT Monitoring project moves to Hadoop
• 2016: production ready: HDFS high availability and backups, custom CERN distribution; first secured cluster deployed; IT Security project moves to Hadoop
• 2017: adoption of the Jupyter Notebooks service (SWAN); fourth cluster installed; started commits to the upstream projects
• 2018: LHC-critical cluster installed (for LHC mission-critical systems); R&D with other projects to move RDBMS-based services to Big Data; start of the CMS Big Data project
• 2019: starting Kafka pilot; rolling out production-ready Hadoop 3; XRD connector and monitoring; Spark cluster on Kubernetes

Hadoop and Spark for big data analytics
• Distributed systems for data processing
• Storage and multiple data-processing interfaces
• Can operate at scale by design (shared nothing)
• Typically run on clusters of commodity-type servers or in the cloud
• Many solutions target data analytics and data warehousing
• Can do much more: stream processing, machine learning
• Already well established in industry and open source
• Scale-out data processing

Hadoop service in numbers
• 6 clusters: 4 production (bare-metal), 2 QA (virtual machines)
• 65 physical servers, 40+ virtual machines
• 20+ TB of memory
• 1500+ physical CPU cores
• 18+ PB of storage (HDDs and SSDs)
• Data growth: 4 TB per day

Analytix (General Purpose) Cluster
• Bare-metal: 50+ servers
• 13+ TB of memory
• 800+ physical CPU cores
• 10+ PB of storage (HDDs and SSDs)
Overview of the Available Components
Kafka: streaming and ingestion
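Kafka's role here, buffered and decoupled streaming ingestion, can be illustrated with a standard-library producer/consumer analogy. This is plain Python, not the Kafka API; the "topic" queue and event records are purely illustrative:

```python
import queue
import threading

# Stdlib analogy for Kafka ingestion: producers write to a buffered "topic",
# a consumer reads at its own pace (producer and consumer are decoupled).
topic = queue.Queue(maxsize=100)  # bounded buffer standing in for a Kafka topic

def producer(n):
    for i in range(n):
        topic.put({"event_id": i})  # hypothetical event records
    topic.put(None)                 # end-of-stream marker

consumed = []

def consumer():
    while True:
        msg = topic.get()
        if msg is None:
            break
        consumed.append(msg)

t = threading.Thread(target=producer, args=(10,))
c = threading.Thread(target=consumer)
t.start(); c.start()
t.join(); c.join()
print(len(consumed))  # 10
```

In the real service the buffer is a replicated, persistent Kafka cluster, so consumers can also replay past data.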
Hadoop and Spark production deployment
Software distribution:
• Cloudera (since 2013)
• Vanilla Apache (since 2017)

Rolling change deployment:
• no service downtime; transparent in most cases

Installation and configuration:
• CentOS 7.4, via a custom Puppet module

Host monitoring and alerting:
• CERN IT Monitoring infrastructure

Security (since 2017):
• Kerberos authentication
• fine-grained authorization integrated with LDAP/e-groups (since 2018)

Service-level monitoring:
• metrics integrated with Elasticsearch + Grafana
• custom scripts for availability checks and alerting
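The kerberized operations involved, authenticating with a keytab and then taking an HDFS snapshot for a daily backup, can be sketched as command construction. This is illustrative only: the keytab path, principal and HDFS path are hypothetical, not CERN's actual tooling:

```python
# Illustrative sketch (not the CERN tooling): build the commands a kerberized
# daily HDFS snapshot job might run. All paths and names are hypothetical.

def backup_commands(path, snapshot_name, principal="backup@EXAMPLE.ORG"):
    """Return the shell commands for one authenticated snapshot cycle."""
    return [
        # Obtain a Kerberos ticket from a keytab (non-interactive auth)
        ["kinit", "-kt", "/etc/security/backup.keytab", principal],
        # Create a point-in-time HDFS snapshot of the target directory
        ["hdfs", "dfs", "-createSnapshot", path, snapshot_name],
    ]

cmds = backup_commands("/user/analytix", "daily-2019-08-14")
for cmd in cmds:
    print(" ".join(cmd))
```

The snapshot itself is cheap (copy-on-write); the resulting consistent view can then be shipped to tape.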
HDFS backups (since 2016):
• daily incremental snapshots, sent to tape (CASTOR)

High availability (since 2017):
• automatic master failover for HDFS, YARN and HBase

SWAN – Jupyter Notebooks On Demand
• Service for Web-Based Analysis (SWAN), developed at CERN, initially for physics analysis by EP-SFT
• An interactive platform that combines code, equations, text and visualizations
• Ideal for exploration, reproducibility and collaboration
• Fully integrated with Spark and Hadoop at CERN (2018): Python on Spark (PySpark) at scale
• A modern, powerful and scalable platform for data analysis
• Web-based: no need to install any software
Analytix Platform Outlook
Integrating with existing infrastructure:
• Software: HEP software
• Data: experiments' storage, personal storage, HDFS

Future work
• Spark on Kubernetes (continuation), for workloads not profiting from data locality:
  • ephemeral clusters
  • bring your own cluster
• SQL on top of Big Data: scale-out database
• Service web portal for users (in progress):
  • integrate the multiple components of the service
  • single place for monitoring and for requesting service resources: HDFS quota, CPUs, memory, Kafka topics, etc.
• Explore further the big data technology landscape: Presto, Phoenix, Apache Kudu, Apache Beam, Drill, etc.
Next Gen. CERN Accelerator Logging
• A control system with streaming, an online system, and an API for data extraction
• Critical system for running the LHC: 700 TB today, growing by 200 TB/year
• Challenge: service level for critical production
Credit: Jakub Wozniak, BE-CO-DS

New CERN IT Monitoring infrastructure
Critical for CC operations and WLCG

Data flow (data sources → transport → processing → storage & search → data access):
• Data sources: FTS, Rucio, databases, XRootD, jobs, Lemon metrics (metric GW), syslog and application logs (log GW), others
• Transport: Flume and AMQ producers feeding a Kafka cluster (buffering)
• Processing: data enrichment, data aggregation, batch processing
• Storage & search: Flume sinks into HDFS, Elasticsearch (HTTP feed), others (InfluxDB)
• Access: CLI, API
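The enrichment and aggregation stages can be illustrated with a toy sketch. The event records and field names below are hypothetical; the production pipeline implements these steps on Flume, Kafka and batch jobs:

```python
from collections import defaultdict

# Hypothetical raw monitoring events (the real sources are FTS, XRootD, etc.)
events = [
    {"source": "xrootd", "site": "CERN", "bytes": 1024},
    {"source": "fts",    "site": "CERN", "bytes": 2048},
    {"source": "xrootd", "site": "BNL",  "bytes": 512},
]

def enrich(event):
    # Data enrichment: attach derived metadata to each raw event.
    event = dict(event)
    event["kb"] = event["bytes"] / 1024
    return event

def aggregate(stream, key="source"):
    # Data aggregation: total traffic per key across the stream.
    totals = defaultdict(float)
    for e in stream:
        totals[e[key]] += e["kb"]
    return dict(totals)

print(aggregate(map(enrich, events)))  # {'xrootd': 1.5, 'fts': 2.0}
```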
• Data now: 200 GB/day, 200 M events/day
• At scale: 500 GB/day
• Proved to be effective on several occasions

Credits: Alberto Aimar, IT-CM-MM

The ATLAS EventIndex
• Catalogue of all collisions in the ATLAS detector
• Over 120 billion records, 150 TB of data
• Current ingestion rates: 5 kHz, 60 TB/year
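As a back-of-envelope check on these figures (pure arithmetic, assuming a 365-day year of continuous ingestion):

```python
# Sanity check of the EventIndex numbers quoted above.
records = 120e9          # total records in the catalogue
total_bytes = 150e12     # 150 TB of data
avg_record = total_bytes / records
print(round(avg_record))  # 1250 -> about 1.25 kB per record

rate_hz = 5e3            # 5 kHz ingestion rate
year_s = 365 * 24 * 3600
per_event = 60e12 / (rate_hz * year_s)  # 60 TB/year over the ingested events
print(round(per_event))  # 381 -> a few hundred bytes per ingested event
```

The two per-record sizes need not match exactly: the stored total includes indexing overhead accumulated over years, while the ingestion figure is the current raw rate.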
[Architecture sketch: Grid jobs on the WLCG extract event metadata into data files; at CERN the metadata is enriched and stored in an object store and in Hadoop (MapFiles + HBase), with a table in an RDBMS; web UIs provide data extraction and analytics.]

CMS Data Reduction Facility
• R&D: CMS Big Data project, CERN openlab, Intel
  • Reduce time-to-physics for PB-sized datasets
  • Explore a possible new way to do HEP analysis
  • Improve computing resource utilization
  • Enable physicists to use tools and methods from the "Big Data" and open-source communities
• CMS Data Reduction Facility:
  • Goal: produce reduced data n-tuples for analysis in a more agile way than current methods
  • Currently testing: scaling up with larger data sets; the first prototype was successful but used only 1 TB
Data Processing: CMS Open Data Example

Let's calculate the invariant mass of a di-muon system:
• Transform the collection of muons into an invariant mass for each Row (Event).
• Aggregate (histogram) over the entire dataset.

    # read in the data
    df = sqlContext.read \
        .format("org.dianahep.sparkroot.experimental") \
        .load("hdfs:/path/to/files/*.root")

    # count the number of rows
    df.count()

    # select only muons
    muons = df.select(
        "patMuons_slimmedMuons__RECO_.patMuons_slimmedMuons__RECO_obj.m_state"
    ).toDF("muons")

    # map each event to an invariant mass
    inv_masses = muons.rdd.map(toInvMass)

    # use histogrammar to perform aggregations
    empty = histogrammar.Bin(200, 0, 200, lambda row: row.mass)
    h_inv_masses = inv_masses.aggregate(empty,
                                        histogrammar.increment,
                                        histogrammar.combine)
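The mapper `toInvMass` is not shown on the slide. A minimal stand-alone sketch of the underlying calculation, in plain Python with hypothetical (E, px, py, pz) four-vector inputs rather than the CMS `m_state` objects, is:

```python
import math

def invariant_mass(p4a, p4b):
    """Invariant mass of a two-particle system from (E, px, py, pz)
    four-vectors: m^2 = (E1+E2)^2 - |p1+p2|^2 (natural units)."""
    E  = p4a[0] + p4b[0]
    px = p4a[1] + p4b[1]
    py = p4a[2] + p4b[2]
    pz = p4a[3] + p4b[3]
    # max(..., 0.0) guards against tiny negative values from rounding
    return math.sqrt(max(E * E - (px * px + py * py + pz * pz), 0.0))

# Two back-to-back 45 GeV muons (muon mass negligible): m = 90 GeV, a Z-like peak.
m = invariant_mass((45.0, 0.0, 0.0, 45.0), (45.0, 0.0, 0.0, -45.0))
print(round(m, 1))  # 90.0
```

In the real job this function is applied per event by `muons.rdd.map(...)`, with the four-vectors taken from the selected muon collection.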
Credits: V. Khristenko, J. Pivarski, diana-hep and the CMS Big Data project

Machine Learning
Data and models from physics (AUC = 0.9867)

Pipeline: input (labeled data and DL models) → feature engineering at scale → hyperparameter optimization (random/grid search) → distributed model training → output (particle-selector model)
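The hyperparameter-optimization step can be illustrated with a tiny grid search. This is plain Python with a made-up quadratic score standing in for training and validating a model; in the pipeline above each evaluation would be a distributed training run:

```python
import itertools

def score(lr, depth):
    # Hypothetical "validation accuracy", peaked at lr=0.1, depth=6;
    # a stand-in for training + validating a model with these hyperparameters.
    return 1.0 - ((lr - 0.1) ** 2 + 0.001 * (depth - 6) ** 2)

# Grid search: evaluate every combination and keep the best one.
grid = itertools.product([0.01, 0.1, 1.0], [2, 6, 10])
best = max(grid, key=lambda p: score(*p))
print(best)  # (0.1, 6)
```

Random search works the same way but samples the grid instead of enumerating it, which scales better when most hyperparameters matter little.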
Machine Learning Pipelines with Apache Spark and Intel BigDL:
https://indico.cern.ch/event/755842/contributions/3293911/attachments/1784423/2904604/posterMatteo.pdf

Medical Applications Resources
• Proposal in 2018 to inject resources into SAS dedicated to MA projects
• The proposal was discussed and approved: 80K CHF were allocated in 2018
• Agreement to "hook" the funds onto standard hardware procurement in IT to exploit special pricing conditions
• Initial purchase of CPUs and disks for Analytix (Jan 2019):
  • 4 nodes with standard CPUs (same as for physics analysis)
  • 512 GB RAM per node = 2 TB RAM
  • 1 TB SSD per node = 4 TB storage
• Pooled as part of the existing cluster, with the possibility of "spilling over" for limited-time activities
• How to access:
  • a "project request" mechanism is being set up
  • for now, just contact me or the IT-DB-SAS lead
• Planning for additional resources:
  • increasing requests for GPUs/FPGAs for Deep Learning training
  • if there is interest from MA and funds can be allocated in 2019, a purchase can be attached to the September procurement round and be available at the beginning of 2020