Big Data for Big Discoveries: How the LHC Looks for Needles by Burning Haystacks

Alberto Di Meglio, CERN openlab Head

DOI: 10.5281/zenodo.45449, CC-BY-SA, images courtesy of CERN

The Large Hadron Collider (LHC): Burning Haystacks

The detectors: 7,000-ton “microscopes” with 150 million sensors. Data is generated 40 million times per second: petabytes per second!

Low-level filters and triggers: 100,000 selections per second, terabytes per second!

High-level filters (HLT): 100 selections per second, gigabytes per second!
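This cascade is, in essence, a chain of ever-stricter event filters: each stage applies a cheap test to many events so that the next, more expensive stage only sees the survivors. A minimal sketch of the idea in Python (illustrative only: the Event fields, thresholds, and simulated data are invented, and this is not CERN trigger code):

```python
# Toy two-stage trigger cascade (illustrative; not CERN trigger code).
from dataclasses import dataclass
import random

@dataclass
class Event:
    total_energy: float  # GeV; assumed summed calorimeter energy
    n_tracks: int        # assumed reconstructed track count

def low_level_trigger(ev: Event) -> bool:
    """Cheap hardware-style cut: keep only energetic crossings."""
    return ev.total_energy > 100.0

def high_level_trigger(ev: Event) -> bool:
    """Costlier software cut, applied only to low-level survivors."""
    return ev.n_tracks >= 4 and ev.total_energy > 250.0

random.seed(0)
events   = [Event(random.expovariate(1 / 50), random.randint(0, 8))
            for _ in range(1_000_000)]
after_l1 = [ev for ev in events if low_level_trigger(ev)]
stored   = [ev for ev in after_l1 if high_level_trigger(ev)]

# Counts drop steeply at each stage, mirroring the PB/s -> TB/s -> GB/s funnel.
print(len(events), len(after_l1), len(stored))
```

The real system works the same way in spirit: the low-level stage runs in custom hardware at the full collision rate, and the HLT runs reconstruction software on the small fraction of events that remains.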

Storage, Reconstruction, Simulation, Distribution

6 GB/s of data flows into storage, reconstruction, simulation and distribution.

Worldwide LHC Computing Grid

Tier-0 (CERN):
• Data recording
• Initial data reconstruction
• Data distribution

Tier-1 (12 centres):
• Permanent storage
• Re-processing
• Analysis

Tier-2 (68 federations, ~140 centres):
• Simulation
• End-user analysis

In total: 525,000 cores, 450 PB of storage.
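To make the division of labour concrete, here is a toy model of tiered data placement (illustrative only: the site names and routing policy are invented, and real WLCG data management is far more sophisticated):

```python
# Toy model of WLCG tiered data placement (policy and names invented).
import random

TIERS = {
    "Tier-0": ["CERN"],                             # recording, first reconstruction
    "Tier-1": [f"T1-{i:02d}" for i in range(12)],   # custodial storage, re-processing
    "Tier-2": [f"T2-{i:03d}" for i in range(140)],  # simulation, end-user analysis
}

def place_dataset(name: str) -> dict:
    """Keep raw data at CERN plus one custodial Tier-1 copy; fan derived
    data out to a few Tier-2 sites for analysis."""
    return {
        "dataset": name,
        "raw":     ["CERN", random.choice(TIERS["Tier-1"])],
        "derived": random.sample(TIERS["Tier-2"], k=3),
    }

print(place_dataset("physics_main/2016A"))
```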

1 PB/s of data generated by the detectors

Up to 30 PB/year of stored data

A distributed computing infrastructure of half a million cores working 24/7

An average of 40M jobs/month

A continuous data transfer rate of 4-6 GB/s across the Worldwide LHC Computing Grid (WLCG)

The Higgs boson completes the Standard Model, but the Model explains only about 5% of our Universe

What is the other 95% of the Universe made of? How does gravity really work? Why is there no antimatter in nature? Exotic particles? Dark matter? Extra dimensions?

LHC Schedule

• 2009: LHC startup (900 GeV)
• 2010-2012: First run: 7 TeV, L = 6×10³³ cm⁻²s⁻¹, 50 ns bunch spacing
• 2013-2014: LS1, Phase-0 upgrade (design energy, nominal luminosity)
• 2015-2018: Second run: 14 TeV, L = 1×10³⁴ cm⁻²s⁻¹, 25 ns bunch spacing
• 2019-2020: LS2, Phase-1 upgrade (design energy, design luminosity)
• 2021-2023: Third run: 14 TeV, L = 2×10³⁴ cm⁻²s⁻¹, 25 ns bunch spacing
• 2024 onwards: LS3, Phase-2 upgrade (High Luminosity), leading to HL-LHC: 14 TeV, L = 1×10³⁵ cm⁻²s⁻¹, 12.5 ns bunch spacing
• … 2030? FCC?

50 times more data than today in the next 10 years: 50 PB/s out of the detectors, 5 PB/day to be stored.
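A back-of-envelope calculation shows what these luminosity figures mean per bunch crossing. The interaction rate is R = L·σ; taking an approximate inelastic pp cross-section of ~80 mb (an assumed round number) and pretending every bunch slot is filled:

```python
# Rough pileup estimate: mean interactions per bunch crossing.
# Assumes sigma_inel ~ 80 mb and a fully filled machine (both simplifications).
SIGMA_INEL_CM2 = 8e-26  # ~80 mb in cm^2

def pileup(lumi_cm2_s: float, bunch_spacing_ns: float) -> float:
    crossings_per_s = 1e9 / bunch_spacing_ns
    return lumi_cm2_s * SIGMA_INEL_CM2 / crossings_per_s

for label, lumi, spacing_ns in [
    ("L = 1e34, 25 ns",   1e34, 25.0),
    ("L = 2e34, 25 ns",   2e34, 25.0),
    ("L = 1e35, 12.5 ns", 1e35, 12.5),
]:
    print(f"{label}: ~{pileup(lumi, spacing_ns):.0f} interactions/crossing")
```

Even this crude estimate gives tens of overlapping interactions per crossing at design luminosity and roughly a hundred at HL-LHC, which is why the data volumes grow so steeply.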

Information Technology Research Areas

Data acquisition and filtering

Computing platforms, data analysis, simulation

Data storage and long-term data preservation

Compute provisioning (cloud)

Networks

Data analytics

How Can We Address the Challenges?

Flat Budget

Education

Technology

Collaborations

Technology Evolution

Computing:
• More performance, less cost
• Redesign of DAQ: less custom hardware; more off-the-shelf CPUs, fast interconnects, optimized software
• Redesign of software to exploit modern CPU/GPU platforms (vectorization, parallelization); see the sketch below
• New architectures (e.g. automata)
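As an illustration of the vectorization point, compare a per-particle Python loop with a whole-array operation (a generic NumPy sketch, not code from any LHC package):

```python
# Vectorization sketch: transverse momentum pT = sqrt(px^2 + py^2).
import numpy as np

rng = np.random.default_rng(0)
px = rng.normal(0.0, 10.0, 1_000_000)  # invented particle momenta, GeV
py = rng.normal(0.0, 10.0, 1_000_000)

def pt_loop(px, py):
    """Scalar style: one particle at a time (slow in pure Python)."""
    return [(x * x + y * y) ** 0.5 for x, y in zip(px, py)]

def pt_vectorized(px, py):
    """Array style: one call over the whole dataset, letting the
    library use optimized, SIMD-friendly kernels."""
    return np.hypot(px, py)

assert np.allclose(pt_loop(px[:10], py[:10]), pt_vectorized(px[:10], py[:10]))
```

The same pattern, expressed with SIMD intrinsics or GPU kernels in C++, is what the bullet about redesigning software for modern CPU/GPU platforms refers to.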

Technology Evolution

Storage:
• Reduce data to be stored: move computation from offline to online
• More tape, fewer disks: dynamic data placement (see the sketch below)
• New media: object disks, shingled drives
• Persistent metadata, in-memory operations: Flash/SSD, NVRAM, NVDIMM
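One way to read the dynamic-placement bullet: keep frequently accessed datasets on fast media and migrate cold ones to tape. A minimal sketch (the thresholds and catalog are invented; real placement decisions involve many more factors):

```python
# Popularity-driven placement sketch (thresholds and data invented).
def choose_medium(accesses_90d: int, size_tb: float) -> str:
    """Map a dataset to a storage medium by recent access count."""
    if accesses_90d == 0:
        return "tape"                          # cold: archival copy only
    if accesses_90d > 100 and size_tb < 10:
        return "ssd"                           # hot and small: fastest medium
    return "disk"                              # warm default

catalog = {
    "raw/run2015":  (0,   5000.0),
    "aod/run2015":  (250, 8.0),
    "user/ntuples": (40,  1.5),
}
for name, (hits, size) in catalog.items():
    print(f"{name}: {choose_medium(hits, size)}")
```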

Technology Evolution

Infrastructure:
• Economies of scale: large-scale, joint procurement of services and equipment
• Hybrid infrastructure models: allow commercial resource procurement
• Large-scale cloud procurement models are not yet clear

Technology Evolution

Data Analytics:
• Reduce analysis cost, decrease time
• More efficient operations of the LHC systems: log analysis, proactive maintenance (see the sketch below)
• Infrastructure monitoring and optimization
• Data reduction, event filtering and identification on large data sets (~10 PB)
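For the log-analysis and proactive-maintenance bullet, a minimal sketch of the idea: flag time bins whose error count deviates strongly from the recent baseline (the data and the simple z-score rule are invented for illustration):

```python
# Toy proactive-maintenance analytics: flag hours with anomalous error counts.
import statistics

hourly_errors = [3, 5, 4, 6, 2, 4, 5, 3, 4, 47, 5, 4]  # invented sample data

def anomalies(series, window=8, threshold=3.0):
    """Yield (index, value) where the value sits more than `threshold`
    standard deviations above the mean of the preceding window."""
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu, sigma = statistics.mean(past), statistics.stdev(past)
        if sigma > 0 and (series[i] - mu) / sigma > threshold:
            yield i, series[i]

print(list(anomalies(hourly_errors)))  # -> [(9, 47)]
```

In production one would use streaming pipelines and richer models, but the principle, learning a baseline and alerting on deviations, is the same.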

Collaborations

HEP community: laboratories, academia, research

Industrial Collaboration Frameworks

Joint Infrastructure Initiatives

New educational requirements

• Software & Computing Engineers: multicore CPU programming, graphics processors (GPUs), multithreaded software
• Data Scientists: data analysis technologies, data storage, visualization, monitoring, security, etc.
• Multidisciplinary applications: applications of physics to other domains (hadron therapy, etc.), knowledge transfer

Take-Away Messages

• The LHC is a long-term project: we need to adapt and evolve over time
• Look for cost-effective solutions: continuous technology tracking; aim for “just as good as necessary”, based on concrete scientific objectives
• Collaboration and education: people are the most important asset in making this endeavour sustainable over time

EXECUTIVE CONTACT
Alberto Di Meglio, CERN openlab Head
[email protected]

TECHNICAL CONTACT
Maria Girone, CERN openlab CTO
[email protected]
Fons Rademakers, CERN openlab CRO
[email protected]

COMMUNICATION CONTACT
Andrew Purcell, CERN openlab Communication Officer
[email protected]
Mélissa Gaillard, CERN IT Communication Officer
[email protected]

ADMIN CONTACT
Kristina Gunne, CERN openlab Administration Officer
[email protected]

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. It includes photos, models and videos courtesy of CERN and uses content provided by CERN and CERN openlab.
