
The Scientific Data and Computing Center at BNL
Benedikt Hegner, CERN SFT Group Meeting, Nov 2 2019

Outline

● Brookhaven National Laboratory
● Particle physics at BNL
● Scientific Data and Computing Center

[Map: BNL on Long Island, ~100 km from New York City]

Brookhaven National Laboratory (BNL)

Founded in 1947, it is the oldest US national lab not rooted in the Manhattan Project

● First a facility for finding peaceful uses for nuclear power; in '48 it was quickly decided to build the first proton synchrotron worldwide - the Cosmotron reached its full design energy of 3.3 GeV in 1953
● Since then one of the leading HEP and NP labs in the US, with the Alternating Gradient Synchrotron (AGS) and the Relativistic Heavy Ion Collider (RHIC)
● Based on the accelerator experience, two light sources (NSLS and NSLS-II)
● BNL nowadays is a multi-disciplinary lab covering plenty of sciences
[Image: High Flux Beam Reactor]

BNL in numbers

● 21 square kilometers
● 2750 staff
● 4000 guest scientists annually
● > 600M US$ annual budget
● Plenty of facilities
○ RHIC
○ National Synchrotron Light Source II
○ Center for Functional Nanomaterials
○ NASA Space Radiation Laboratory
○ Scientific Data and Computing Center
○ Long Island Solar Farm
○ Accelerator Test Facility
○ Linac Isotope Producer
○ Tandem Van de Graaff Facility
● Coordinating US ATLAS contributions

Nobel Prizes connected to BNL in Physics

● 1957 – Chen Ning Yang and Tsung-Dao Lee – parity laws, based on Cosmotron measurements in '56
● 1976 – Samuel C. C. Ting – J/Psi particle at the AGS, published in '74
● 1980 – James Cronin and Val Fitch – CP violation with the AGS, published in '64
● 1988 – Leon M. Lederman, Melvin Schwartz, and Jack Steinberger – muon neutrino with the AGS, published in '62
● 2002 – Raymond Davis, Jr. – Solar neutrino detection

Nobel Prizes connected to BNL in Chemistry

● 2003 – Roderick MacKinnon – Ion channel studies with measurements at the NSLS
● 2009 – Venkatraman Ramakrishnan and Thomas A. Steitz – Ribosome function with measurements at the NSLS

The First Video Game - Tennis for Two

Prepared for an outreach event in 1958 on an analog computer

BNL's Relativistic Heavy-Ion Collider (RHIC)

Started in 2000

200 - 500 GeV collision energy, depending on the kind of beam (p or heavy ion)

STAR and PHENIX as major experiments

Scientific Data and Computing Center (SDCC)*

● Part of the NPP directorate and the Computational Science Initiative (CSI)
● About 40 FTE
● Currently moving from an NPP-focused facility to serving all science communities at BNL
● Extending to supporting more and more off-site efforts
● New, bigger data center in the works

[* formerly known as RHIC and ATLAS Computing Facility - RACF]

Provided Computing Resources

High-Throughput Computing

● ~ 2000 nodes w/ 65000 logical cores
● ~ 970 kHSpec
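For illustration, a minimal sketch of how a user job typically enters such an HTC pool through the HTCondor Python bindings (HTCondor is the batch system mentioned later for JupyterHub); the executable and resource requests are hypothetical placeholders, not SDCC defaults.

```python
# Minimal sketch: submitting one job to an HTCondor pool via the Python bindings.
# Executable and resource requests below are placeholders, not SDCC defaults.
import htcondor

job = htcondor.Submit({
    "executable": "/bin/sleep",       # stand-in for a real analysis payload
    "arguments": "60",
    "request_cpus": "1",
    "request_memory": "2GB",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

schedd = htcondor.Schedd()            # talk to the local schedd
result = schedd.submit(job, count=1)  # newer bindings; older ones use a transaction + queue()
print("Submitted cluster", result.cluster())
```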

High-Performance Computing

● "Institutional Cluster"
○ 216 × dual Intel Xeon E5-2695v4 CPUs @ 2.1 GHz - 256 GB DDR4-2400 MHz RAM
○ Dual Tesla K80 GPUs in half of the systems - dual P100 GPUs in the other half
○ EDR InfiniBand
● KNL cluster
○ 142 × Intel Xeon Phi 7230 CPUs @ 1.3 GHz - 192 GB DDR4-1200 MHz RAM
○ EDR InfiniBand
● ML Cluster
○ 5 × dual Intel Xeon Gold 6248 CPUs @ 2.5 GHz - 768 GB DDR4-2933 MHz RAM
○ EDR InfiniBand
○ 8 V100 GPUs per node
○ ~4 kW per node

Provided Storage

Central Disk Storage

● GPFS-based 14 PB with > 1 billion files for HEP & NP
● NSLS-II with an additional 1 PB of storage in GPFS
● Based on license cost changes, considering migration to Lustre

dCache/XROOTD

● dCache with > 50 PB for ATLAS, Belle II, PHENIX and the Simons Foundation
● XROOTD with ~ 11 PB total for STAR
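Since STAR data is served via XROOTD, here is a hedged sketch of how an end user might list a directory with the XRootD Python bindings; the redirector hostname and path are made-up placeholders, not actual SDCC endpoints.

```python
# Minimal sketch: listing a directory on an XROOTD-served namespace.
# The redirector URL and path are hypothetical placeholders.
from XRootD import client
from XRootD.client.flags import DirListFlags

fs = client.FileSystem("root://xrootd.example.org")           # made-up redirector
status, listing = fs.dirlist("/star/data", DirListFlags.STAT)

if not status.ok:
    raise RuntimeError(status.message)

for entry in listing:
    size = entry.statinfo.size if entry.statinfo else "?"
    print(entry.name, size)
```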

Tape storage

● ~165 PB total data on tape managed by HPSS

[Chart: Tape statistics (from Oct 04 - Oct 19)]

SDCC activities

Serving host-lab experiments at RHIC as Tier-0

● STAR data taking
● PHENIX analysis and sPHENIX preparations for 2023
● Simulation studies for the Electron-Ion Collider (EIC)
○ Decision on site (BNL or Jefferson Lab) still pending
○ Start planned for late 2020s

Serving remote experiments

● Tier 1 and Tier 3 for ATLAS at CERN
● (sole) Tier 1 for Belle II at KEK
● Contributions to DUNE computing

What did I do over there?

From September 2018 to October 2019 I was on a sabbatical at Brookhaven, serving as

● Deputy Director of the Scientific Data and Computing Center
● US-Belle II Computing Coordinator and Belle II Conditions DB Convener

And enjoyed life in busy New York and calm Long Island :-)

Feature article about Belle II activities at BNL: https://www.bnl.gov/newsroom/news.php?a=214486

Belle II

Belle II is a b-physics experiment located at KEK in Japan

● Data taking started in 2018
● Still in ramp-up phase
● Computing model drastically more distributed than at the LHC
○ Central services like Data Management and Conditions DB hosted off-site at BNL
○ Collaborative tools and user management served by DESY

If there is interest I can talk a bit more about Belle II in a later presentation

SDCC Vision for the Future

Based on the experience as host-lab and with providing services to off-site experiments, SDCC aims to become a Remote Host Lab/Superfacility for future Physics Experiments, providing

● CPU and storage resources
● (Quasi-) online streaming
● (Interactive) analysis facilities
● Collaborative tools

Possible future projects are LSST, DUNE or other smaller scale projects

Collaborative Tools

Last year the SDCC increased its emphasis on collaborative tools, providing and developing:

Invenio, Indico, BNLBox, Gitea, Mattermost, remote authentication, JupyterLab

Rucio as data management tool for NSLS-II
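As a rough illustration of what Rucio-based bookkeeping could look like for a light-source community, here is a sketch using the Rucio Python client; the scope, dataset name, and storage element are invented placeholders, and the actual NSLS-II workflow may well differ.

```python
# Minimal sketch of Rucio bookkeeping: create a dataset and request a replica.
# Scope, dataset name and RSE are invented placeholders, not NSLS-II conventions.
from rucio.client import Client

rucio = Client()

scope, dataset = "nsls2.demo", "beamline_run_0001"   # hypothetical names
rucio.add_dataset(scope=scope, name=dataset)

# Ask Rucio to keep one copy of the dataset at a (made-up) storage element.
rucio.add_replication_rule(
    dids=[{"scope": scope, "name": dataset}],
    copies=1,
    rse_expression="BNL_DEMO_DISK",
)
```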

Community does not yet have a notion of big data and big-data management

Feature article about Collaborative Tools efforts: https://www.bnl.gov/newsroom/news.php?a=216638

JupyterHub I/II

Providing JupyterHub instances for several of the supported experiments - both on HTC/HTCondor and HPC/Slurm resources

HTCondor batch-spawned instance online for NSLS-II

● Most light-source analysis uses a Python-based library developed at BNL
● Link-based notebook sharing/copying between users is functional
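A minimal jupyterhub_config.py sketch for such a batch-spawned setup, assuming the community batchspawner package; the resource requests are placeholder values rather than the actual SDCC configuration.

```python
# jupyterhub_config.py -- minimal sketch of batch-spawning single-user servers.
# Resource requests are placeholders; the real SDCC setup is more involved.
c = get_config()  # noqa: F821 - injected by JupyterHub when loading this file

# Run each user's notebook server as an HTCondor job via batchspawner
c.JupyterHub.spawner_class = "batchspawner.CondorSpawner"
c.CondorSpawner.req_memory = "4GB"
c.CondorSpawner.req_nprocs = "2"

# The HPC/Slurm-backed instances would instead use
# c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"
```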

Question of kernel management still under discussion at BNL

● Maintenance by SDCC or by the experiments/groups?
● Base environment provided, plus custom additions?
○ Could CVMFS play a role here?

JupyterHub II/II

Example for nanomaterial analysis

CVMFS at BNL

Based on the various user communities, BNL provides

● A Stratum One with 26 TB of data in 81 replicated repositories
● A Stratum Zero for local experiments and groups
○ 12 local repositories for local experiments and groups
○ Some of the local repository content has now been re-published to OSG via sphenix.opensciencegrid.org
● Service running extremely well, and as expected the Stratum Zero sees almost no direct load from remote
● Other (soon to be) data-intense communities at BNL will hopefully take advantage of CVMFS too
○ Part of an effort to exchange knowledge between science domains
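For colleagues less familiar with CVMFS, the Stratum Zero publishing cycle essentially looks like the sketch below (wrapped in Python only for consistency with the other examples); the repository name is taken from the slide, the payload path is a placeholder, and this assumes shell access to the release-manager machine.

```python
# Sketch of a CVMFS Stratum Zero publish cycle, run on the release-manager node.
# Repository name from the slide; the payload path is a made-up placeholder.
import shutil
import subprocess

REPO = "sphenix.opensciencegrid.org"

subprocess.run(["cvmfs_server", "transaction", REPO], check=True)     # open a transaction
try:
    # stage new content into the repository's writable union mount
    shutil.copytree("/tmp/new_release", f"/cvmfs/{REPO}/releases/new_release")
    subprocess.run(["cvmfs_server", "publish", REPO], check=True)     # sign and publish
except Exception:
    subprocess.run(["cvmfs_server", "abort", "-f", REPO], check=True) # roll back on error
    raise
```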

Interaction between CERN and BNL - Invenio

Invenio V3 based custom applications for two different scientific communities

National Nuclear Security Administration (NNSA)

● SET, Smuggling Detection and Deterrence Science and Engineering Team (btw - one of the reasons why Cyber Security is so big at BNL)

Materials Science community

● GENESIS (Next-Generation Synthesis Center)

Expect sPHENIX to utilize Invenio as well

Invenio V3 Research Data Management (RDM)

SDCC is working to build a research data management platform called InvenioRDM along with CERN and ten other multidisciplinary and commercial institutions

Single Sign-On and Federated Access

● The most basic functionality needed to serve as a remote host lab
● The most difficult one to set up
○ Rules, regulations, trust relationships, technical implementations, cyber security
○ Every app behaves differently
○ Kerberos/Shibboleth increasingly difficult
■ Replaced consistently with IPA
● Deployed Keycloak as an SSO/federated access solution (see the sketch below)
○ FreeOTP MFA AuthN for interactive apps and services
○ Allows use of Active Directory accounts, IPA accounts, and federated ID via CILogon
● Pioneering interaction with InCommon / CILogon for the science community in the US
● A topic of its own, being relevant for CERN soon
○ Removal of Microsoft products and Active Directory will require some rethinking
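To make the Keycloak piece concrete, a hedged sketch of fetching a token from a Keycloak OpenID Connect endpoint with plain Python; hostname, realm, and client ID are invented placeholders, and production use would rather go through an authorization-code or device flow with MFA.

```python
# Minimal sketch: obtaining an OIDC access token from a Keycloak realm.
# Host, realm and client ID are invented placeholders, not SDCC endpoints.
import requests

TOKEN_URL = (
    "https://sso.example.org/auth/realms/demo"   # hypothetical Keycloak realm
    "/protocol/openid-connect/token"             # standard Keycloak OIDC token path
)

def get_access_token(username: str, password: str, client_id: str = "demo-client") -> str:
    """Resource-owner password grant, for illustration only; interactive services
    would use an authorization-code/device flow with OTP as a second factor."""
    resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "password",
            "client_id": client_id,
            "username": username,
            "password": password,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]
```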

More details in a recent HEPiX presentation by Mizuki Karasawa

Other interesting CS activities on Long Island

BNL

● Computational Science Initiative invests heavily in GPU research
○ Dedicated GPU training and hackathon once a year
○ Constant consultation for interested users
● Physics Department has a new NPP Software Group focusing on improving exchange between BNL's physics user groups
○ Sort of BNL's SFT group
○ Led by Torre Wenaus

Stony Brook University (SBU)

● Institute for Advanced Computational Science covers various fields - Ecology, Materials Science, Physics, Linguistics, … - and their data science problems
○ Quite a thrilling atmosphere with plenty of interdisciplinary discussions
○ Since this year I am an adjunct professor at the Institute

Lessons Learned - a random remark

In a computing facility you have a very different view on scientific applications

● You do not care about what it produces, but whether it behaves correctly
● An application is just a state machine, which you can start/stop/kill/pause/… - and inspect for its status (a toy sketch follows below)
○ Rarely does an application behave in a sane way ⇒ that's why containers come in so handy ;-)
○ Checkpointing anyone?
● If an application uses 100 % CPU, you want to know whether
○ It hangs and does crap
○ It is extremely efficient in using resources
● If it crashes, you need proper logs to know whether
○ The application crashed (application problem)
○ The computing node or storage system had a problem (site problem)
○ Or some weird combination of the two resulted in a problem
● Let's try to help our computing colleagues have an easier life…
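As a toy illustration of this "inspectable state machine" view, a few lines with the third-party psutil package are enough to poll a job's state, CPU, and memory from the outside; the PID handling here is purely illustrative.

```python
# Toy sketch: inspecting a running application "state machine" from the outside.
# Uses the third-party psutil package; the PID choice is purely illustrative.
import psutil

def inspect(pid: int) -> None:
    proc = psutil.Process(pid)
    cpu = proc.cpu_percent(interval=1.0)   # 100% alone does not say hung vs. efficient
    rss_mb = proc.memory_info().rss / 1e6
    print(f"pid={pid} state={proc.status()} cpu={cpu:.0f}% rss={rss_mb:.0f} MB")
    # A facility can also drive the state machine directly:
    # proc.suspend(); proc.resume(); proc.terminate()

if __name__ == "__main__":
    inspect(psutil.Process().pid)          # demo: inspect this very script
```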

Comments and Questions?

My street on my last day - Halloween :-)