Collaborative Workflow-Driven Science in a Rapidly Evolving Cyberinfrastructure Ecosystem

2018 Symposium of the Center for Network and Storage Enabled Collaborative – October 16, 2018

İlkay ALTINTAŞ, Ph.D.

UC San Diego

Chief Officer & Division Director of Cyberinfrastructure Research, Education and Development, San Diego Center

Fellow and Chair of Cyberinfrastructure, Halıcıoğlu Data Science Institute

Director, Workflows for Data Science Center of Excellence

İlkay ALTINTAŞ, Ph.D. [email protected] A little about me…

İlkay ALTINTAŞ, Ph.D. [email protected] SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego Providing Cyberinfrastructure for Research and Education • Established as a national supercomputer resource center in 1985 by NSF • A world leader in HPC, data-intensive computing, and scientific data management • Current strategic focus on “”, “versatile computing”, and “life sciences applications”

Recent Innovative Architectures • Gordon: First Flash-based Supercomputer for Data-intensive Apps • Comet: Serving the Long Tail of Science

İlkay ALTINTAŞ, Ph.D. [email protected] Halıcıoğlu Data Science Institute https://datascience.ucsd.edu

An academic unit that provides a home for curating the growth in Data Sciences as a discipline • Deep expertise organized into clusters • Manage new engagements to seed, cultivate and grow the practice of Data Sciences

İlkay ALTINTAŞ, Ph.D. [email protected] Data Science Hub at SDSC -- Expertise, Systems and Training for DSH Data Science Research and Applications -- http://datascience.sdsc.edu • Serving as a community hub for collective and lasting innovation • Leading innovative solutions and

applications

Industry

Training Analytics

• Creating top of the line computing and Applications

data platforms Data Platforms Big • Educating and establishing a modern data science workforce SDSC Expertise and Strengths

• Connecting research initiatives with SDSC Data Science Hub (DSH) since 2015 industry and entrepreneurial ventures A collaborative organization and a community hub for collective lasting innovation in data science research, development and education.

İlkay ALTINTAŞ, Ph.D. [email protected] Common Theme…

“Big” Data, Computational Science, Data Science, Cyberinfrastructure,

and Their Applications

İlkay ALTINTAŞ, Ph.D. [email protected] In most applications, utilization of Big Data often needs to be combined COMPUTING AT with “BIG” DATA Scalable DIVERSE SCALES Computing. Enables dynamic data-driven applications

Computer-Aided Drug Discovery Smart Cities Disaster Resilience and Response

Smart Manufacturing Personalized Precision Medicine Smart Grid and Energy Management

İlkay ALTINTAŞ, Ph.D. [email protected] How do we amplify the value of Big Data?

İlkay ALTINTAŞ, Ph.D. [email protected] How do we find the connections and answer questions that benefit science and society?

“We are drowning in information and starving for knowledge”

– John Naisbitt Source: Megatrends, 1982

İlkay ALTINTAŞ, Ph.D. [email protected] Going from Data to Discovery to Impact

Benefit Y for Amplifying the Science, Value of Data Business, Society or Related to X Education

We need to focus on the problems to solve.

İlkay ALTINTAŞ, Ph.D. [email protected] When MD met sea spray aerosols…

Large scale PCA and clustering GPU enabled molecular dynamics Markov State Model Optimization

Informed Feedback to Experimental Continuous Data ~1M++ Design Access, Integration core and Transformation hours

Collaborators: Rommie Amaro, Kim Prather, Amarnath Gupta, UC San Diego Built-in Scientific Communication and İlkay ALTINTAŞ, Ph.D. Reproducibility [email protected] Assurance Problem solving happens at the AMBER GPU MD Workflow application Minimization Actor integration level... A Kepler Workflow Tool for Reproducible AMBER GPU Molecular Dynamics, Purawat, Ieong, Malmstrom, Chan, Yeung, Walker, Altintas, Amaro.. DOI: 10.1016/j.bpj.2017.04.055

İlkay ALTINTAŞ, Ph.D. [email protected] Application integration requires the expertise of a collaborative team.

• Multi-disciplinary scientific and technological expertise • Integration of many scales of computing • Integration of big and small experimental datasets • Can be historical or real time • Usage of individual or community developed legacy tools • Methods to manage and interpret data • Modeling and simulation • Gateways for visualization and dashboard • Long term active and passive storage needs

İlkay ALTINTAŞ, Ph.D. [email protected] The problems we are solving are … • data-driven • heterogeneous • collaborative • data systems • on demand distributed computing • high speed connectivity

What if the whole world was our computer?

İlkay ALTINTAŞ, Ph.D. [email protected] Create an Ecosystem that Enables Needs and Best Practices

• data-driven • accountable • scalable • reproducible • dynamic • interactive • process-driven • heterogeneous • collaborative • includes many different expertise

İlkay ALTINTAŞ, Ph.D. [email protected] A Typical Collaborative Data Science Ecosystem

İlkay ALTINTAŞ, Ph.D. [email protected] Data-driven problem solving requires:

• Heterogenous systems • Data management • Data-driven methods • Scalable tools for dynamic coordination and resource optimization • Skilled interdisciplinary workforce • Collaborative culture and tools that enable groups to communicate

İlkay ALTINTAŞ, Ph.D. [email protected] Data Engineering Computational Data Science

ACQUIRE PREPARE ANALYZE REPORT ACT STORE PUBLISH … Scale Scale Scale Scale

Continuous Iteration, Integration and Programmability

İlkay ALTINTAŞ, Ph.D. [email protected] Systems integration requirements…

Dynamic composability matters.

Systems are only useful if groups can integrate them into applications.

Tools that enhance teamwork need to be coupled with AI systems.

İlkay ALTINTAŞ, Ph.D. [email protected] ENABLING INTERFACES COLLABORATIONMANAGEMENT e.g., Gateways and other online tools for research and teaching

WORKFLOW MANAGEMENT e.g., Application Integration, Coordination, Optimization, Communication, Reporting COMPOSABLE SERVICES e.g., Model and Data Archives, Deep Learning, Analytics, HPC, Training, Notebooks

REPRODUCIBILITY RESOURCE MANAGEMENT

e.g., Kubernetes Container Cloud SECURITY PRIVACY and SECURITY

COMPOSABLE SYSTEMS DATA LIFECYCLE MANAGEMENT LIFECYCLE DATA e.g., GPU, CPU, Big Data, Neuromorphic, Networks, Storage, …

İlkay ALTINTAŞ, Ph.D. [email protected] ENABLING INTERFACES COLLABORATIONMANAGEMENT e.g., Gateways and other online tools for research and teaching

WORKFLOW MANAGEMENT e.g., Application Integration, Coordination, Optimization, Communication, Reporting COMPOSABLE SERVICES e.g., Model and Data Archives, Deep Learning, Analytics, HPC,

Training, Notebooks SECURITY

REPRODUCIBILITY RESOURCE MANAGEMENT e.g., Kubernetes Container Cloud

COMPOSABLE SYSTEMS DATA LIFECYCLE MANAGEMENT LIFECYCLE DATA e.g., GPU, CPU, Big Data, Neuromorphic, Networks, Storage, …

İlkay ALTINTAŞ, Ph.D. [email protected] Smart Composability from Systems to Services • Composable Systems • dynamically measured and resourced • used as a resource based on need and availability COLLABORATION MANAGEMENT • Resource Management • Kubernetes integration WORKFLOW MANAGEMENT • mapping tools for resource identification per task e.g., Application Integration, Coordination, • Composable Services Optimization, Communication, Reporting • runs on composable systems (e.g., containers) • exposes a parametric interface for integration COMPOSABLE SERVICES • continuously measured and profiled e.g., Model and Data Archives, Deep Learning, Analytics, HPC, Training, Notebooks • Workflow Management • focused on coordination and resource optimization RESOURCE MANAGEMENT • requires a number of AI-based tools and data processing e.g., Kubernetes Container Cloud to function • Collaboration Management COMPOSABLE SYSTEMS • focused on expertise integration and goal setting e.g., GPU, CPU, Big Data, Neuromorphic, • applies methodologies for effective communication Networks, Storage, … • provides tools measure and validate team activity

İlkay ALTINTAŞ, Ph.D. [email protected] The Rest of It

ENABLING INTERFACES e.g., Gateways and other online tools for research and teaching

• Enabling Interfaces • front end to communicate and explore with data products • examples are SUAVE, maps, specialized dashboards, etc. • Security and Privacy

• compliance through Sherlock REPRODUCIBILITY

SECURITY and and PRIVACY SECURITY • Blockchain research and integration

• Reproducibility DATA LIFECYCLE MANAGEMENT DATA • containerization and repeatability of work • Data Lifecycle Management • curation and long term storage of research datasets (FAIR data principles) • active data management

İlkay ALTINTAŞ, Ph.D. [email protected] Dynamic composition requires dynamic network, compute, storage and resource combined with intelligent software for steering applications.

İlkay ALTINTAŞ, Ph.D. [email protected] Collaborative Workflow Process for Management using Composable Practice of Data Systems and Services Science

• PPoDS: Methodology and tools for collaboration • Measurable collaboration management • Web based metric setting and testing for composable units • Ability to design parametric workflows that run on multiple PODs

• SmartFlows: Performance measurement and application integration tools for composable units • Web-based container integration interface and deployment of scientific workflows on customizable infrastructure • Notebook Workflow Orchestration • PODBank for Kubernetes based container management of large scale clusters • Select combination of memory, CPUs, GPUs, Disk Space and deploy remotely • AI-based investigation of performance effectivity and prediction on a variety of resources • For applications on data, compute and network infrastructure • Instrumentation of workflows and infrastructure components for performance data collection, correlation, and aggregation • Workflow performance monitoring, prediction and optimization tools

• Tested on existing projects and local infrastructure as experimental tools

İlkay ALTINTAŞ, Ph.D. 25 [email protected] Unit Testing for Composable Blocks using the PPODS Approach Treat Each Step in the Solution Process as a Conceptual Pod • Each step in a data pipelines is a separate pod • Define success metrics for calling Pod  sub-process each pod done • Pods can be atomic or hierarchical Defined by: • Purpose and goal • Stakeholders Metrics for accountability should be built into • Expectations the process. • Key questions to be answered, production/consumption relationships, needs, Cost Timeline dependencies, limits, … • Contracts • Performance, economic, accuracy, policy, privacy, Planning of deliverables reproducibility, political, … • Knowns Purpose Expectations • Known unknowns

İlkay ALTINTAŞ, Ph.D. [email protected] Linking these to networked science…

İlkay ALTINTAŞ, Ph.D. [email protected] Utilizing “Advanced Cyberinfrastructure”

Compute + Storage + Network

İlkay ALTINTAŞ, Ph.D. [email protected] The Pacific Research Platform Creates a Regional End-to-End Science-Driven “Big Data Superhighway” System NSF CC*DNI Grant $5M 10/2015-10/2020

PI: Larry Smarr, UC San Diego Calit2 Co-PIs: • Camille Crittenden, UC Berkeley CITRIS, • Tom DeFanti, UC San Diego Calit2, • Philip Papadopoulos, UCSD SDSC, • Frank Wuerthwein, UCSD Physics and SDSC Letters of Commitment from: • 50 Researchers from 15 Campuses • 32 IT/Network Organization Leaders

Slide Source: Larry Smarr, UCSD Disk-to-Disk 10-100 Gbps

İlkay ALTINTAŞ, Ph.D. [email protected] Running Kubernetes/Rook/Ceph On PRP Allows Deployment of Distributed PB+ of Storage for Posting Science Data.

UCLA USC Caltech UCR 40G 160TB 40G 160TB 100G NVMe 6.4T 40G 160TB UCSB UCI 40G 160TB Five Racked FIONAs at Calit2 FIONA8 Kubernetes • Each Contains: FIONA8 • Dual 12-Core CPUs FIONA8 Centos7 UCSC Can be integrated with any • 96GB RAM SDSC 40G 160TB • 1TB SSD 100G NVMe 6.4T • 2 10GbE interfaces 100G Gold FIONA8 cloud, data lake or other • Total ~$10,500 Rook/Ceph - Block/Object/FS FIONA8 100G Gold NVMe FIONA8 • With 8 GPUs storageSwift for API compatiblefurther scalability.with • total ~$18,500 100G Epyc NVMe SDSC, AWS, and Rackspace Stanford Calit2 40G 160TB Phil Papadopoulos, SDSC & sdx-controller controller-0 Tom DeFanti, Joe Keefe & FIONA8 UCAR John Graham, Calit2 FIONA8 SDSU Hawaii FIONA8FIONA8 40G 160TB FIONA8 100G NVMe 6.4T 40G 160TB FIONA8 40G 160TB March 2018 John Graham, UCSD Source: Larry Smarr, Calit2

İlkay ALTINTAŞ, Ph.D. [email protected] NSF CHASE-CI Grant Creates a Community Cyberinfrastructure Adding a Layer Built on Top of the Pacific Research Platform MSU

UCB UCM Stanford UCSC

Caltech UCI UCR UCSD SDSU Slide Source: Larry Smarr, UCSD

NSF Grant for High Speed “Cloud” of 256 GPUs For 30 ML Faculty & Their Students at 10 Campuses for Training AI Algorithms on Big Data

İlkay ALTINTAŞ, Ph.D. [email protected] Next Step: Surrounding the CHASE-CI Machine Learning Platform With Clouds of GPUs and Non Von Neumann Processors

64-TrueNorth Cluster CHASE-CI

Microsoft Installs FPGAs into Bing Servers & 432 into TAAC for Academic Access

Slide Source: Larry Smarr, UCSD

İlkay ALTINTAŞ, Ph.D. [email protected] Streaming Image Processing using PRP and CHASE-CI

UC San Diego Jaffe Lab (SIO) Scripps Plankton Camera Off the SIO Pier with Fiber Optic Network Over 1 Billion Images So Far! Requires Machine Learning for Automated Image Analysis and Classification ”We are using the FIONAs for image processing... this includes doing Particle Tracking Velocimetry that is very computationally intense.”-Jules Jaffe

Source: Larry Smarr, Calit2, Jules Jaffe, SIO

İlkay ALTINTAŞ, Ph.D. [email protected] Product

Open Water

Developed, Open Space

Developed, Low Intensity

Developed, Medium Intensity Vegetation Classification Developed High Intensity Barren Land (Rock/Sand/Clay)

Mixed Forest using Satellite Imagery Shrub/Scrub Grassland/Herbaceous

Sedge/Herbaceous using CHASE-CI Cultivated Crops Pipeline Woody Wetlands

Data Processing

Cluster Architecture

İlkay ALTINTAŞ, Ph.D. [email protected] Cardiac MRI Analysis on PRP Problem Statement Overall Goal: Develop algorithms to automate and optimize the steps within the analyIsisn of fMRrI Iamagsest tor ucture Architecture include: Image Processing, Region of Interest Extraction, Left Ventricle Segmentation, & Volume Calculation

Deep Learning End-Systolic Volume End-Diastolic Volume Model Cardiac Image Analysis

Impact: ● Reduces diagnosis time Cardiac Image Analysis ● Improves consistency in assessing cardiac structural parameters and reduces the variation in the cardiac image analysis ● Produce a repeatable process to classify, segment, or detect the LV End to End Process Methodology 9 June 2018

6/9/2018 DataB ySpcasies n&c cero p&p inEgn tog ineering, DSE 260 CZaoopmsintgo, nRoeta Ptinrgo ajnedc t 3 176x176 Shifting S1, S2, Capstone Group 4: Acknowledgements: S3 Data Augmentation Ehab Abdelmaguid Advisors: Marcus Bobar Image Preprocessing ROI Extraction Jolene Huang Eric Carruth 1 2 3 4 Mai Nguyen 9 JSuannjaey K2e0nc1ha8reddy Dr. Evan Muse

Ilkay Altintas

1

2

7

5 2

6 Disha Singla Dylan Uys 6

5 ROI Images

6

2 5

6 256 2 176 Laura Wilke Gary Cottrell 256 5 6 Processed / Normalized Images Ca1p0 sxt Imoange Group 4: Acknowledgements: DICOM & & Labels Files 1 Nifti files Ehab Abdel6m/9ag/2u0id18 AdvisorsD: ata Science & EngineeMrianrgc,u Ds SBEo b2a6r0 Capstone Project Segmentation Jolene Huang Eric Carruth Segmentation post Area/Vol computation Mai Nguyen 4 5 processing 6 Sanjay Kencharedd7y Dr. Evan Muse Ilkay Altintas Disha Singla Dylan Uys Results 6/9/2018 Data Science & Engineering, DSE 260 Capstone Project 16 Processed Laura Wilke Gary Cottrell Segmented Segmented Image Files Image Files 6/9/2018 Data Science & Engineering, DSE 260 Capstone Project 1 6/9/2018 Data Science & Engineering, DSE 260 Capstone Project 6 İlkay ALTINTAŞ, Ph.D. [email protected] Summary

• New applications need intelligent dynamic behavior • Integrated research in sciences is hard to achieve with existing CI as it requires new middleware and software • Multi-disciplinary engineering, generalizable techniques and systems approaches would scale to different applications • Collaboration needs to be managed and augmented as a measurable system component • Networked science can be achieved by making networks a composable service along with computing and storage

İlkay ALTINTAŞ, Ph.D. [email protected] Some Other Slightly Distracting Final Points

• The science of computing itself needs to be data-driven! • I’m not talking about the science it is enabling • Team integration and collaborative design is needed from the beginning • We need tools to match it! • Industry is already doing everything I have been talking about at different levels • The problem is the ability to combine and create tools for the HPC problems that only exist in science problems

İlkay ALTINTAŞ, Ph.D. [email protected] İlkay ALTINTAŞ, Ph.D. [email protected] Contact: Ilkay Altintas, Ph.D. Email: [email protected]

The presented work is collaborative work with many wonderful individuals, and parts of it are funded by NSF, DOE, NIH, UC San Diego and various industry partners.

İlkay ALTINTAŞ, Ph.D. [email protected]