Empowering Data-driven Discovery with a Lightweight Provenance Service for High Performance Computing

Yong Chen
Associate Professor, Computer Science Department
Director, Data-Intensive Scalable Computing Laboratory
Site Director, Cloud and Autonomic Computing Center
Texas Tech University

NITRD’s MAGIC Webinar, April 3, 2019

Data Challenges

Scientific discovery becomes highly data intensive (“big data”)
Both experimental data and observational data
More real-world examples

Factor of 1000x increase in less than a decade!

Present day real world:
§ Phones: 100+ Gigabytes
§ Science and Business: 100s to 10,000s of Petabytes

Reasons behind Data Revolution

Rapid growth in computing capability has made data acquisition and generation much easier
§ Especially when compared with the much slower increase in I/O system bandwidth
High-resolution, multi-model scientific discovery requires and produces much more data
The need to mine insights from large amounts of low-entropy data has increased substantially over the years
§ Data-driven science vs. model-driven (computational) science
Scientific breakthroughs are increasingly powered by advanced computing (HPC) plus data understanding capabilities

Our Vision

To create a holistic collection, management, and analysis software infrastructure of provenance data
§ Lightweight Provenance Service for high performance computing

Objectives
§ Run as an always-on service to collect and manage provenance for batch jobs transparently
§ Capture comprehensive provenance with accurate causality to support a wide range of use cases
§ Provide easy-to-use analysis tools for scientists and system administrators to explore and utilize the provenance

What is Provenance

In general, provenance is the documented history of an object and is particularly useful as evidence for the originality of an artwork

Little Dancer Aged Fourteen
1. Degas, Edgar (created 1878-1881)
2. René De Gas (heritage 1917)
3. Adrien-Aurélien Hébrard (a contract 5/3/1918)
4. Nelly Hébrard (heritage 1937)
5. M. Knoedler & Company, Inc. (consigned 1955)
6. Paul Mellon (purchased 5/1956)
7. National Gallery of Art (bequest 1999)

- From National Gallery of Art website

In computer science, provenance means the lineage of data, including processes that act on data and agents responsible for those processes.

What is Provenance

Entities and their attributes:
§ user: name, id, group, permission, …
§ machine: name, ip_addr, dc, rack, …
§ process: id, job, machine, reads, writes, start_ts, finish_ts, …
§ job: id, params, config, inputs, outputs, start_ts, finish_ts, …
§ file: name, location, permission, parent, children, …

Relationships connect these entities.

Provenance is data lineage from all entities and the relationships among all elements that contribute to the existence of the data.

How to Represent Provenance: A Graph-based Model for HPC Provenance Data

Based on Property Graph Model

[Figure: example property graph. Three vertices (name: Alice, name: Bob, name: Chris, each with an age property) are connected by "knows" edges that carry their own properties (type: partners, type: colleagues). Labels: Vertex, Edge, Properties/Attributes.]

Mapping HPC provenance onto a property graph model

[Figure: HPC provenance as a property graph. User Entity vertices (name: sam, group: cgroup; name: john, group: admin), Execution Entity vertices, and File Entity vertices are connected by edges that carry properties such as write size.]

How to Represent Provenance: A Graph-based Model for HPC Provenance Data
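Mapping HPC provenance onto a property graph, with users, executions, and files as vertices and key-value properties on both vertices and edges, can be sketched as follows. This is a minimal in-memory illustration, not the LPS implementation: the class name `ProvenanceGraph`, the vertex ids, the `size_bytes` property, and the exact spelling of the reverse labels are assumptions made for the example.

```python
# Reverse labels for the basic relationships (spellings are illustrative).
REVERSE = {"run": "wasRunBy", "read": "wasReadBy", "write": "wasWrittenBy"}


class ProvenanceGraph:
    """A tiny property graph: vertices and edges both carry key-value properties."""

    def __init__(self):
        self.vertices = {}   # vertex id -> {"kind": ..., "props": {...}}
        self.edges = []      # list of (src id, label, dst id, props dict)

    def add_vertex(self, vid, kind, **props):
        self.vertices[vid] = {"kind": kind, "props": props}

    def add_edge(self, src, label, dst, **props):
        self.edges.append((src, label, dst, dict(props)))
        # Also store the reversed relationship so the graph can be walked both ways.
        if label in REVERSE:
            self.edges.append((dst, REVERSE[label], src, dict(props)))

    def neighbors(self, vid, label):
        """Return the ids reachable from `vid` over edges with the given label."""
        return [dst for src, lbl, dst, _ in self.edges if src == vid and lbl == label]


# Hypothetical example data: one user runs one execution that writes one file.
graph = ProvenanceGraph()
graph.add_vertex("user:john", "User", name="john", group="admin")
graph.add_vertex("exec:app-01", "Execution", name="app-01")
graph.add_vertex("file:dset-1", "File", name="dset-1")
graph.add_edge("user:john", "run", "exec:app-01")
graph.add_edge("exec:app-01", "write", "file:dset-1", size_bytes=4096)

print(graph.neighbors("file:dset-1", "wasWrittenBy"))  # ['exec:app-01']
```

Storing each reversed edge explicitly trades memory for simple backward traversal, which is what lineage questions ("who produced this file?") need.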

Entity => Vertex
§ Data Object: represents the basic data unit in storage
§ Execution: represents applications, including Jobs, Processes, and Threads
§ User: represents the real end user of a system
§ Users can define their own entities

Relationship => Edge
§ The relationship is read from Row to Column
§ Reversed relationships are also defined
§ belongs/contains is general
§ Users can define their own relationships

[Table: relationships among User, Execution, and Data Object. Entries recoverable from the slide: run, wasRunBy, exe, exedBy, read, write, wasReadBy, wasWrittenBy, belong, contain.]

Attributes => Property
§ Work on both Entity and Relationship
§ Stored as Key-Value pairs attached to vertices and edges
§ Users can define their own properties

Requirements on Managing Provenance in HPC

• Performance Requirements:
  • HPC users are performance sensitive
  • Provenance management overhead should stay below a 1% slowdown and a 1 MB memory footprint per core
• Coverage Requirements:
  • Provenance is generated from multiple physical locations
  • Provenance can have various granularities
• Transparency Requirements:
  • Users should not have to change or recompile their code for provenance
  • More strictly, users should not be able to disable it when provenance is used in critical missions

How to Collect and Manage Provenance

LPS Overall Architecture in HPC

[Figure: LPS architecture. Labels: Compute Nodes and Login Nodes (each a single server with user processes and a local LPS Aggregator in User Space, and the System Call Layer, /proc, and the LPS Tracer in Kernel Space); a Distributed LPS Cluster of LPS Builder instances; Parallel File System servers; Data and Provenance flows.]

How to Collect and Manage Provenance

LPS Tracer

LPS leverages kernel instrumentation to collect detailed runtime events to build provenance [among three methods]

Method                   Transparency   No Privilege   Dynamic
Provenance APIs          X              ✓              X
Library Wrapper          X              ✓              X
Kernel Instrumentation   ✓              X              ✓

To support flexible granularity, LPS needs to enable/disable probing of read/write events
§ Dynamic Probe
§ Two kernel instrumentation scripts (SystemTap)
§ The second one only probes read/write events
§ Can be disabled/enabled accordingly at runtime

How to Collect and Manage Provenance

LPS Aggregator
1. Monitoring overhead and direct granularity change
2. Pruning noisy events to improve performance

• Instrumentation introduces overheads
  • Instrument Read/Write towards an application issuing 1M 1-byte writes
• The aggregator monitors read/write frequency
  • a counter records the events
  • a timer that resets the counter
  • notify and change granularity

[Figure: bar chart with a vertical axis ranging from 100 to 500, comparing three configurations: No Trace, Empty Read/Write Probe, and LAPS Read/Write Probe.]

How to Collect and Manage Provenance
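The aggregator's counter-and-timer monitoring described above can be illustrated with a small sketch. It is an assumption-laden illustration rather than the LPS code: the class name `ReadWriteMonitor`, the threshold, the window length, and the `"coarse"` notification value are all hypothetical, and the slides do not say how or when fine-grained probing is re-enabled, so that part is left out.

```python
class ReadWriteMonitor:
    """Counts read/write events per time window and requests coarser
    granularity once the count exceeds a threshold (hypothetical policy)."""

    def __init__(self, window_s, threshold, notify):
        self.window_s = window_s      # timer period, in seconds
        self.threshold = threshold    # max events tolerated per window
        self.notify = notify          # callback invoked on a granularity change
        self.count = 0
        self.window_start = 0.0
        self.fine_grained = True

    def on_event(self, now):
        """Record one read/write event observed at time `now` (seconds)."""
        # Timer: once the current window has expired, start a new one
        # and reset the counter.
        if now - self.window_start >= self.window_s:
            self.window_start = now
            self.count = 0
        self.count += 1
        # Too many events in this window: ask for coarser granularity once.
        if self.fine_grained and self.count > self.threshold:
            self.fine_grained = False
            self.notify("coarse")


changes = []
monitor = ReadWriteMonitor(window_s=1.0, threshold=3, notify=changes.append)

# Three events in the first window stay within the threshold.
for t in (0.0, 0.1, 0.2):
    monitor.on_event(t)

# A burst of four events in a later window exceeds it.
for t in (1.5, 1.6, 1.7, 1.8):
    monitor.on_event(t)

print(changes)  # ['coarse']
```

In a real deployment the notification would presumably disable the read/write probe, so the monitor itself stops receiving those events after the switch.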

LPS Aggregator
1. Monitoring overhead and direct granularity change
2. Pruning noisy events to improve performance

[Figure: raw system events from kernel instrumentation. A dense graph of processes (for example bash, firefox, sed, dirname, basename, uname, mkdir, ls) and the files they touch (shared libraries, fonts, icon and image files, Emacs Lisp files, browser cache entries, /proc and /sys entries), with many nodes labeled UNDEFINED.]

How to Collect and Manage Provenance

LPS Aggregator
1. Monitoring overhead and direct granularity change
2. Pruning noisy events to improve performance

• Representative Executions
  • Executions that users care about the most
  • Eliminate unimportant child processes
  • Eliminate helper child processes
• Events of non-R executions are attributed to their ancestor R executions

[Figure: process tree rooted at Bash (pid: 2322) containing mpiexec, hydra_pmi_proxy, sed, and three ./mpitest processes, with pids in the range 6439 to 6449.]

How to Collect and Manage Provenance
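The pruning rule above, where events of non-representative executions are attributed to their nearest representative ancestor, can be sketched as a walk up the process tree. The function name, the pids, the tree shape, and the event counts below are hypothetical; which executions count as representative is taken as given input here.

```python
def fold_to_representative(parent, representative, event_counts):
    """Attribute each process's events to itself if it is representative,
    otherwise to its nearest representative ancestor."""
    folded = {}
    for pid, n_events in event_counts.items():
        current = pid
        # Walk up the process tree until a representative execution is found.
        while current is not None and current not in representative:
            current = parent.get(current)
        if current is not None:
            folded[current] = folded.get(current, 0) + n_events
    return folded


# Hypothetical process tree: bash -> mpiexec -> hydra_pmi_proxy -> 2 x mpitest,
# plus a helper `sed` started directly by bash.
parent = {101: 100, 102: 101, 103: 102, 104: 102, 105: 100}
names = {100: "bash", 101: "mpiexec", 102: "hydra_pmi_proxy",
         103: "./mpitest", 104: "./mpitest", 105: "sed"}
representative = {100, 101, 103, 104}       # helpers 102 and 105 are pruned
event_counts = {101: 1, 102: 5, 103: 10, 104: 7, 105: 2}

folded = fold_to_representative(parent, representative, event_counts)
print(folded)  # {101: 6, 103: 10, 104: 7, 100: 2}
```

The helper processes disappear from the result, but their events are not lost: they show up in the totals of `mpiexec` and `bash`.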

LPS Builder

• Local aggregators generate isolated provenance events
  • Workflows or jobs span multiple servers
• A global identifier challenge
  • Matching identities across different machines needs a unique ID
  • Unique IDs are generated by specific software, so there is no transparency
• A compromise solution
  • LPS relies on specific environment variables to match identities
  • LPS must be told the names of these environment variables
• Build provenance with versioning
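As a rough illustration of the compromise above, the sketch below groups isolated per-node execution records by the value of a configured environment variable. The record layout, the function name `group_by_identity`, and the choice of `SLURM_JOB_ID` as the variable are assumptions for the example; the slides do not say which variables LPS uses.

```python
def group_by_identity(local_records, id_variable):
    """Group (host, pid) pairs from different machines that share the same
    value of the configured environment variable."""
    groups = {}
    for record in local_records:
        identity = record["env"].get(id_variable)
        if identity is None:
            # Without the variable, the record cannot be matched across machines.
            continue
        groups.setdefault(identity, []).append((record["host"], record["pid"]))
    return groups


# Hypothetical records as two local aggregators might report them.
records = [
    {"host": "node1", "pid": 10, "env": {"SLURM_JOB_ID": "4242"}},
    {"host": "node2", "pid": 20, "env": {"SLURM_JOB_ID": "4242"}},
    {"host": "node2", "pid": 21, "env": {}},
]

jobs = group_by_identity(records, "SLURM_JOB_ID")
print(jobs)  # {'4242': [('node1', 10), ('node2', 20)]}
```

Passing the variable name as a parameter mirrors the point that LPS has to be told which environment variables carry the identity.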

What Can We Do with Provenance

HPC provenance is useful in the simulate-analyze-publish science discovery cycle

Evaluate a new system
§ Repeatedly run the same benchmark (typically time consuming)
§ Calculate avg and std for comparing

Questions Provenance Helps
§ If unexpected variations occur, how to ensure they are from your system or from your evaluations?
§ Can others easily repeat the same evaluations?
§ ...

[Figure: provenance graph of repeated ior executions, Shared Libs, an emacs execution, and two versions of /scratch/joe/ior.conf (Version: 0 and Version: 1).]
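One way provenance could help answer the first question is to group benchmark runs by the version of the configuration file each run actually read before computing the average and standard deviation. The sketch below assumes each run record carries a `config_version` field taken from provenance; the field names and all numbers are made up for illustration.

```python
import statistics


def summarize_by_version(runs):
    """Return {config version: (mean, sample std)} of the measured bandwidth."""
    by_version = {}
    for run in runs:
        by_version.setdefault(run["config_version"], []).append(run["bandwidth"])
    return {
        version: (statistics.mean(values), statistics.stdev(values))
        for version, values in by_version.items()
    }


# Illustrative numbers only: three runs per configuration version.
runs = [
    {"config_version": 0, "bandwidth": 100.0},
    {"config_version": 0, "bandwidth": 102.0},
    {"config_version": 0, "bandwidth": 98.0},
    {"config_version": 1, "bandwidth": 80.0},
    {"config_version": 1, "bandwidth": 82.0},
    {"config_version": 1, "bandwidth": 78.0},
]

summary = summarize_by_version(runs)
for version, (mean, std) in sorted(summary.items()):
    print(f"config version {version}: mean={mean:.1f} std={std:.1f}")
```

Pooling all six runs would show a large spread, while the per-version summary shows that the variation tracks a configuration change rather than the system under test.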

What Can We Do with Provenance

Other use cases (not limited to)

User/project/job audit
§ Search -> Traversal -> Filter -> Traversal
§ Provenance graph contains
  • run relationships between Users and Executions
  • read/write relationships between Executions and Data Objects
  • additional attributes are also recorded with these relationships

Organization of data space
§ Present a logical layout of data sets to users
  • In addition to traditional POSIX-style tree-structure directory

Data sharing, publishing

Reproducibility, workflow management

What Can We Do with Provenance
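The Search -> Traversal -> Filter -> Traversal pattern for an audit can be sketched over a small in-memory graph: search for a user, traverse run edges to executions, filter executions by an attribute, then traverse write edges to data objects. The vertex ids, timestamps, and file names below are hypothetical, and the helper functions are illustrative rather than an actual LPS query API.

```python
# Hypothetical provenance graph: vertices with properties, labeled edges.
vertices = {
    "u1": {"kind": "User", "name": "joe"},
    "u2": {"kind": "User", "name": "sam"},
    "e1": {"kind": "Execution", "name": "ior", "start_ts": 100},
    "e2": {"kind": "Execution", "name": "ior", "start_ts": 200},
    "e3": {"kind": "Execution", "name": "emacs", "start_ts": 150},
    "d1": {"kind": "DataObject", "name": "out.0"},
    "d2": {"kind": "DataObject", "name": "out.1"},
    "d3": {"kind": "DataObject", "name": "notes.txt"},
}
edges = [
    ("u1", "run", "e1"), ("u1", "run", "e2"), ("u2", "run", "e3"),
    ("e1", "write", "d1"), ("e2", "write", "d2"), ("e3", "write", "d3"),
]


def search(kind, **props):
    """Search: find vertex ids of a given kind whose properties match."""
    return [vid for vid, v in vertices.items()
            if v["kind"] == kind and all(v.get(k) == val for k, val in props.items())]


def traverse(sources, label):
    """Traversal: follow edges with the given label from the source vertices."""
    wanted = set(sources)
    return [dst for src, lbl, dst in edges if src in wanted and lbl == label]


users = search("User", name="joe")                                   # Search
executions = traverse(users, "run")                                  # Traversal
recent = [e for e in executions if vertices[e]["start_ts"] >= 150]   # Filter
written = traverse(recent, "write")                                  # Traversal

print([vertices[d]["name"] for d in written])  # ['out.1']
```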

The purpose of computing is insight, not numbers.

- Richard Hamming, 1962

Summary

§ Data-driven discovery has become the new driving force for sciences, widely cited as the 4th paradigm

§ We envision that a holistic collection, management, and analysis software infrastructure for provenance data can help in understanding data and mining insights

§ We are working on a Lightweight Provenance Service infrastructure for HPC systems based upon prior R&D

§ We call for more R&D efforts in this space to address challenges collectively as a community

Acknowledgements

§ Prof. Dong Dai, UNCC
§ Dr. Robert Ross, ANL
§ Dr. Philip Carns, ANL
§ Prof. William L. Hase, TTU
§ Prof. Brian Ancell, TTU
§ Prof. Alan Sill, TTU
§ Ph.D. student: Mr. Misha Ahmadian

This project is supported in part by NSF Office of Advanced Cyberinfrastructure (OAC) through the Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Program (Program Director: Dr. Vipin Chaudhary).

DISCL@TTU Team

Acknowledgements to our colleagues and collaborators

Thank You and Q&A

For more information please visit: http://discl.cs.ttu.edu / https://nsfcac.org/ [email protected]


"Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Networking and Information Technology Research and Development Program."

The Networking and Information Technology Research and Development (NITRD) Program

Mailing Address: NCO/NITRD, 2415 Eisenhower Avenue, Alexandria, VA 22314

Physical Address: 490 L'Enfant Plaza SW, Suite 8001, Washington, DC 20024, USA Tel: 202-459-9674, Fax: 202-459-9673, Email: [email protected], Website: https://www.nitrd.gov