Gerard Lemson, Alex Szalay, Mike Rippin: DIBBS/SciServer, Collaborative Data-Driven Science
• Started with the SDSS SkyServer
• Built in a few months in 2001
• Goal: instant access to rich content
• Idea: bring the analysis to the data
• Interactive access at the core
• Much of the scientific process is about data
  ◦ Data collection, data cleaning, data archiving, data organization, data publishing, mirroring, data distribution, data analytics, data curation…

Form Based Queries
Image Access
Custom SQL
Batch Queries, MyDB
Cosmological Simulations
Turbulence Database
Web Service Access through Python

• Interactive science on petascale data
• Sustain and enhance our astronomy effort
• Create scalable open numerical laboratories
• Scale system to many petabytes
• Deep integration with the “Long Tail”
• Large footprint across many disciplines
  ◦ Also: Genomics, Oceanography, Materials Science
• Use commonly shared building blocks
• Major national and international impact

• Offer more computing resources server side
• Augment and combine SQL queries with easy-to-use scripting tools
• Heavy use of virtual machines
• Interactive portal via IPython/Matlab/R
• Batch jobs
• Enhanced visualization tools

• CasJobs
  ◦ SQL, MyDB, batch
  ◦ FileDB: raw data access from within RDB
• SciDrive
  ◦ Dropbox-like, on-drop event handling
• SciServer/compute
  ◦ Interactive/batch Python, R, Matlab in Docker container
• MyScratch (File & DB)
• SSO on all components
• All published through REST

[Architecture diagram. Component labels shown: Login Portal; SkyServer, Turbulence, and Cosmology web UIs; SciScript; CasJobs UI client; REST and WS clients; REST APIs for MyScratch Files, MyScratch DB, SciDrive (OpenStack Keystone & Swift), and CasJobs; CasJobs Job Scheduler; BatchAdmin Service API; SkyQuery Scheduler with Parallel X-Match Engine; SkyNode REST API; Registry; catalogs SDSS DR1 to DR10, USNOB, IRAS, FIRST, ROSAT, 2MASS, Galex, WISE, GLUSEEN, Cosmology, Turbulence; SDSS DB servers, misc. DB servers, MyDB server; linked server connections.]

• Jupyter Notebooks in Docker
  ◦ http://www.nature.com/news/interactive-notebooks-sharing-the-code-1.16261
  ◦ https://developer.rackspace.com/blog/how-did-we-serve-more-than-20000-ipython-notebooks-for-nature/
• Python, R, Matlab
• Flexible way to attach data sets in volume containers
• Extended to batch jobs

Astronomy
Materials Science
Turbulence
Genomics

I’ll be very happy to demo and discuss our services.
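
The slides' central idea, bringing the analysis to the data and combining SQL queries with scripting tools, can be sketched in a few lines. This is only an illustration under stated assumptions, not SciServer code: an in-memory SQLite database stands in for a server-side catalog, and the table name photo_obj, its columns, the sample rows, and the helper functions are all hypothetical. The aggregation runs inside the database, so only a small summary reaches the script, which then post-processes it in Python.

```python
import sqlite3

# Hypothetical sample rows: (object id, object type, r-band magnitude).
# SQLite stands in for a server-side catalog; no SciServer API is used here.
ROWS = [
    (1, "GALAXY", 17.2),
    (2, "STAR", 15.1),
    (3, "GALAXY", 18.4),
    (4, "STAR", 16.3),
    (5, "GALAXY", 16.9),
]


def build_catalog():
    """Create an in-memory database holding a small, made-up object table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE photo_obj ("
        "obj_id INTEGER PRIMARY KEY, obj_type TEXT, r_mag REAL)"
    )
    conn.executemany("INSERT INTO photo_obj VALUES (?, ?, ?)", ROWS)
    return conn


def counts_by_type(conn, max_mag):
    """Aggregate inside the database so only summary rows are returned."""
    query = (
        "SELECT obj_type, COUNT(*), AVG(r_mag) "
        "FROM photo_obj WHERE r_mag < ? "
        "GROUP BY obj_type ORDER BY obj_type"
    )
    return conn.execute(query, (max_mag,)).fetchall()


def summarize(rows):
    """Post-process the small result set in Python (the scripting half)."""
    return {
        obj_type: {"count": n, "mean_r_mag": round(mean, 2)}
        for obj_type, n, mean in rows
    }


conn = build_catalog()
summary = summarize(counts_by_type(conn, max_mag=18.0))
print(summary)
```

The design point is the split of work: the WHERE and GROUP BY clauses run where the data lives, and the script only handles one row per object type, however large the underlying table is.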