Collaborative data-driven science
Gerard Lemson, Alex Szalay, Mike Rippin
DIBBS/SciServer
• Started with the SDSS SkyServer
• Built in a few months in 2001
• Goal: instant access to rich content
• Idea: bring the analysis to the data
• Interactive access at the core
• Much of the scientific process is about data
  ◦ Data collection, data cleaning, data archiving, data organization, data publishing, mirroring, data distribution, data analytics, data curation…
Form Based Queries
Image Access
Custom SQL
Batch Queries, MyDB
Cosmological Simulations
Turbulence Database
Web Service Access through Python
• Interactive science on petascale data
• Sustain and enhance our astronomy effort
• Create scalable open numerical laboratories
• Scale system to many petabytes
• Deep integration with the “Long Tail”
• Large footprint across many disciplines
  ◦ Also: Genomics, Oceanography, Materials Science
• Use commonly shared building blocks
• Major national and international impact
• Offer more computing resources server side
• Augment and combine SQL queries with easy-to-use scripting tools
• Heavy use of virtual machines
• Interactive portal via iPython/Matlab/R
• Batch jobs
• Enhanced visualization tools
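As a rough illustration of combining SQL with a scripting layer, the sketch below lets SQL do the selection and per-row arithmetic while Python adds a statistic on top. It is not SciServer code: an in-memory SQLite table stands in for a server-side catalog, and the table, column names, and values are all made up for the example.

```python
import sqlite3
import statistics

def build_demo_catalog():
    """Create an in-memory stand-in for a catalog table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects (obj_id INTEGER, kind TEXT, mag_g REAL, mag_r REAL)"
    )
    rows = [  # made-up values, for illustration only
        (1, "galaxy", 18.2, 17.5),
        (2, "galaxy", 19.0, 18.1),
        (3, "star", 15.3, 15.1),
        (4, "galaxy", 20.4, 19.6),
    ]
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?)", rows)
    return conn

def median_color(conn, kind):
    """Median g-r color for one object kind."""
    # SQL filters the rows and computes the color next to the data.
    cur = conn.execute(
        "SELECT mag_g - mag_r FROM objects WHERE kind = ?", (kind,)
    )
    colors = [row[0] for row in cur]
    # The script then computes a statistic on the (much smaller) result.
    return statistics.median(colors)

conn = build_demo_catalog()
print(round(median_color(conn, "galaxy"), 3))
```

The same split applies at scale: the query narrows the data where it lives, and only the reduced result reaches the script.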
• CasJobs
  ◦ SQL, MyDB, batch
  ◦ FileDB: raw data access from within RDB
• SciDrive
  ◦ Dropbox-like, on-drop event handling
• SciServer/compute
  ◦ Interactive/batch Python, R, Matlab in Docker container
• MyScratch (File & DB)
• SSO on all components
• All published through REST
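With single sign-on and REST in front of every component, a client would typically obtain one token and attach it to each request. The sketch below only builds such a request and does not send it. The base URL, the endpoint path, and the JSON field names are hypothetical, and the `X-Auth-Token` header name is an assumption borrowed from the OpenStack Keystone convention rather than a documented SciServer detail.

```python
import json
import urllib.request

# Hypothetical base URL, for illustration only.
BASE_URL = "https://sciserver.example.org/api"

def build_query_request(token, context, sql):
    """Build (but do not send) an authenticated POST carrying a SQL query."""
    # Hypothetical payload shape: a database context plus the query text.
    body = json.dumps({"context": context, "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/queries",  # hypothetical endpoint path
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name; the same token would be reused
            # for every component behind the single sign-on.
            "X-Auth-Token": token,
        },
    )

req = build_query_request("example-token", "MyDB", "SELECT 1")
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the sketch easy to inspect without network access.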
[Figure: SciServer architecture diagram. Labels recoverable from the slide: Login Portal, SkyServer, MyScratch Files, MyScratch DB, SciDrive, OpenStack Keystone & Swift, SciScript, CasJobs (Web UI, REST API, Job Scheduler, BatchAdmin, WS client), SkyQuery (Scheduler, Parallel X-Match Engine, SkyNode Registry), MyDB servers, SDSS DB servers, and miscellaneous DB servers, joined by REST APIs and linked server connections. Databases shown: SDSS DR1 to DR10, Turbulence, Cosmology, GLUSEEN, 2MASS, Galex, WISE, FIRST, ROSAT, USNOB, IRAS.]
• Jupyter Notebooks in Docker
  ◦ http://www.nature.com/news/interactive-notebooks-sharing-the-code-1.16261
  ◦ https://developer.rackspace.com/blog/how-did-we-serve-more-than-20000-ipython-notebooks-for-nature/
• Python, R, Matlab
• Flexible way to attach data sets in volume containers
• Extended to batch jobs
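One way to picture attaching data sets through volume containers is Docker's `--volumes-from` option, which mounts the volumes of a named container into a new one. The sketch below only assembles the command line and does not run Docker. The image name, container names, and script name are hypothetical, and this is not necessarily how SciServer launches its containers.

```python
import shlex

def docker_run_command(image, data_containers, command):
    """Return the argv for `docker run` with data-volume containers attached."""
    argv = ["docker", "run", "--rm"]
    for name in data_containers:
        # --volumes-from mounts every volume defined by the named container.
        argv += ["--volumes-from", name]
    argv.append(image)
    argv += list(command)
    return argv

argv = docker_run_command(
    image="example/notebook-image",                    # hypothetical image
    data_containers=["sdss-data", "turbulence-data"],  # hypothetical containers
    command=["python", "analysis.py"],                 # hypothetical batch script
)
print(shlex.join(argv))
```

Because the command is the only thing that changes, the same function could serve an interactive notebook container or a batch job.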
Astronomy
Materials Science
Materials Science
Turbulence
Genomics
I’ll be very happy to demo and discuss our services