Unleash your inner (data) scientist: The ability and audacity to scale your science with extensible

Nirav Merchant
The & iPlant Collaborative
[email protected]

Topic Coverage

• The “Big Data” and “Data Scientist” wave
• What is cyberinfrastructure (CI)
• Delivering a pragmatic CI ecosystem
• What has the community built with our CI
• Lifecycle of research and innovation
• Continuing education and learning with CI
• Future thoughts and challenges

Science Paradigms

1. A thousand years ago: science was empirical, describing natural phenomena and observations
2. Last few hundred years: a theoretical branch using models and generalizations
3. Last few decades: a computational branch simulating complex phenomena
4. Today: data exploration (eScience) unifies theory, experiment, and simulation

Based on the transcript of a talk given by the late Jim Gray to the National Research Council – Computer Science and Telecommunication Board in Mountain View, CA, on January 11, 2007

The Fourth Paradigm: Data-Intensive Scientific Discovery

• Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.

• The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and technologies.

http://research.microsoft.com/en-us/collaboration/fourthparadigm/

The Discovery Lifecycle

The Fourth Paradigm: Data-Intensive Scientific Discovery

Evolution of X-Info

• The evolution of X-Info and Comp-X for each discipline X (e.g., Bio-Informatics, Computational-Biology)
• How to codify and represent our knowledge
• The Generic Problems:
  • Data ingest
  • Managing a petabyte
  • Common schema
  • How to organize it
  • How to reorganize it
  • How to share it with others
  • Query and Vis tools
  • Building and executing models
  • Integrating data and literature
  • Documenting experiments
  • Curation and long-term preservation

The Fourth Paradigm: Data-Intensive Scientific Discovery

Paradigm Shift

• Classic paradigm: You produce data, analyze, and interpret it (end to end)
• Conventional paradigm: Consortia/centers produce data and you consume it
• New paradigm: Consortia/centers have produced data and are creating “cyberinfrastructure” to tackle the “grand challenge”


Big Data

• Extracting meaningful results from vast amounts of data (linked data)
• Big data “information assets” demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
• “Big Data” Is Only the Beginning of Extreme Information Management
• Big Data Technology: All Is Not New

Attributed to Gartner Consulting

A few words about “Big Data” and “Data Science”

The 2014 Gartner Technology Hype-Cycle
http://www.gartner.com/newsroom/id/2819918

Simple Formula for Success


The Reality

• Excel, R
• Perl
• Python
• ArcGIS
• Java, Ruby
• Fortran, C, C#
• C++, MATLAB
• Etc.

+

• Amazon
• Azure
• Rackspace
• Campus HPC
• XSEDE
• etc.

+

and lots of glue…

Simple Formula

http://cloudtweaks.com/2011/05/the-lighter-side-of-the-cloud-data-transfer/

Rise of the “data janitors”

The relevance

• Bioinformatics has become too central to biology to be left to specialist bioinformaticians.
• Biologists are all bioinformaticians now.
(Lincoln Stein, Dec. 2008)

http://genomebiology.com/2008/9/12/114

iPlant Collaborative: Vision

Enable life science researchers and educators to use and extend cyberinfrastructure

www.iPlantCollaborative.org

The iPlant Collaborative
We are a Cyberinfrastructure

Platforms, tools, datasets
Storage and compute
Training and support
From data to discovery

The iPlant Collaborative
And a virtual organization

• Developer Expertise
• Computational Capacity
• Science Domain Expertise
• Training
• Administrative and Organization

iPlant Collaborative: CI for Scalable Science

• Facilitating the 4A’s of “Computational Thinking” approaches for Life Sciences: Abstraction, Automation, Ability and Audacity
• Allowing researchers and educators to establish and manage data-driven collaborations: supporting distributed teams and virtual organizations (VO) at global scale
• Making efficient and coordinated use of CI resources from national, regional, institutional and commercial providers: NSF XSEDE, iPlant, campus HPC and high-bandwidth connections to commercial cloud providers
• Adopting best practices from science domains where key CI challenges have been solved: Astronomy, Particle Physics, etc.
• Community-driven, self-provisioning, extensible and open source: development and prioritization driven through community engagement, active engagement with CISE communities

iPlant Collaborative: Platform Philosophy

• Strive to provide the CI Lego blocks
  • Danish 'leg godt': 'play well'
  • Also translates as 'I put together' in Latin
• If desired functionality is not available, the community can craft their own by using and extending iPlant CI components (like Lego blocks)
• Through these extensible and customized platforms, create an ecosystem of interoperable tools that benefit the broad community (and not just a few lab groups)
• Provide the tools to allow the community to manage their digital assets (cloud, HPC, etc.)
• Improve Computational Productivity

Who did we build it for?

iPlant: Platform for Big Data Collaborations

iPlant Collaborative: Products

Ready to use Platforms

Extensible Services

Established CI Components

Ease of use
Foundational Capabilities

iPlant: Cohesive Platform for Big Data lifecycle

Researchers like to share!

• User Statistics
  • ~27,000 user accounts
  • 4,900 users with data
  • 2,600 users (53% of users with data) made at least 1 share
  • 2,100 shares per user
  • 42 million files (58% shared)
  • 59 million (1.1 million/month) shares
• Community Data Statistics
  • 5 million files
  • 55 million (1.0 million/month) shares
• ~1.1 PB of user-managed data
• Our users consume 5M+ SU annually and more (we graduate them to compete for their own allocations from XSEDE)

How is it being used?

• Users build their own systems (powered by iPlant components but managed by them)
• Consume specific components (a la carte: data store, Atmosphere)
• Directly use applications (DE)
• Custom-design appliances (Atmosphere)
• Publish their findings (PNAS, Nature)
• Advocate use
• Create learning material and courses

iPlant CI: What is the community building?

• Many “1000s of -omes” projects manage their data and analysis
• Execute large-scale workflows (25-50 TB of data, a million+ CPU hours)
• Data infrastructure to coordinate digitization efforts for multiple sites
• Sharing, visualizing (3D) and analyzing high-resolution microscopy images (40K x 40K) via web browser
• Learning material, new course work, custom applications

And it goes way beyond plants and life science

iPlant Collaborative: Training data scientists

• Partnership with Software Carpentry and Data Carpentry to provide the best practices necessary to make efficient use of CI
• Allowing individual researchers and educators to utilize data and computational infrastructure at scale (and encounter real challenges)
• Community-contributed material (built on iPlant CI)

Applied Cyberinfrastructure Concepts (ACIC)

• Semester-long, project-based learning course: introduces fundamental concepts, tools and resources for effectively managing common tasks associated with analyzing large datasets.
• Graduate + undergraduate course working on REAL research workflows where scalability is a bottleneck
• Provide familiarity with cyberinfrastructure (CI) resources available at the University of Arizona campus, iPlant Collaborative, NSF XSEDE centers, and cloud (FutureGrid and commercial providers such as Amazon).
• Learning to apply relevant CI skills (for the final project) and developing wiki-based documentation of these best practices.
• Learning how to collaborate effectively in interdisciplinary team settings.
• Deliver a functional solution to the stakeholder

From research question to reality

Why is it valuable?

• Users are able to overcome data and computational bottlenecks
• Share data of ANY size with ANYONE
• Connect data and compute on a single platform
• Manage their data and computations regardless of scale
• Build their own apps and solutions (create their own community: iAnimal, iVirome)
• Create custom appliances

iPlant: What worked

• All major CI components have seen steady adoption (with few exceptions)
• “Think tank to do tank” transition was rapid
• Evolved into a technology proving ground
• Take research products (NSF funded) to production use for our community
• Running infrastructure is not fun, building is. Allowing people to focus on science (while we streamline CI)

• Evolution of training (Software Carpentry)
• Sharing/collaboration
• Give people an exit strategy (options) and they are happy to adopt the solution
• Provide feedback to CI component creators to improve (usability)
• Expectation management: do not expect the same experience (cable cord cutting vs. Netflix/Hulu)

What did not work

• Managing distributed teams is harder in a VO (load balancing, enthusiasm, etc.)
• Technology lifecycle is not synchronized across all products
• Relying on multiple providers for a solution is challenging (downtimes)
• Changing/evolving needs of the community are hard to predict
• Growth of users outpaces our cloud capabilities (see tweets)

Even the tech geeks notice

Connect with iPlant!

Get an account: http://user.iplantcollaborative.org
Email us: [email protected]
Questions: http://ask.iplantcollaborative.org
Twitter: @iPlantCollab #iPlant
Facebook: facebook.com/iPlantCollab
LinkedIn: iplant.co/iPlantCollabLinkedIn
Google+: iplant.com/iPlantGooglePlus

Luck favors the brave
Analysis favors the organized