The iPlant Collaborative: A Cloud Infrastructure for Plant Biology

Dan Stanzione
Virtual Cloud Summer School, August 2012

What is iPlant?

• The iPlant Collaborative is building a comprehensive informatics infrastructure for plant biology.
• (and lately, some animals as well).
• This rapidly evolving infrastructure is sometimes very visible in your work, and sometimes hides in the background.

The iPlant Collaborative Cyberinfrastructure Philosophy

We have designed iPlant to be consistent with the pillars of CIF21*

✓ High Performance Computing
✓ Data and Data Analysis
✓ Virtual Organization
✓ Learning and Workforce

Why iPlant?

• You’ve heard about KBase already today from Rick Stevens (and we have some similarities), so I’ll skip the heavy biology.
• Just a few slides on why everyone should care specifically about an infrastructure for studying plants…

Cereal Consumption

Rice in Asia = 0.9-1.1 lb/day/person

Corn in US = ~3.3 lb/day/person*

Wheat in Europe = 0.8-1.2 lb/day/person

* Includes entire food chain

World Population Projections

[Figure: world population projections, 1950 to 2100, for the UN High, UN Medium, and UN Low scenarios; vertical axis from 0 to 20]

Land Requirements

• Pre- (>10k years ago)
  – 1 Person Required 6,000 Acres

• Current Demand:
  – 1 Person Requires 1/2 Acre

• Predicted Demand (Year 2050):
  – 1 Person Will Have < 1/3 Acre

Moore’s law for corn

Microprocessors aren’t the only thing that doubles periodically: Farm technology has doubled the corn per acre every 25 years since 1935 (8x improvement in 75 years).
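The doubling claim is easy to check with a line of arithmetic. A minimal sketch, assuming exact doubling every 25 years starting in 1935; the function name and the 2010 end year are illustrative and not from the talk:

```python
def yield_multiplier(years_elapsed: float, doubling_period: float = 25.0) -> float:
    """Growth factor after `years_elapsed` years of steady doubling."""
    return 2.0 ** (years_elapsed / doubling_period)


# 1935 to 2010 is 75 years, i.e. three doubling periods (assumed end year).
multiplier_1935_to_2010 = yield_multiplier(2010 - 1935)
print(multiplier_1935_to_2010)  # 8.0
```

Three doublings give 2 x 2 x 2 = 8, which matches the 8x figure quoted for 75 years.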

[Figure: Corn Production by Year, Bushels per Acre, 1866-Present; vertical axis from 0 to 180. Annotations: yield was flat for centuries before 1900; increased use of hybrids; biotech crops]

Source: USDA Quick Stats Database

Wild versus Early Domesticated Corn

Zea mays ssp. parviglumis versus Zea mays ssp. mays

Domestication has altered Modern Crops

www.iplantcollaborative.org

Yield Gains Are Slowing (Percent change)

[Figure: percent change in yield by decade (1960s, 1970s, 1980s, 1990s), for World and for World excluding Transition Countries; vertical axis from 0 to 30]

Food Prices and Political Instability

• Data strongly correlate commodity food prices with revolutions (dating to at least 1792: “Let them Eat Cake”).
• When prices spiked in 2006-2007, there were food riots around the world, notably in Egypt.
  – Recession brought prices temporarily down.
• Nomura Food Vulnerability Index in September 2010 rated Egypt, Libya, and Tunisia near the top as candidates for food-related unrest.
• Commodity food prices ran up 40% after this report, in the months leading up to the “Arab Spring”.

Trends & MegaTrends

• Western Diet going Global
• Crops used for Bio-Energy
• Population - Growing & Aging (~9 billion by 2050)
• Agricultural Acreage Decreasing
• Climate Changing
• Prices Increasing
  – Leading to political instability…

Trends Indicate Importance of Understanding Plant Biology

What Do These Genes Do?

[Figure: genes with experimentally-demonstrated function as a share of total genes. E. coli: 62%; Yeast: ~30%; Arabidopsis: ~15%; Rice or maize: ~1%]

Keeping up with Science

Publications in PubMed (May 10th, 2010)

United States   7.6 million
England         2.9 million
Germany         1.2 million
Japan           606 thousand
France          491 thousand
Korea           243 thousand
China           227 thousand
India           116 thousand

Published globally in 2009: 846K (>2,300 pubs per day)

How many papers can anyone read in a day?

Biologists Became Developers

Result: An ad hoc software ecosystem

1. Tools separated by compute platform, data format, integration issues, and programming model.

2. Mixture of desktop, command line, database, and web-based tools

3. Labor intensive, fragile solutions devised to reach scientific objectives

4. Little ability to share results, analytical methods

5. Lack of reproducibility

We’ve established…

• Lots of work to do in plant science to meet the global demand for food (and maybe energy, air, pharma).
• Most of this work will be driven by data and computation. Big Data!

Data-intensive biology will mean getting biologists comfortable with new technology…

One key goal in our infrastructure, training and outreach is to minimize the emphasis on technology and return the focus to the biology.

1973: Sharp, Sambrook, Sugden; Gel Electrophoresis Chamber, $250
1958: Matt Meselson & Ultracentrifuge, $500,000

What does iPlant Provide

• DATA
  – iPlant Data Storage: All data large and small
• COMPUTING
  – Large Scale: Up to hundreds of thousands of processors
  – Virtual: “Cloud Style” server hosting
• A Programmer’s Interface
  – Easily embed iPlant resources in your applications
• User Interfaces
  – The Discovery Environment: Integrated Web apps
  – More than 200 applications
  – MyPlant, DNASubway, TNRS, TreeViewer, PhytoBisque, etc.

The iPlant Cyberinfrastructure

End Users
Computational Users
TeraGrid, XSEDE

iPlant as a Cloud

• The design notions of iPlant (5 years ago, before “Cloud” was the cool word) all fit the cloud philosophy:
  – Data and computation are (already?) too big for the desktop.
  – Store in remote datacenters, put the compute there.
  – Download nothing, all interfaces assume “run elsewhere”.
  – Don’t worry about client apps
  – Underlie applications with web-based wherever possible.

iPlant Cloud Services

• The iPlant data store is “cloud-based”
  – Coherent, huge, remote, lots of interfaces.
• The Discovery Environment is the web-based portal to analysis on remote resources, including the original cloud: supercomputer centers.
• The Atmosphere platform provides user control of “traditional” cloud services (virtual machines a la EC2).

Ways to Access iPlant

• Atmosphere: a free cloud computing platform

• Data Store: secure, cloud-based data storage

• Discovery Environment: a web portal to many integrated applications

• DNA Subway: annotation, DNA bar-coding (and more) for science educators

• The API: For programmers embedding iPlant infrastructure capabilities

• Command line: for expert access (through TeraGrid/XSEDE)

The iPlant Discovery Environment

• A rich web client
  – Consistent interface to bioinformatics tools
  – Portal for users who won’t want to interact with lower level infrastructure
• An integrated, extensible system of applications and services
  – Additional intelligence above low level APIs
  – , Collaboration, etc.

Workflows within the DE; Phylogenetics

Trees also present computational challenges

It can take weeks or months to analyze data sets with >100,000 species. Example of iPlant contribution:

NINJA/WINDJAMMER (Neighbor-Joining)
-- NINJA: 216K species, ~8 days
-- WINDJAMMER: 216K species, ~4 hours
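The speedup implied by these two timings can be worked out directly. A small sketch using only the approximate figures quoted above; the variable names are illustrative:

```python
HOURS_PER_DAY = 24

ninja_hours = 8 * HOURS_PER_DAY  # ~8 days for 216K species
windjammer_hours = 4             # ~4 hours for the same 216K species

speedup = ninja_hours / windjammer_hours
print(f"WINDJAMMER is roughly {speedup:.0f}x faster than NINJA")
```

Since both timings are approximate, the result is best read as "on the order of 50x" rather than as a precise benchmark.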

Scalable Computation for High-Throughput Inquiry

• 90,000 compute cores
• Up to 1 TB shared memory
• Growing to ~500,000 cores by end of 2012

[Workflow diagram: High Speed Data Intake; assembly with Trinity, SOAPdenovo, and ABySS; Contig Assessment; BLASTX Assessment; Hybrid Assembly via AMOS; Contig Assessment; Translation and filtering; BLASTX Assessment; InterProScan. Resources shown: TACC Lonestar, TACC Ranger, PSC Blacklight, TACC Corral, EBI Web Services]

Case Study: Community Extensibility

iPlant collaborator Lin Wang (BTI/Cornell) is interested in how genes are regulated along a developmental gradient and under circadian control, and in what types of regulatory factors (cis and trans) modulate these processes. His ultimate goal is to use the developmental/circadian dataset as a guide and resource for increasing crop yield through genetic engineering.

Tuxedo package has limited support for strand-specific libraries

BWA aligner has higher sensitivity (2-5%) plus native gapped alignment support

Solution: Develop a customized, cross-validated pipeline for plant RNAseq

“How can I share my code with the plant science community?”

Integrating community-authored code into a cyberinfrastructure

Take-home Messages

Lin’s code integrated into the iPlant DE in 45 minutes of work, including testing, with no significant changes needed to his scripts

Multiple RNAseq pipelines in the DE represent a marketplace of ideas

iPlant CI as a platform for publishing functional software instances

Case Study: Draft assembly of H. texanum transcriptome

• Part of OneKP project
• 10.9 million 100 bp paired-end RNA reads
• Data housed with iPlant since project inception
• Emblematic of the new class of questions people are working on

A Workflow

• Paired-end assembly using ABySS 1.2.7 on TACC Ranger
• Assess contig metrics (N50, etc.)
• Filter contigs by size (150 bp)
• BLASTX against Plant RefSeq on TACC Ranger
• Compile annotation results on TACC Ranger
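The contig-assessment and size-filter steps can be illustrated with a short sketch. This is not the pipeline used in the case study: the function names and toy contigs are hypothetical, and it assumes "filter by size (150 bp)" means keeping contigs of at least 150 bp.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    if not lengths:
        return 0
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    return 0


def filter_by_size(contigs: Dict[str, str], min_length: int = 150) -> Dict[str, str]:
    """Keep contigs whose sequence is at least `min_length` bases long
    (assumed interpretation of the 150 bp filter)."""
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_length}


# Toy contigs standing in for assembler output (hypothetical data).
contigs = {
    "contig_1": "A" * 400,
    "contig_2": "C" * 300,
    "contig_3": "G" * 200,
    "contig_4": "T" * 100,
    "contig_5": "A" * 50,
}

kept = filter_by_size(contigs)
kept_n50 = n50([len(seq) for seq in kept.values()])
print(sorted(kept))  # ['contig_1', 'contig_2', 'contig_3']
print(kept_n50)      # 300
```

In the toy data the three surviving contigs total 900 bases, and the two longest (400 + 300) already cover more than half of that, so the N50 is 300.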

A Workflow

>Time to break out the terminal emulator, right?

Draft assembly of H. texanum transcriptome

The DNA Subway

The iPlant Data Store

Fast data transfers via parallel, non-TCP file transfer
• Move large (>2 GB) files with ease

Multiple, consistent access modes
• iPlant API
• iPlant web apps
• Desktop mount (FUSE/DAV)
• Java applet (iDrop)
• Command line

Fine-grained ACL permissions
• Sharing made simple

Access and a storage allocation are automatic with your iPlant account

Powered by iPlant

• The iPlant CI is designed as infrastructure. This means it is a platform upon which other projects can build.
• Use of the iPlant infrastructure can take one of several forms:
  – Storage
  – Computation
  – Hosting
  – Web Services
  – Scalability

Powered by iPlant

• Other major projects are beginning to adopt the iPlant CI as their underlying infrastructure (some completely, some in limited ways):
  – BioExtract (web service platform)
  – CIPRES (computation)
  – Gates Integrated Breeding Platform (hosting, development)
  – Galaxy (storage, for now)
  – CoGE (authentication, hosting)
  – TAIR

Using the iPlant CI as a Foundation

CoGE BioExtract

Galaxy TAIR?

IBP

Science Success Stories

XSEDE Direct Access Program

• Diverse institutions
  – Cornell, Iowa State University, , JCVI, Penn State University, CSIRO, Purdue, more
• Massive scale computing
  – >7000 HPC jobs; >1.5 million hours
  – >500 were on machines with 1 TB RAM
  – >2000 were run on >128 processors
• Wide impact
  – Genome and transcriptome assembly
  – Functional annotation
  – Phylogeny reconstruction
  – Genetic simulation
  – New HPC algorithm development

The Texas Advanced Computing Center: A World Leader in High Performance Computing

Providing the Computation and Storage Foundation for iPlant and iPlant Partners

Ranger: 62,976 Processor Cores, 123TB RAM, 579 TeraFlops, Fastest Open Science Machine in the World, 2008

Lonestar: 23,000 processors, 44TB RAM, Shared Mem and GPU subsystems, #25 in the world 2011

Stampede: Fastest in the World 2012? Somewhere around half a million processor cores with Intel Sandy Bridge and Intel MIC, Dell: >10 Petaflops.

Data Storage Systems @ TACC

• Corral: More than 6 Petabytes of online, replicated disk storage.
• Ranch: 100 Petabytes of tape storage (16 drives, 8 robots).
• Permanent collections, long term archiving, high speed scratch, relational, flat file, etc.

Biological Range Maps

Objective: Compute range maps for >120k species in the Botanical Information and Ecology Network (BIEN) database

• This is Big Data
  – Over 120k species
  – Multiple approaches required: Maximum entropy, convex hull, one and two-point algorithms
  – 11 map products (convex hull, latitude extent, etc.)
  – Estimated >400 days of desktop computation
• Using TACC Longhorn: 6 hours for ~72k species test
  – Scale-up and increase in efficiency underway

CIPRES Portal Federation

Serious Expansion of Discovery Environment Tool List

• GWAS/QTL/Genotyping By Sequencing
  – QTLCartographer, TASSEL, and more
• Bioinformatic analysis
  – BEDTools, SAMTools, EMBOSS, and more
• MetaGenomics
  – Assembly, Functional annotation, Phylogenetic profiling
• De novo assembly
  – Velvet, SOAPdenovo, Newbler, Trinity, and more
  – Powered by XSEDE systems Lonestar and Blacklight

The iPlant Collaborative

Project Atmosphere™: Custom Cloud Computing

• API-compatible implementation of Amazon EC2/S3 interfaces
• Virtualize the execution environment for applications and services
• Up to 12 core / 48 GB instances
• Access to Cloud Storage + EBS
• Big data and the desktop are co-local again
  – Bring your data to Atmosphere VM for interactive access and analysis
  – Send it back to the DE for transactional analysis

>60 hosted applications in Atmosphere today, including users from USDA, Forest Service, database providers, etc. (30 more for postdocs and grad students for training classes)

Atmosphere: Motivations

• Standalone GUI-based applications are frequently required for analysis
• GUI apps are not easy to transform into web apps
• Need to handle complex software dependencies (e.g., a specific BioPerl version and R modules)
• Users need full control of their software stack (occasional sudo access)
• Need to share desktop/applications for collaborative analysis (remote collaborators)

Atmosphere: What is it?

• Self-service cloud infrastructure
• Designed to make underlying cloud infrastructure easy to use by novice users
• Built on open source Eucalyptus
• Fully integrated into iPlant authentication and storage and HPC capabilities
• Enables users to build custom images/appliances and share with community
• Cross-platform desktop access to GUI applications in the cloud (using VNC)
• Provide easy web based access to resources

Atmosphere: Launch a new VM

Atmosphere: Access a running VM

Atmosphere: Log in via shell

Stop doing this…

tophat -r 160 -o top_SRR027863-65 ../../../reference/hg19 SRR027863_1.fastq,SRR027864_1.fastq,SRR027865_1.fastq SRR027863_2.fastq,SRR027864_2.fastq,SRR027865_2.fastq
tophat -r 160 -o top_SRR027866-67 ../../../reference/hg19 SRR027866_1.fastq,SRR027867_1.fastq SRR027866_2.fastq,SRR027867_2.fastq
cufflinks -o cuff_SRR027863-65 top_SRR027863-65/accepted_hits.bam
cufflinks -o cuff_SRR027866-67 top_SRR027866-67/accepted_hits.bam
cuffmerge -s ../../../reference/hg19.fa assemblies.txt
cuffdiff merged_asm/merged.gtf top_SRR027863-65/accepted_hits.bam top_SRR027866-67/accepted_hits.bam

Start doing this…

Questions?

