The iPlant Collaborative: A Cloud Infrastructure for Plant Biology
Dan Stanzione
Virtual Cloud Summer School, August 2012

What is iPlant?
• The iPlant Cyberinfrastructure Collaborative is building a comprehensive informatics infrastructure for plant biology.
• (and lately, some animals as well).
• This rapidly evolving infrastructure is sometimes very visible in your work, and sometimes hides in the background.

The iPlant Collaborative Cyberinfrastructure Philosophy
We have designed iPlant to be consistent with the pillars of CIF21*
✓ High Performance Computing
✓ Data and Data Analysis
✓ Virtual Organization
✓ Learning and Workforce

Why iPlant?
• You’ve heard about KBASE already today from Rick Stevens (and we have some similarities), so I’ll skip the heavy biology.
• Just a few slides on why everyone should care specifically about an infrastructure for studying plants…

Cereal Consumption
Rice in Asia = 0.9-1.1 lb/day/person
Corn in US = ~3.3 lb/day/person*
Wheat in Europe = 0.8-1.2 lb/day/person
* Includes entire food chain

World Population Projections
[Figure: world population projections, 1950 to 2100, for the UN High, UN Medium, and UN Low scenarios; vertical axis from 0 to 20]

Land Requirements
• Pre-Agriculture (>10k years ago)
  – 1 Person Required 6,000 Acres
• Current Demand:
  – 1 Person Requires 1/2 Acre
• Predicted Demand (Year 2050):
  – 1 Person Will Have < 1/3 Acre

Moore’s law for corn
Microprocessors aren’t the only thing that doubles periodically: Farm technology has doubled the corn per acre every 25 years since 1935 (8x improvement in 75 years).
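The doubling claim can be checked with a few lines of arithmetic. This is a minimal sketch that assumes smooth exponential growth with a fixed doubling period; the function names are illustrative and not part of any iPlant tool.

```python
def yield_multiplier(years, doubling_period=25.0):
    """Growth factor after `years` if yield doubles every `doubling_period` years."""
    return 2.0 ** (years / doubling_period)


def annual_growth_rate(doubling_period=25.0):
    """Constant yearly growth rate implied by a given doubling period."""
    return 2.0 ** (1.0 / doubling_period) - 1.0


# Three doublings fit into the 75 years after 1935.
print(yield_multiplier(75))

# Implied steady growth, in percent per year.
print(round(annual_growth_rate() * 100, 1))
```

Three doublings give the 8x figure quoted on the slide, and a 25-year doubling period corresponds to steady growth of a little under 3% per year.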
[Figure: Corn Production by Year, Bushels per Acre, 1866-Present. Annotations: yield was flat for centuries before 1900; increased use of hybrids; biotech crops.]
Source: USDA Quick Stats Database

Wild versus Early Domesticated Corn
[Figure: Zea mays ssp. parviglumis beside Zea mays ssp. mays]
Domestication has altered Modern Crops

Yield Gains Are Slowing (Percent change)
[Figure: percent change in yield by decade (1960s, 1970s, 1980s, 1990s), for World and for World excluding Transition Countries; vertical axis from 0 to 30]

Food Prices and Political Instability
• Data strongly correlate commodity food prices with revolutions (dating to at least 1792 – “Let them Eat Cake”).
• When prices spiked in 2006-2007, there were food riots around the world, notably in Egypt.
  – Recession brought prices temporarily down.
• The Nomura Food Vulnerability Index in September 2010 rated Egypt, Libya, and Tunisia near the top as candidates for food-related unrest.
• Commodity food prices ran up 40% after this report, in the months leading up to the “Arab Spring”.

Trends & MegaTrends
• Western Diet going Global
• Crops used for Bio-Energy
• Population - Growing & Aging (~9 billion by 2050)
• Agricultural Acreage Decreasing
• Climate Changing
• Prices Increasing
  – Leading to political instability…

Trends Indicate Importance of Understanding Plant Biology

What Do These Genes Do?
[Figure: genes with experimentally-demonstrated function as a share of total genes]
E. coli: 62%
Yeast: ~30%
Arabidopsis: ~15%
Rice or maize: ~1%

Keeping up with Science
Publications in PubMed (May 10th, 2010)
United States: 7.6 million
England: 2.9 million
Germany: 1.2 million
Japan: 606 thousand
France: 491 thousand
Korea: 243 thousand
China: 227 thousand
India: 116 thousand

Published globally in 2009: 846K (>2,300 pubs per day)
How many papers can anyone read in a day?

Biologists Became Developers
Result: An ad hoc software ecosystem
1. Tools separated by compute platform, data format, integration issues, and programming model.
2. Mixture of desktop, command line, database, and web-based tools
3. Labor intensive, fragile solutions devised to reach scientific objectives
4. Little ability to share results, analytical methods
5. Lack of reproducibility

We’ve established…
• Lots of work to do in plant science to meet the global demand for food (and maybe energy, air, pharma).
• Most of this work will be driven by data and computation. Big Data!

Data-intensive biology will mean getting biologists comfortable with new technology…

One key goal in our infrastructure, training and outreach is to minimize the emphasis on technology and return the focus to the biology.
[Figure: 1973, Sharp, Sambrook, Sugden: Gel Electrophoresis Chamber, $250. 1958, Matt Meselson & Ultracentrifuge, $500,000]

What does iPlant Provide
• DATA
  – iPlant Data Storage: All data large and small
• COMPUTING:
  – Large Scale: Up to hundreds of thousands of processors
  – Virtual: “Cloud Style” server hosting
• A Programmer’s Interface
  – Easily embed iPlant resources in your applications
• User Interfaces
  – The Discovery Environment: Integrated Web apps.
  – More than 200 bioinformatics applications
  – MyPlant, DNASubway, TNRS, TreeViewer, PhytoBisque, etc

The iPlant Cyberinfrastructure
[Figure: iPlant cyberinfrastructure diagram; labels: End Users, Computational Users, TeraGrid, XSEDE]

iPlant as a Cloud
• The design notions of iPlant (5 years ago, before “Cloud” was the cool word) all fit the cloud philosophy:
  – Data and computation are (already?) too big for the desktop.
  – Store in remote datacenters, put the compute there.
  – Download nothing, all interfaces assume “run elsewhere”.
  – Don’t worry about client apps
  – Underlie applications with web-based APIs wherever possible.

iPlant Cloud Services
• The iPlant data store is “cloud-based”
  – Coherent, huge, remote, lots of interfaces.
• The Discovery Environment is the web-based portal to analysis on remote resources, including the original cloud: supercomputer centers.
• The Atmosphere platform provides user control of “traditional” cloud services (virtual machines a la EC2).

Ways to Access iPlant
• Atmosphere: a free cloud computing platform
• Data Store: secure, cloud-based data storage
• Discovery Environment: a web portal to many integrated applications
• DNA Subway: genome annotation, DNA bar-coding (and more) for science educators
• The API: For programmers embedding iPlant infrastructure capabilities
• Command line: for expert access (thru TeraGrid/XSEDE)

The iPlant Discovery Environment
• A rich web client
  – Consistent interface to bioinformatics tools
  – Portal for users who won’t want to interact with lower level infrastructure
• An integrated, extensible system of applications and services
  – Additional intelligence above low level APIs
  – Provenance, Collaboration, etc.

Workflows within the DE; Phylogenetics

Trees also present computational challenges
It can take weeks or months to analyze data sets with >100,000 species.

Example of iPlant contribution: NINJA/WINDJAMMER (Neighbor-Joining)
  – NINJA: 216K species, ~8 days
  – WINDJAMMER: 216K species, ~4 hours
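Going from about 8 days (192 hours) to about 4 hours is roughly a 48-fold speedup. For readers who have not met the method, the sketch below shows the textbook neighbor-joining procedure on a tiny distance matrix. It is the classic cubic-time formulation, not the optimized NINJA or WINDJAMMER code, and the function name, input layout, and `node0`, `node1` labels are assumptions made for illustration.

```python
from itertools import combinations


def neighbor_joining(dist):
    """Textbook neighbor joining.

    dist maps each taxon to a dict of distances to every other taxon
    (symmetric, no self-distances needed). Returns a list of
    (node, neighbour, branch_length) edges of the unrooted tree.
    Internal nodes are named node0, node1, ... (hypothetical labels).
    """
    dist = {a: dict(row) for a, row in dist.items()}  # work on a copy
    edges = []
    next_id = 0
    while len(dist) > 2:
        n = len(dist)
        totals = {a: sum(dist[a][b] for b in dist if b != a) for a in dist}
        # Q criterion: join the pair minimizing (n - 2) * d(i, j) - r_i - r_j.
        i, j = min(
            combinations(dist, 2),
            key=lambda p: (n - 2) * dist[p[0]][p[1]] - totals[p[0]] - totals[p[1]],
        )
        d_ij = dist[i][j]
        len_i = d_ij / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
        len_j = d_ij - len_i
        u = f"node{next_id}"
        next_id += 1
        edges += [(i, u, len_i), (j, u, len_j)]
        # Distances from the new internal node to every remaining node.
        new_row = {
            k: (dist[i][k] + dist[j][k] - d_ij) / 2
            for k in dist
            if k not in (i, j)
        }
        del dist[i], dist[j]
        for k in dist:
            del dist[k][i], dist[k][j]
            dist[k][u] = new_row[k]
        dist[u] = new_row
    a, b = dist
    edges.append((a, b, dist[a][b]))
    return edges


# Small additive example with five taxa.
example = {
    "a": {"b": 5, "c": 9, "d": 9, "e": 8},
    "b": {"a": 5, "c": 10, "d": 10, "e": 9},
    "c": {"a": 9, "b": 10, "d": 8, "e": 7},
    "d": {"a": 9, "b": 10, "c": 8, "e": 3},
    "e": {"a": 8, "b": 9, "c": 7, "d": 3},
}
tree_edges = neighbor_joining(example)
for edge in tree_edges:
    print(edge)
```

Each round scans every pair of remaining nodes, so the work grows with the cube of the number of species. That cost is why inputs in the hundreds of thousands of species need specialized implementations.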
Scalable Computation for High-Throughput Inquiry

High Speed Data Intake
• 90,000 Compute Cores
• Up to 1TB shared memory
• Growing to ~500,000 cores by end of 2012

[Figure: assembly and annotation pipeline. Stages: Trinity, SOAPdenovo, and ABySS assembly; Contig Assessment; BLASTX Assessment; Hybrid Assembly via AMOS; Contig Assessment; Translation and filtering; BLASTX Assessment; InterproScan. Resources: TACC Lonestar, TACC Ranger, PSC Blacklight, TACC Corral, EBI Web Services]

Case Study: Community Extensibility
iPlant Collaborator Lin Wang (BTI/Cornell) is interested in how genes are regulated along a developmental gradient and under circadian control, and in which types of regulatory factors (cis and trans) modulate these processes. His ultimate goal is to use the developmental/circadian dataset as a guide and resource for increasing crop yield through genetic engineering.
Tuxedo package has limited support for strand-specific libraries
BWA aligner has higher sensitivity (2-5%) plus native gapped alignment support
Solution: Develop a customized, cross-validated pipeline for plant RNAseq
“How can I share my code with the plant science community?”
Integrating community-authored code into a cyberinfrastructure

Take-home Messages
Lin’s code integrated into the iPlant DE in 45 minutes of work, including testing, with no significant changes needed to his scripts
Multiple RNAseq pipelines in the DE represent a marketplace of ideas
iPlant CI as a platform for publishing functional software instances
Case Study: Draft assembly of H. texanum transcriptome
• Part of OneKP project
• 10.9 million 100 bp paired-end RNA reads
• Data housed with iPlant since project inception
• Emblematic of the new class of questions people are working on

A Workflow
• Paired-end assembly using ABySS 1.2.7 on TACC Ranger
• Assess contig metrics (N50, etc)
• Filter contigs by size (150 bp)
• BLASTX against Plant RefSeq on TACC Ranger
• Compile annotation results on TACC Ranger
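Two of the workflow steps above, contig assessment and size filtering, are simple enough to sketch directly. This is a minimal illustration and not the code used in the case study. The function names are hypothetical, and reading "Filter contigs by size (150 bp)" as "keep contigs of at least 150 bp" is an assumption.

```python
def n50(lengths):
    """N50: the length L such that contigs of length >= L hold at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("N50 is undefined for an empty assembly")


def filter_by_size(contigs, min_length=150):
    """Keep contigs whose sequence is at least `min_length` bases long (assumed threshold direction)."""
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_length}


# Toy data: contig name -> sequence.
contigs = {
    "contig1": "A" * 500,
    "contig2": "C" * 400,
    "contig3": "G" * 149,
    "contig4": "T" * 150,
}
kept = filter_by_size(contigs)
print(sorted(kept))
print(n50([len(seq) for seq in kept.values()]))
```

N50 is reported alongside the filter because dropping short contigs changes the total base count and therefore the metric, so the two steps are usually assessed together.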
A Workflow
>Time to break out the terminal emulator, right?

Draft assembly of H. texanum transcriptome

The DNA Subway

The iPlant Data Store
Fast data transfers via parallel, non-TCP file transfer
• Move large (>2 GB) files with ease

Multiple, consistent access modes
• iPlant API
• iPlant web apps
• Desktop mount (FUSE/DAV)
• Java applet (iDrop)
• Command line

Fine-grained ACL permissions
• Sharing made simple

Access and a storage allocation are automatic with your iPlant account

Powered by iPlant
• The iPlant CI is designed as infrastructure. This means it is a platform upon which other projects can build.
• Use of the iPlant infrastructure can take one of several forms:
  – Storage
  – Computation
  – Hosting
  – Web Services
  – Scalability
Powered by iPlant
• Other major projects are beginning to adopt the iPlant CI as their underlying infrastructure (some completely, some in limited ways):
  – BioExtract (web service platform)
  – CIPRES (computation)
  – Gates Integrated Breeding Platform (hosting, development)
  – Galaxy (storage, for now)
  – CoGE (authentication, hosting)
  – TAIR

Using the iPlant CI as a Foundation
[Figure: projects using the iPlant CI as a foundation: CoGE, BioExtract, Galaxy, TAIR?, IBP]

Science Success Stories
XSEDE Direct Access Program
• Diverse institutions
  – Cornell, Iowa State University, University of Florida, JCVI, Penn State University, CSIRO, Purdue, more
• Massive scale computing
  – >7000 HPC jobs; >1.5 million hours
  – >500 were on machines with 1 TB RAM
  – >2000 were run on >128 processors
• Wide impact
  – Genome and transcriptome assembly
  – Functional annotation
  – Phylogeny reconstruction
  – Genetic simulation
  – New HPC algorithm development

The Texas Advanced Computing Center: A World Leader in High Performance Computing

Providing the Computation and Storage Foundation for iPlant and iPlant Partners

Ranger: 62,976 Processor Cores, 123TB RAM, 579 TeraFlops, Fastest Open Science Machine in the World, 2008
Lonestar: 23,000 processors, 44TB RAM, Shared Mem and GPU subsystems, #25 in the world 2011
Stampede: Fastest in the World 2012? Somewhere around half a million processor cores with Intel Sandy Bridge and Intel MIC, Dell: >10 Petaflops.

Data Storage Systems @ TACC
• Corral: More than 6 Petabytes of online, replicated disk storage.
• Ranch: 100 Petabytes of tape storage (16 drives, 8 robots).
• Permanent collections, long term archiving, high speed scratch, relational, flat file, etc.

Biological Range Maps
Objective: Compute range maps for >120k species in the Botanical Information and Ecology Network (BIEN) database
• This is Big Data
  – Over 120k species
  – Multiple approaches required: Maximum entropy, convex hull, one and two-point algorithms
  – 11 map products (convex hull, latitude extent, etc)
  – Estimated >400 days of desktop computation
• Using TACC Longhorn: 6 hours for ~72k species test
  – Scale-up and increase in efficiency underway

CIPRES Portal Federation

Serious Expansion of Discovery Environment Tool List
• GWAS/QTL/Genotyping By Sequencing
  – QTLCartographer, TASSEL, and more
• Bioinformatic analysis
  – BEDTools, SAMTools, EMBOSS, and more
• MetaGenomics
  – Assembly, Functional annotation, Phylogenetic profiling
• De novo assembly
  – Velvet, SOAPdenovo, Newbler, Trinity, and more
  – Powered by XSEDE systems Lonestar and Blacklight
The iPlant Collaborative Project Atmosphere™: Custom Cloud Computing
• API-compatible implementation of Amazon EC2/S3 interfaces
• Virtualize the execution environment for applications and services
• Up to 12 core / 48 GB instances
• Access to Cloud Storage + EBS
• Big data and the desktop are co-local again
  – Bring your data to Atmosphere VM for interactive access and analysis
  – Send it back to the DE for transactional analysis

>60 hosted applications in Atmosphere today, including users from USDA, Forest Service, database providers, etc. (30 more for postdocs and grad students for training classes)

Atmosphere: Motivations
• Standalone GUI-based applications are frequently required for analysis
• GUI apps are not easy to transform into web apps
• Need to handle complex software dependencies (e.g. specific BioPerl version and R modules)
• Users needing full control of their software stack (occasional sudo access)
• Need to share desktop/applications for collaborative analysis (remote collaborators)
Atmosphere: What is it?
• Self-service cloud infrastructure
• Designed to make the underlying cloud infrastructure easy for novice users to use
• Built on open source Eucalyptus
• Fully integrated into iPlant authentication and storage and HPC capabilities
• Enables users to build custom images/appliances and share with community
• Cross-platform desktop access to GUI applications in the cloud (using VNC)
• Provide easy web based access to resources

Atmosphere: Launch a new VM

Atmosphere: Access a running VM

Atmosphere: Log in via shell

Stop doing this…
    tophat -r 160 -o top_SRR027863-65 ../../../reference/hg19 SRR027863_1.fastq,SRR027864_1.fastq,SRR027865_1.fastq SRR027863_2.fastq,SRR027864_2.fastq,SRR027865_2.fastq
    tophat -r 160 -o top_SRR027866-67 ../../../reference/hg19 SRR027866_1.fastq,SRR027867_1.fastq SRR027866_2.fastq,SRR027867_2.fastq
    cufflinks -o cuff_SRR027863-65 top_SRR027863-65/accepted_hits.bam
    cufflinks -o cuff_SRR027866-67 top_SRR027866-67/accepted_hits.bam
    cuffmerge -s ../../../reference/hg19.fa assemblies.txt
    cuffdiff merged_asm/merged.gtf top_SRR027863-65/accepted_hits.bam top_SRR027866-67/accepted_hits.bam
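Part of what makes the commands above error-prone is that all six are fully determined by two lists of run accessions. The sketch below rebuilds the same command lines from those lists. It only formats strings and does not run TopHat or Cufflinks; the helper names and the labelling rule (first accession plus the last two digits of the last one) are assumptions inferred from the directory names on the slide.

```python
REFERENCE = "../../../reference/hg19"  # reference path exactly as written on the slide


def group_label(runs):
    """Label such as 'SRR027863-65': first accession plus the last two digits of the last."""
    return f"{runs[0]}-{runs[-1][-2:]}"


def tophat_command(runs, inner_distance=160):
    """TopHat call for one sample group, with comma-joined mate-1 and mate-2 files."""
    label = group_label(runs)
    left = ",".join(f"{run}_1.fastq" for run in runs)
    right = ",".join(f"{run}_2.fastq" for run in runs)
    return f"tophat -r {inner_distance} -o top_{label} {REFERENCE} {left} {right}"


def cufflinks_command(runs):
    """Cufflinks call on the alignments produced for one sample group."""
    label = group_label(runs)
    return f"cufflinks -o cuff_{label} top_{label}/accepted_hits.bam"


def cuffdiff_command(groups):
    """Cuffdiff call comparing the alignments of all sample groups."""
    bams = " ".join(f"top_{group_label(runs)}/accepted_hits.bam" for runs in groups)
    return f"cuffdiff merged_asm/merged.gtf {bams}"


def pipeline_commands(groups):
    """All command lines in the order shown on the slide."""
    commands = [tophat_command(runs) for runs in groups]
    commands += [cufflinks_command(runs) for runs in groups]
    commands.append(f"cuffmerge -s {REFERENCE}.fa assemblies.txt")
    commands.append(cuffdiff_command(groups))
    return commands


GROUPS = [
    ["SRR027863", "SRR027864", "SRR027865"],
    ["SRR027866", "SRR027867"],
]

for command in pipeline_commands(GROUPS):
    print(command)
```

Generating the commands removes the typing, but the paths, outputs, and provenance still have to be tracked by hand.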
Start doing this…

Questions?
[email protected] [email protected]