Bioinformatics Resource Centers (BRCs)
Fondation Merieux – J Craig Venter Institute Bioinformatics Workshop
December 5 – 8, 2017
Module 3: Genomic Data & Sequence Annotations in Public Databases

NIH/NIAID Genomics and Bioinformatics Program
Slide source: A.S. Fauci
Conducts and supports basic and applied research to better understand, treat, and ultimately prevent infectious, immunologic, and allergic diseases.

NIAID Genomics Program
[Diagram: program areas (Sequencing, Functional Genomics, Structural Genomics, Proteomics, Systems Biology, Bioinformatics) and their centers (Genomic Sequencing Centers, Functional Genomics Research Centers, Structural Genomics Centers, Clinical Proteomics Centers, Systems Biology Centers, Bioinformatics Resource Centers)]
Genomic Research Resources: genomic/omics data sets, databases, bioinformatics tools, biomarkers, 3D structures, protein clones, predictive models
To address key questions in microbiology and infectious disease

NIAID Genome Sequencing Center
Influenza Genome Sequencing Project at JCVI
• 2004: 80 influenza genomes in GenBank
• 3 OCT 2017: ~20,000 influenza genomes sequenced at JCVI
• 75% of complete influenza genomes in GenBank contributed by JCVI
Slide source: Maria Giovanni

• Genome Sequencing Centers
• Bioinformatics Resource Centers (BRCs) *
• Systems Biology Centers
• Structural Genomics Centers
• Clinical Proteomics Centers
Courtesy of Alison Yao, DMID

* Bioinformatics Resource Centers (BRCs)
Goal: Provide integrated bioinformatics resources in support of basic and applied infectious diseases research
• Data and metadata management and integration solutions
• Computational analysis and visualization tools
• Work spaces and web interfaces
• Training and outreach activities
• Free bioinformatics services
• Rapid response to new and emerging pandemic threats
Courtesy of Alison Yao, DMID

Influenza Research Database (IRD)
www.fludb.org
• Comprehensive, integrated database about influenza virus research and surveillance
• Funded through the U.S. NIH, specifically the National Institute of Allergy and Infectious Diseases (NIAID)
• Free and open access with no restrictions
• Focus on data curation, aggregation, integration and novel data generation
• Suite of analysis and visualization tools
• Personal workbench areas
• Developed by a team of research scientists, bioinformaticians and professional software developers
• Cited in 569 scientific publications (as of 7NOV2017)
• 1484 sessions per week (Google Analytics - 2016 average)
• 3.7 million sequences downloaded per month

Virus Pathogen Resource (ViPR)
www.viprbrc.org
• Cited in 244 scientific publications (as of 7NOV2017)
• 1638 sessions per week (Google Analytics - 2016 average)

Bacterial Bioinformatics Resource Center

Sequence Annotations
Database Data Available

Enriched Sequence Annotations: IRD/ViPR vs GenBank?
• metadata standardization
• genomic sequence curation
  ◦ potential sequencing artifacts
• protein prediction
  ◦ influenza virus: variant proteins
  ◦ polyprotein-generating viruses: mature peptides
• important protein region prediction
  ◦ Sequence Features (phenotype markers), domains, epitopes
• genome browser
• virus classification
  ◦ clade, subtype, genotype
[Figure panels: metadata standardization, protein prediction, clade classification]

Metadata Standardization

Influenza Sequence Auto-curation
curated reference alignment
• captures the natural variation in the appropriate type/segment/subtype category
flags potential sequencing artifacts:
• Conserved Terminal Sequences (CTS)
• Non-Coding Regions (NCR)
• Coding Sequence (CDS)

[Screenshot, 11/9/2017: Influenza Research Database - Nucleotide Sequence Search Results. Working Set: auto_curation_examples-Segment; Data Type: Segment; Created: 11/09/2017; Modified: 11/09/2017; Access: Private]
IRD uses an automated pipeline to detect potential sequence artifacts or poor quality by aligning influenza virus nucleotide sequences submitted to the resource to a curated profile of like sequences. The pipeline sets flags indicating the category and location of artifacts, or the type of poor quality sequence. These flags are summarized in sequence search results and working sets as follows:
• AmbigSeq: excessive Ns or ambiguity symbols, or insufficient similarity to the profile
• FlagNCR: issues only in noncoding regions
• FlagCDS: issues in CDS, possibly also in NCR
• Pass: no issues
See SOP for further details.
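The flag summary described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not IRD's actual pipeline: the function name `summarize_flag`, the `Region` enum, and the `ambiguous` argument are hypothetical, and two points are assumptions not confirmed by the slides: that AmbigSeq takes precedence over the other flags, and that issues in the Conserved Terminal Sequences count as noncoding-region issues.

```python
from enum import Enum


class Region(Enum):
    """Regions in which the pipeline can report an issue (names from the slides)."""
    CTS = "CTS"  # Conserved Terminal Sequences
    NCR = "NCR"  # Non-Coding Regions
    CDS = "CDS"  # Coding Sequence


def summarize_flag(issue_regions, ambiguous=False):
    """Collapse per-region issues into one summary flag.

    issue_regions: iterable of Region values where an issue was found.
    ambiguous: True for excessive Ns/ambiguity symbols or insufficient
               similarity to the profile.
    """
    # Assumption: a poor-quality sequence is reported as AmbigSeq
    # regardless of where other issues were located.
    if ambiguous:
        return "AmbigSeq"
    regions = set(issue_regions)
    # Any issue in the CDS gives FlagCDS, even if the NCR is also affected.
    if Region.CDS in regions:
        return "FlagCDS"
    # Assumption: CTS issues are treated as noncoding-region issues.
    if regions:
        return "FlagNCR"
    return "Pass"


if __name__ == "__main__":
    print(summarize_flag([]))                        # no issues
    print(summarize_flag([Region.NCR]))              # noncoding only
    print(summarize_flag([Region.NCR, Region.CDS]))  # coding and noncoding
    print(summarize_flag([], ambiguous=True))        # poor-quality sequence
```

Checking for CDS issues before noncoding ones mirrors the wording "issues in CDS, possibly also in NCR": the coding-region flag wins whenever both are present.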
Generate Phylogenetic Tree

The "Quick Tree" option uses PhyML [Guindon, S. and Gascuel, O. (2003) Syst Biol. 52: 696-704] and IRD-defined settings to infer phylogenies based on sequences for datasets of at most 1000 sequences.

The "Custom Tree" option offers a choice between the PhyML or RAxML [Stamatakis, A. et al. (2005) Bioinformatics 21: 456-463] algorithms, and the ability to define parameter settings. The Custom Tree option, using RAxML, must be used for datasets exceeding 1000 sequences.

IRD provides a tutorial on generating a phylogenetic tree using IRD tools.
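The size rule for the Quick Tree and Custom Tree options can be expressed as a small helper. This is a sketch of the rule as worded in the IRD description, not code from IRD; the function name `allowed_tree_options` and the constant name are hypothetical.

```python
# Size limit stated in the IRD description of the Quick Tree option.
QUICK_TREE_MAX_SEQUENCES = 1000


def allowed_tree_options(n_sequences):
    """Return the (option, algorithm) pairs usable for a dataset of this size."""
    if n_sequences <= 0:
        raise ValueError("a tree needs at least one sequence")
    if n_sequences <= QUICK_TREE_MAX_SEQUENCES:
        # Quick Tree always uses PhyML with IRD-defined settings;
        # Custom Tree offers a choice between PhyML and RAxML.
        return [
            ("Quick Tree", "PhyML"),
            ("Custom Tree", "PhyML"),
            ("Custom Tree", "RAxML"),
        ]
    # Above the limit, only Custom Tree with RAxML is permitted.
    return [("Custom Tree", "RAxML")]


if __name__ == "__main__":
    for size in (4, 1000, 1001):
        print(size, allowed_tree_options(size))
```

For the four-segment working set used in this module, any of the three combinations applies; the restriction only matters once a dataset exceeds 1000 sequences.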
Working set records (4 items selected; sorted by Strain Name in ascending order)

Segment  Protein Name  Sequence Accession  Complete Genome  Segment Length  Subtype  Collection Date  Host Species    Country  Strain Name                         Curation Report (SOP)
4        HA            CY191675            Yes              1708            H7N9     2013             Unknown         China    A/Anhui/1YK_RG03/2013               Pass
4        HA            KF297293            No               1683            H7N9     2013             Chicken/Avian   China    A/chicken/Rizhao/719b/2013          AmbigSeq
4        HA            MF357804            Yes              1726            H7N9     03/13/2017       *Chicken/Avian  USA      A/chicken/Tennessee/170081521/2017  FlagCDS
4        HA            KF226105            Yes              1723            H7N9     04/20/2013       Human           China    A/Jiangsu/2/2013                    FlagNCR

ANALYSIS NAME

TREE GENERATION
• Quick Tree (Let IRD set all parameters)
• Custom Tree (for setting of custom parameters and for large datasets)

INPUT
4 segments selected for tree
Sequence Type: NA
For accuracy and efficiency, IRD recommends using our precomputed alignments in analyses. Prealignments are available for records with PASS flags, or with flags in the CTS or NCR only.
Your selected data includes 4 records.

This project is funded by the National Institute of Allergy and Infectious Diseases (NIH / DHHS) under Contract No. HHSN272201400028C and is a collaboration between Northrop Grumman Health IT, J. Craig Venter Institute, and Vecna Technologies.

Leverage the auto-curation results for sequence analysis in IRD
• use edited version
• NCR-ext: trimmed off superfluous bases
• NCR-del,