<<

Fondation Merieux – J Craig Venter Institute Bioinformatics Workshop

December 5 – 8, 2017

Module 3: Genomic Data & Sequence Annotations in Public Databases

NIH/NIAID Genomics and Bioinformatics Program

NIAID conducts and supports basic and applied research to better understand, treat, and ultimately prevent infectious, immunologic, and allergic diseases.
Slide source: A. S. Fauci

NIAID Genomics Program
• Genomic Sequencing Centers
• Functional Genomics Research Centers
• Structural Genomics Centers
• Clinical Proteomics Centers
• Systems Biology Centers

Bioinformatics Resource Centers

Genomic Research Resources
Genomic/omics data sets, databases, bioinformatics tools, biomarkers, 3D structures, protein clones, predictive models

To address key questions in [...] and infectious disease

NIAID Genome Sequencing Center

Influenza Genome Sequencing Project at JCVI
• 2004: 80 influenza genomes in GenBank
• 3OCT2017: ~20,000 influenza genomes sequenced at JCVI
• 75% of the complete influenza genomes in GenBank come from JCVI
Slide source: Maria Giovanni

• Genome Sequencing Centers
• Bioinformatics Resource Centers (BRCs)
• Systems Biology Centers
• Structural Genomics Centers
• Clinical Proteomics Centers

Courtesy of Alison Yao, DMID

Bioinformatics Resource Centers (BRCs)

Goal: Provide integrated bioinformatics resources in support of basic and applied infectious diseases research
• Data and metadata management and integration solutions
• Computational analysis and visualization tools
• Work spaces and web interfaces
• Training and outreach activities
• Free bioinformatics services
• Rapid response to new and emerging pandemic threats

Courtesy of Alison Yao, DMID

Influenza Research Database (IRD)

www.fludb.org
• Comprehensive, integrated database about influenza virus research and surveillance
• Funded through the U.S. NIH, specifically the National Institute of Allergy and Infectious Diseases (NIAID)
• Free and open access with no restrictions
• Focus on data curation, aggregation, integration and novel data generation
• Suite of analysis and visualization tools
• Personal workbench areas
• Developed by a team of research scientists, bioinformaticians and professional software developers
• Cited in 569 scientific publications (as of 7NOV2017)
• 1484 sessions per week (Google Analytics - 2016 average)
• 3.7 million sequences downloaded per month

Virus Pathogen Resource (ViPR)

www.viprbrc.org
• Cited in 244 scientific publications (as of 7NOV2017)
• 1638 sessions per week (Google Analytics - 2016 average)

Bacterial Bioinformatics Resource Center

Sequence Annotations

Database Data Available

Enriched Sequence Annotations

IRD/ViPR vs GenBank?
• metadata standardization
• genomic sequence curation
  ◦ potential sequencing artifacts
• protein prediction
  ◦ influenza virus: variant proteins
  ◦ polyprotein-generating viruses: mature peptides
• important protein region prediction
  ◦ Sequence Features (phenotype markers), domains, epitopes
• genome browser
• virus classification
  ◦ clade, subtype, genotype

[Figure labels: metadata standardization, protein prediction, clade classification]

Metadata Standardization

Influenza Sequence Auto-curation

curated reference alignment
• captures the natural variation in the appropriate type/segment/subtype category

flags potential sequencing artifacts:
• Conserved Terminal Sequences (CTS)
• Non-Coding Regions (NCR)
• Coding Sequence (CDS)
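IRD's search results condense these region-level checks into four summary flags (Ambig-Seq, Flag-NCR, Flag-CDS, Pass). The following is a minimal sketch of how such a summary could be derived. The function name, the input representation, and the precedence order (ambiguity first, then CDS, then NCR/CTS) are assumptions made for illustration; they are not taken from the IRD pipeline or its SOP.

```python
from typing import Iterable

def summarize_flags(regions_with_issues: Iterable[str], ambiguous: bool = False) -> str:
    """Collapse per-region findings into one summary flag.

    regions_with_issues: region names ("CTS", "NCR", "CDS") where a potential
    artifact was found. `ambiguous` marks excessive Ns/ambiguity symbols or
    insufficient similarity to the profile.
    The precedence used here is an assumption, not IRD's documented logic.
    """
    regions = {r.upper() for r in regions_with_issues}
    if ambiguous:
        return "Ambig-Seq"
    if "CDS" in regions:
        # Issues in the CDS, possibly also in non-coding regions.
        return "Flag-CDS"
    if regions & {"NCR", "CTS"}:
        # Issues only outside the coding sequence (CTS treated as non-coding here).
        return "Flag-NCR"
    return "Pass"

if __name__ == "__main__":
    print(summarize_flags([]))                # no issues
    print(summarize_flags(["NCR"]))           # non-coding issue only
    print(summarize_flags(["NCR", "CDS"]))    # coding issue dominates
    print(summarize_flags([], ambiguous=True))
```

Keeping the summary as a single string mirrors the one-column "Curation Report" shown in IRD search results, which makes downstream filtering straightforward.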


[Screenshot: IRD working set "auto_curation_examples" (Data Type: Segment), Nucleotide Sequence Search Results, 11/9/2017]

IRD uses an automated pipeline to detect potential sequence artifacts or poor quality by aligning Influenza virus nucleotide sequences submitted to the resource to a curated profile of like sequences. The pipeline sets flags indicating the category and location of artifacts, or the type of poor quality sequence. These flags are summarized in sequence search results and working sets as follows:
• Ambig-Seq (excessive Ns or ambiguity symbols, or insufficient similarity to the profile)
• Flag-NCR (issues only in non-coding regions)
• Flag-CDS (issues in CDS, possibly also in NCR)
• Pass (no issues)
See SOP for further details.

Segment  Protein  Accession  Complete Genome  Length  Subtype  Collection Date  Host Species    Country  Strain Name                            Curation Report (SOP)
4        HA       CY191675   Yes              1708    H7N9     2013             Unknown         China    A/Anhui/1-YK_RG03/2013                 Pass
4        HA       KF297293   No               1683    H7N9     2013             Chicken/Avian   China    A/chicken/Rizhao/719b/2013             Ambig-Seq
4        HA       MF357804   Yes              1726    H7N9     03/13/2017       *Chicken/Avian  USA      A/chicken/Tennessee/17-008152-1/2017   Flag-CDS
4        HA       KF226105   Yes              1723    H7N9     04/20/2013       Human           China    A/Jiangsu/2/2013                       Flag-NCR

Leverage the auto-curation results for sequence analysis in IRD
• use edited version
  ◦ NCR-ext: trimmed off superfluous bases
  ◦ NCR-del, NCR-ins, CTS-del, CTS-ins, CTS-mut: affected NCR regions are replaced by blanks
• exclude flagged sequences
  ◦ Flag-NCR, Flag-CDS, Ambig-Seq
• include flagged sequences
  ◦ Not recommended

[Screenshot: IRD Generate Phylogenetic Tree page, 11/9/2017]

The "Quick Tree" option uses PhyML [Guindon, S. and Gascuel, O. (2003) Syst Biol. 52: 696-704] and IRD-defined settings to infer phylogenies based on sequences for datasets of at most 1000 sequences. The "Custom Tree" option offers a choice between the PhyML or RaxML [Stamatakis, A. et al. (2005) Bioinformatics 21: 456-463] algorithms, and the ability to define parameter settings. The Custom Tree option, using RaxML, must be used for datasets exceeding 1000 sequences.

For accuracy and efficiency, IRD recommends using its pre-computed alignments in analyses. Pre-alignments are available for records with Pass flags, or with flags in the CTS or NCR only.

Your selected data includes 4 records.
• 1 record has NCR/CTS flags only; edited versions of these records are available for analyses (SOP).
• 2 records have errors in the CDS or other evidence of poor quality; no edited versions of these records are available.

How would you like to proceed?
• (Recommended) Use pre-alignments. Include the edited version of 1 sequence; exclude 2 sequences without edited versions.
• (Not recommended) Include 3 sequences with concerns in the analysis and realign everything with MUSCLE.

Tree tips can be labeled with the Strain Name or with a custom format of up to 4 fields: Strain Name, Accession Number, Date, Country, USA State, Segment, Protein Symbol, Season, SubType, Host Species, 2009 pH1N1-like, Phenotype Markers, US Swine H1 Clade, Global Swine H1 Clade, H5 Clade.

Release Date: Oct 21, 2017. This project is funded by the National Institute of Allergy and Infectious Diseases (NIH / DHHS) under Contract No. HHSN272201400028C and is a collaboration between Northrop Grumman Health IT, J. Craig Venter Institute, and Vecna Technologies.
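The recommended handling of flagged records before tree building can be sketched as a simple partition of the working set. This is an illustrative sketch only: the `Record` class, the function names, and the rule that "Flag-NCR" records are the ones with edited versions are assumptions based on the page text above, not an IRD API. The four records are the ones shown in the working set.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Record:
    accession: str
    strain: str
    flag: str  # one of: Pass, Flag-NCR, Flag-CDS, Ambig-Seq

# The four H7N9 HA records from the example working set.
RECORDS = [
    Record("CY191675", "A/Anhui/1-YK_RG03/2013", "Pass"),
    Record("KF297293", "A/chicken/Rizhao/719b/2013", "Ambig-Seq"),
    Record("MF357804", "A/chicken/Tennessee/17-008152-1/2017", "Flag-CDS"),
    Record("KF226105", "A/Jiangsu/2/2013", "Flag-NCR"),
]

def partition_for_tree(records: List[Record]) -> Tuple[List[Record], List[Record], List[Record]]:
    """Split records into (use as-is, use edited version, exclude)."""
    as_is = [r for r in records if r.flag == "Pass"]
    edited = [r for r in records if r.flag == "Flag-NCR"]
    excluded = [r for r in records if r.flag in {"Flag-CDS", "Ambig-Seq"}]
    return as_is, edited, excluded

def minimal_tree_option(n_sequences: int) -> str:
    """Quick Tree handles at most 1000 sequences; larger sets need Custom Tree with RaxML."""
    return "Quick Tree" if n_sequences <= 1000 else "Custom Tree (RaxML)"

if __name__ == "__main__":
    as_is, edited, excluded = partition_for_tree(RECORDS)
    kept = as_is + edited
    print("kept:", [r.accession for r in kept])
    print("excluded:", [r.accession for r in excluded])
    print("option:", minimal_tree_option(len(kept)))
```

The partition reproduces the counts reported on the page (1 record with an edited version, 2 records excluded). Custom Tree remains available for small datasets as well; the helper only reports the smallest option that can handle the input.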

Influenza Variant Protein Prediction

Mature Peptide Prediction

[Screenshot: ViPR Zika virus Gene/Protein Search, www.viprbrc.org, 11/8/2017]

Search for virus protein/gene sequences and related information. You can also find your strain or genome record if you have its information, such as strain name, accession. To compare Zika virus proteins/genes with those in other members of the Flavivirus genus or Flaviviridae Family, use the gene/protein search for the Flaviviridae family.

• Results matching your criteria: 9,397
• Species: Zika virus (855 strains, 502 complete genomes)
• GenBank: incomplete mature peptide annotations

Virus images courtesy of CDC Public Health Image Library, Wellcome Images, U.S. Department of Veterans Affairs, Science of the Invisible and ViralZone, Swiss Institute of Bioinformatics.

Sequence Feature Annotations

[Screenshot: IRD Segment/Protein Details, A/Chicken/Agam/BBPVI/2005 - HA Hemagglutinin - EU124095, 11/7/2017]

Segment Annotation*1
Protein Name      Complete Sequence  CDS Start  CDS End  Protein Length (aa)  Source
HA Hemagglutinin  Partial            1          1695     565                  -N/A-

Protein Information*2
Protein Name: HA Hemagglutinin
Gene Symbol: -N/A-
UniProtKB Accession: A7Y833
GenBank Protein Accession: ABU99029.1
GenBank Protein GI: 156989941
CDS completeness: Partial
Keywords: Clathrin- and caveolin-independent endocytosis of virus by host; Clathrin-mediated endocytosis of virus by host; Disulfide bond; Fusion of virus membrane with host endosomal membrane; Fusion of virus membrane with host membrane; Hemagglutinin; Host cell membrane; Host membrane; Host-virus interaction; Membrane; Transmembrane; Transmembrane helix; Viral attachment to host cell; Viral envelope protein; Viral penetration into host cytoplasm; Virion; Virus endocytosis by host; Virus entry into host cell

Sequence Derived Phenotype Marker
• Influenza A_H5_species-adaptation_110(1)_110N_Increased-binding-to-alpha2-6
  Protein: HA. Substitution: 110N. Present: Yes. Citation: PubMed:19020946
  Introduction of Asp110Asn substitution in the A/chicken/Fujian/1042/05 backbone conferred increased binding to alpha 2-6 receptor as indicated by the hemadsorption assay with horse and guinea pig erythrocytes.
• Influenza A_H5_transmissibility_119(4)_119Y, 172A, 238L, 240S_Increased-airborne-transmission
  Protein: HA. Substitution: 119Y, 172A, 238L, 240S. Present: No. Citation: PubMed:22723413
  Introduction of the His119Tyr, Thr172Ala, Gln238Leu, Gly240Ser naturally occurring substitutions in the A/Indonesia/5/2005 backbone conferred increased airborne transmission in ferrets using paired transmission cages.
• Influenza A_H5_species-adaptation_137(1)_137N_Increased-binding-to-alpha2-6
  Protein: HA. Substitution: 137N. Present: No. Citation: PubMed:20427525
  Introduction of Ser137Asn naturally occurring substitution in the A/Vietnam/1203/2004 backbone conferred increased binding to alpha 2-6 by measuring hemagglutination activities using enzymatically modified chicken [...]

Protein Sequence Features (SOP)
Sequence Feature Category  Count
functional                 80
epitope                    371
structural                 11
sequence alteration        8

Isoelectric Point/Molecular Weight (SOP)
Isoelectric pt  Molecular Weight  Evidence Code
7.3             64030.3           RCA

HMM/Pfam Domains (SOP)
Accession  Name           Description    Start  End
PF00509    Hemagglutinin  Hemagglutinin  19     565
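The "Substitution" and "Present" columns of the phenotype marker table suggest a simple check: a marker is present only if every listed residue occurs at its position. Below is a minimal sketch of that check. The function names are hypothetical, and it assumes positions are 1-based indices into the supplied protein string; in practice markers are defined in a reference numbering scheme, so the query would first need to be aligned to that reference. The toy sequence is made up for illustration and is not the EU124095 protein.

```python
import re
from typing import Tuple

def parse_substitution(token: str) -> Tuple[int, str]:
    """Parse a token such as '110N' into (position, amino acid)."""
    match = re.fullmatch(r"(\d+)([A-Z])", token.strip())
    if match is None:
        raise ValueError(f"unrecognized substitution: {token!r}")
    return int(match.group(1)), match.group(2)

def marker_present(protein: str, substitutions: str) -> bool:
    """Return True only if ALL comma-separated substitutions are found.

    Assumes 1-based positions that index directly into `protein`.
    """
    for token in substitutions.split(","):
        position, residue = parse_substitution(token)
        if position > len(protein) or protein[position - 1] != residue:
            return False
    return True

if __name__ == "__main__":
    # Toy 130-residue sequence with N at position 110 (illustrative only).
    toy = "M" * 109 + "N" + "A" * 20
    print(marker_present(toy, "110N"))
    print(marker_present(toy, "119Y, 172A, 238L, 240S"))
```

Requiring all residues of a multi-site marker matches the table, where the four-substitution transmissibility marker is reported as a single Yes/No entry.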

Domain & Epitope Annotations

[Screenshot: IRD Segment/Protein Details, A/Chicken/Agam/BBPVI/2005 - HA Hemagglutinin - EU124095, 11/7/2017, domain and epitope sections]

Other Domains/Motifs (SOP)
Domain/Motif   Start  End  Program
transmembrane  533    555  tmhmm
coiled_coil    402    430  ncoils

Predicted Epitopes (SOP)
MHC Supertype  # Predictions
A3             50
A2             32
A24            70
B7             20
B44            32
Total          204

Gene Ontology Classification
Category            Name                                             GO ID       Annotation Source  Evidence
Biological Process  viral envelope fusion with host membrane         GO:0019064  UniProtKB          IEA
Biological Process  virion attachment to host cell surface receptor  GO:0019062  UniProtKB          -N/A-
Cellular Component  host cell plasma membrane                        GO:0020002  UniProtKB          -N/A-
Cellular Component  integral to membrane                             GO:0016021  UniProtKB          IEA
Cellular Component  viral envelope                                   GO:0019031  UniProtKB          IEA

Database Cross References*2
Database Name  Accession  Description
INTERPRO       IPR000149  Hemagglutn_1
INTERPRO       IPR001364  Hemagglutn
INTERPRO       IPR008980  Capsid_hemag
INTERPRO       IPR013827  Haemagglutn_HA1_b-ribbon
INTERPRO       IPR013828  Haemagglutn_HA1_a/b
INTERPRO       IPR013829  Haemagglutn_stalk
PFAM           PF00509    Hemagglutinin
PRINTS         PR00329    HEMAGGLUTN12
PRINTS         PR00330    HEMAGGLUTN1

References*1,*2
PubMed ID: -N/A-
Journal Name: Submitted (27-AUG-2007) WHO Collaborating Centre for Reference and Research on Influenza Centre, 45 Poplar Rd., Parkville, Vic 3052, Australian
Title: Direct Submission
Author: Komadina,N., Usman,T.B. and Selleck,P. et al.
Year: -N/A-
Annotation Source: GenBank

Data Sources
ID  Source
1   NCBI
2   UniProtKB
3   IEDB

Data Options

View Annotations in GBrowse

• DNA viruses
• Annotation types
  ◦ genes, Pfam domains, epitopes, SNPs, expression data, custom data
• View various types of annotations along the genome

Virus Classification Tools

Hepatitis C Virus Typing Tool

• Background
  ◦ Subtype assignment is critical
  ◦ Subtype classification by ICTV
    – Reference sequences: many singletons
    – Reference alignment
    – As of June 2017, 7 genotypes, 86 subtypes
  ◦ Other HCV resources are no longer maintained
  ◦ Many sequences are partial
• Goals
  ◦ An automated subtype annotation tool
  ◦ Comprehensive annotations in ViPR

Hepatitis C Virus Typing Tool

Placing a query onto a "fixed" hierarchically annotated reference tree

Figure 1. cladinator analysis of query placements in hierarchically annotated artificial trees. (A) Query is A-type (bracketed by A and A). (B) Query is A-type (bracketed by A.1 and A.2). (C) Query is of unknown type (bracketed by A and B). In reference to Q, A is called "down-tree", while B is called "up-tree." Naïvely, it looks like Q might be of A-type, but we do not know at which point along the branch going from AB-ancestor to A the type changes from AB-ancestor-type to A-type. Therefore, Q is of unknown type.
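The three cases in Figure 1 can be expressed as a small rule on the two bracketing annotations: the query takes the deepest type shared by its down-tree and up-tree neighbors, and is of unknown type when they share nothing. The sketch below is a simplification written for illustration; the function name and the dotted-label representation are assumptions, and it does not reproduce how cladinator actually walks the placement tree.

```python
from typing import Optional

def infer_type(down_tree: str, up_tree: str) -> Optional[str]:
    """Return the deepest shared hierarchical type of two bracketing labels.

    Labels use dots for hierarchy levels (e.g. 'A.1' is a subtype of 'A').
    Returns None when the neighbors share no level, i.e. the type is unknown.
    """
    shared = []
    for level_down, level_up in zip(down_tree.split("."), up_tree.split(".")):
        if level_down != level_up:
            break
        shared.append(level_down)
    return ".".join(shared) if shared else None

if __name__ == "__main__":
    print(infer_type("A", "A"))      # case (A): bracketed by A and A
    print(infer_type("A.1", "A.2"))  # case (B): bracketed by A.1 and A.2
    print(infer_type("A", "B"))      # case (C): bracketed by A and B
```

Comparing whole levels rather than raw string prefixes avoids treating labels such as "A1" and "A10" as related.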

[Screenshot: ViPR Hepatitis C virus Genome Search, 12/1/2017]

Search for virus genomic sequences and related information. You can search for the whole virus family or search for specified genus, species etc. You can also find your strain or genome record if you have its information, such as strain name, accession.

Genome searches for Dengue virus or Hepatitis C virus can be augmented with clinical metadata criteria. Selecting the appropriate nodes in the taxonomy browser (Flavivirus, Dengue virus, Hepacivirus, Hepatitis C virus) will add metadata search panels and enable you to include these criteria. Some sequences have more metadata fields defined than others. Queries based on metadata only retrieve sequences for which those fields are defined.

Results matching your criteria: 221,174

Subtype  Strains  Complete genomes
1a       28671    588
1b       28236    830
1c       85       19
1e       31       2
1g       29       2

Hepatitis C Virus Typing Tool

[Screenshot: ViPR Genotype determination and Recombination detection Results, HCV Genotyping/Subtyping Report (Beta) (SOP), 12/1/2017]

Your analysis contains 2 records

Query Identifier  Consensus Assignment  Support  Report
KU871281          5a                    1.000    Input alignment (FASTA), Output tree (Newick), Subtype assignment (text)
DQ164588          5a                    0.979    Input alignment (FASTA), Output tree (Newick), Subtype assignment (text)

Comprehensive subtype annotations in ViPR

[Screenshot: ViPR Strain Details for Hepatitis C virus Strain NED5-1, 12/1/2017]

Strain Information
Strain Name: NED5-1
Organism: Hepatitis C virus
Taxonomy: Flaviviridae -> Hepacivirus -> Hepatitis C virus -> Type 5a
Subtype/Genotype (ViPR) (SOP): 5a
Subtype/Genotype (Genbank): 5
GenBank Host: Homo sapiens
Host: Human
Isolation Country: Netherlands
Collection Date: 2014-01-01

Genome: KU871281
GenBank Definition: Hepatitis C virus isolate NED5-1 polyprotein gene, partial cds.
Authors: Bull,R.A., Eltahla,A.A., Rodrigo,C., Koekkoek,S.M., Walker,M., Pirozyan,M.R., Betz-Stablein,B., Toepfer,A., Laird,M., Oh,S., Heiner,C., Maher,L., Schinkel,J., Lloyd,A.R. and Luciani,F.; Bull,R.A. and Schinkel,J.
GenBank Sequence Accession: KU871281
Sequence Length: 9244
Sequence Status: Not Complete
Number of Proteins: 11
Organism Name: Hepatitis C virus
GenBank Note: genotype: 5
Mol Type: genomic RNA
GenBank Host: Homo sapiens
Host: Human
Isolation Country: Netherlands
Collection Date: 2014-01-01

Influenza Virus H1 Clade Classification

• Good sequence quality in general
• Community proposed classification scheme
• Promoting the new classification scheme: collaboration between IRD and FAO/OFFLU

Loading Influenza Research Database...About Us Community Announcements Links Resources Support Workbench Sign In

SEARCH DATA ANALYZE & VISUALIZE WORKBENCH SUBMIT DATA HELP

Home Swine H1... Results Strain Details (A/Minnesota/14/2012) Segment Details (A/Minnesota/14/2012 Seg. 4) Influenza Segment/Protein Details

Download Generate PDF Add to Working Set Identify Similar Sequences (BLAST) Send Comments to Curator

Segment: KJ620415 Strain Information | Segment Information | SNP Details | Annotation | References | Data Sources | Influenza Virus H1 CladeProtein: HA Hemagglutinin Classification Protein Information | Isoelectric Pt/Molecular Weight | Pfam Domain | Pfam Motifs | Predicted Epitopes | Gene Ontologies | Database Cross References | Our curation pipeline has not identified potential sequence artifacts in KJ620415 by alignment to A/4/H1 profile.

*1 Comprehensive H1 Strain Information

clade annotations Complete Genome Set: Yes Host: Human only in IRD Organism Name: Influenza A Virus Collection Date: 12/2012 A/Minnesota/14/2012 Flu Season (SOP): 12­13 Strain Name: View Strain Details Isolation Country: USA Subtype: H1N2 GenBank Submission 04/19/2014 Global Swine H1 Clade 1B.2.2.2 Date: (SOP): NCBI Taxon ID: 1479037 US Swine H1 Clade(SOP): delta1 GenBank Accession Numbers KJ620412­ 2009 Pandemic H1N1­like Mixed Positive and Negative Segments KJ620419 represent sequences from the 8 (SOP) ?: segments of Influenza A virus (A/Minnesota/14/2012(H1N2)). GenBank Header Notes: ##Assembly­Data­START## Assembly Method :: Lasergene DNA Star v. 10 4/18/2017 Influenza Research Database - Swine H1 Clade Classification An automated H1 clade classification tool H1 clade based analysis and visualization Sequencing Technology :: ABI 3730xl DNA Analyzers ##Assembly­Data­END## Loading Influenza Research Database...About Us Community Announcements Links Resources Support Workbench Sign In

Swine H1 Clade Classification Tool

The swine H1 clade classification tool classifies the clade of HA/H1 viruses from any host and for any NA subtype, with reference to the global swine H1 clade scheme (Anderson et al., 2016) or the US swine H1 classification scheme. The tool infers the clade for a query sequence from its position within the reference tree. It is a collaboration between Catherine Macken at IRD and Tavis Anderson and others at the USDA.

• SOP for global swine H1 clade classification
• Global Swine H1 Clade Classification Reference Tree
• Global Swine H1 Clade Classification Reference Sequences
• SOP for US swine H1 clade classification
• US Swine H1 Clade Classification Reference Tree
• Description of US clades with names that include "-like"

Sequences from other serotypes of HA, or from other segments, will yield unpredictable and likely incorrect results. If unsure of your sequence's segment or serotype, we suggest you use the IRD Sequence Annotation Tool (Analyze & Visualize > Annotate Nucleotide Sequences).

ANALYSIS NAME
INPUT SEQUENCES
• Analyze my custom sequences only: upload a file containing my sequences in FASTA format, or paste sequences in FASTA format.
• Analyze my custom sequences and associated metadata with IRD sequences.

Segment Information
GenBank Source Sequence Accession: KJ620415
Definition: Influenza A virus (A/Minnesota/14/2012(H1N2)) segment 4 hemagglutinin (HA) gene, complete cds.
Authors: Duff,M.A., Ma,J., Ma,J., Hesse,R.A., Sloan,H. and Ma,W.
Segment Number: 4
Segment Length: 1698
Complete Coding Sequence: Complete
2009 Pandemic H1N1-like (SOP): No

SNP (SOP)
Consensus: Host: Human; Subtype: H1N2; Segment: 4
# insertions compared with consensus: 0
# deletions compared with consensus: 0
# mismatches compared with consensus: 68

Release Date: Mar 16, 2017
This project is funded by the National Institute of Allergy and Infectious Diseases (NIH / DHHS) under Contract No. HHSN272201400028C and is a collaboration between Northrop Grumman Health IT, J. Craig Venter Institute, and Vecna Technologies.

Segment Annotation
Protein Name | Complete Sequence | CDS Start | CDS End | Protein Length (aa) | Source
HA Hemagglutinin | Complete | 1 | 1698 | 565 | GenBank Nucleotide
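The SNP summary for this segment reports insertions, deletions, and mismatches relative to a consensus. A minimal sketch of how such counts could be derived from a pairwise alignment is shown below. The function name `count_differences` is hypothetical, and counting one event per gapped column is an assumption; the IRD SOP may define these counts differently (for example, in how it treats ambiguity codes).

```python
def count_differences(query: str, consensus: str) -> dict:
    """Count per-column differences between an aligned query and a consensus.

    Both strings must come from the same pairwise alignment, so they have
    equal length and use '-' for gaps. (Assumption: one count per column.)
    """
    if len(query) != len(consensus):
        raise ValueError("aligned sequences must have equal length")

    counts = {"insertions": 0, "deletions": 0, "mismatches": 0}
    for q, c in zip(query.upper(), consensus.upper()):
        if q == "-" and c == "-":
            continue                      # gap-only column, nothing to compare
        if c == "-":
            counts["insertions"] += 1     # base present in the query only
        elif q == "-":
            counts["deletions"] += 1      # base present in the consensus only
        elif q != c:
            counts["mismatches"] += 1     # both have a base, but they differ
    return counts


# Toy alignment (not real influenza data)
example = count_differences("ACG-TA", "AC-TTC")
```

With this toy alignment, `example` holds one insertion, one deletion, and one mismatch.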

https://www.fludb.org/brc/h1CladeClassifier.spg?method=ShowCleanInputPage&decorator=influenza
https://www.fludb.org/brc/fluSegmentDetails.spg?ncbiGenomicAccession=KJ620415&decorator=influenza

Influenza Virus H5 Clade Classification

World Health Organization/World Organisation for Animal Health/Food and Agriculture Organization (WHO/OIE/FAO) H5N1 Evolution Working Group. Influenza Other Respir Viruses. 2014 May;8(3):384-8.

Influenza Research Database - Nucleotide Sequence Search Results (page captured 11/6/2017)

IRD uses an automated pipeline to detect potential sequence artifacts or poor quality by aligning influenza virus nucleotide sequences submitted to the resource to a curated profile of like sequences. The pipeline sets flags indicating the category and location of artifacts, or the type of poor quality sequence. These flags are summarized in sequence search results and working sets as follows: Ambig-Seq (excessive Ns or ambiguity symbols, or insufficient similarity to the profile); Flag-NCR (issues only in non-coding regions); Flag-CDS (issues in CDS, possibly also in NCR); Pass (no issues). See SOP for further details.
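The four summary flags can be read as a small decision rule. The sketch below is illustrative only: the function name `summary_flag` and its parameters are hypothetical, and giving Ambig-Seq precedence over the two location flags is an assumption, not something the page states. The actual rules are in the IRD SOP.

```python
def summary_flag(ambiguous_or_dissimilar: bool, cds_issues: int, ncr_issues: int) -> str:
    """Collapse per-sequence findings into one of the four summary flags.

    ambiguous_or_dissimilar: excessive Ns/ambiguity symbols, or insufficient
        similarity to the curated profile
    cds_issues: number of issues located in the coding sequence
    ncr_issues: number of issues located in non-coding regions
    """
    if ambiguous_or_dissimilar:
        return "Ambig-Seq"   # assumed to take precedence over location flags
    if cds_issues > 0:
        return "Flag-CDS"    # issues in CDS, possibly also in NCR
    if ncr_issues > 0:
        return "Flag-NCR"    # issues only in non-coding regions
    return "Pass"            # no issues
```

For example, a sequence with one CDS issue and two NCR issues would be summarized as Flag-CDS, because a CDS issue is reported whether or not NCR issues are also present.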

Your search returned 450 segments. Displaying 50 records per page, sorted by Strain Name in ascending order (page 1 of 9).

Segment | Protein Name | Sequence Accession | Subtype | Collection Date | Host Species | Country | Strain Name | Curation Report | H5 Clade (SOP)
4 | HA | KR987706 | H5N1 | 10/20/2006 | *Dog | Indonesia | A/canine/Buleleng/21/2006 | Pass | 2.1.3.2
4 | HA | KR987703 | H5N1 | 09/15/2006 | *Dog | Indonesia | A/canine/Jembrana/18/2006 | Pass | 2.1.3
4 | HA | EU124095 | H5N1 | 2005 | Chicken/Avian | Indonesia | A/Chicken/Agam/BBPVI/2005 | Pass | 2.1.2
4 | HA | CY091811 | H5N1 | 05/02/2007 | Chicken/Avian | Indonesia | A/chicken/Badung/BBVD-175/2007 | Pass | 2.1.3.2
4 | HA | CY091812 | H5N1 | 05/15/2007 | Chicken/Avian | Indonesia | A/chicken/Badung/BBVD-205/2007 | Pass | 2.1.3.2
4 | HA | CY091813 | H5N1 | 05/22/2007 | Chicken/Avian | Indonesia | A/chicken/Badung/BBVD-216/2007 | Pass | 2.1.3.2
4 | HA | CY091814 | H5N1 | 05/23/2007 | Chicken/Avian | Indonesia | A/chicken/Badung/BBVD-219/2007 | Pass | 2.1.3.2
4 | HA | CY091815 | H5N1 | 06/11/2007 | Chicken/Avian | Indonesia | A/chicken/Badung/BBVD-277/2007 | Pass | 2.1.3
4 | HA | CY091816 | H5N1 | 06/13/2007 | Chicken/Avian | Indonesia | A/chicken/Badung/BBVD-288/2007 | Pass | 2.1.3
4 | HA | CY091818 | H5N1 | 06/25/2007 | Chicken/Avian | Indonesia | A/chicken/Badung/BBVD-319ac/2007 | Pass | 2.1.3.2
4 | HA | CY091819 | H5N1 | 07/03/2007 | Chicken/Avian | Indonesia | A/chicken/Badung/BBVD-328/2007 | Pass | 2.1.3
4 | HA | CY091820 | H5N1 | 07/11/2007 | Chicken/Avian | Indonesia | A/chicken/Badung/BBVD-342/2007 | Pass | 2.1.3
4 | HA | CY091822 | H5N1 | 09/03/2007 | Chicken/Avian | Indonesia | A/chicken/Badung/BBVD-532/2007 | Pass | 2.1.3.2
4 | HA | KR078246 | H5N1 | 2009 | Chicken/Avian | Indonesia | A/chicken/Bali/M26/2009 | Pass | 2.1.3.2
4 | HA | KM892798 | H5N1 | 06/01/2009 | *Chicken/Avian | Indonesia | A/chicken/Bali/U8661/2009 | Pass | 2.1.3.2
4 | HA | GQ122391 | H5N1 | 2005 | Chicken/Avian | Indonesia | *A/chicken/Bali/UT2091/2005(H5N1) | Pass | 2.1.3.1
4 | HA | GQ122392 | H5N1 | 2005 | Chicken/Avian | Indonesia | A/chicken/Bali/UT2092/2005 | Pass | 2.1.3.1
4 | HA | CY091797 | H5N1 | 05/31/2007 | Chicken/Avian | Indonesia | A/chicken/Bangli/BBVD-245/2007 | Pass | 2.1.3
4 | HA | CY091796 | H5N1 | 07/11/2007 | Chicken/Avian | Indonesia | A/chicken/Bangli/BBVD-343/2007 | Pass | 2.1.3.2
4 | HA | CY091799 | H5N1 | 07/26/2007 | Chicken/Avian | Indonesia | A/chicken/Bangli/BBVD-387ab/2007 | Pass | 2.1.3.2
4 | HA | CY091800 | H5N1 | 09/13/2007 | Chicken/Avian | Indonesia | A/chicken/Bangli/BBVD-555ab/2007 | Pass | 2.1.3.2
4 | HA | CY091801 | H5N1 | 09/18/2007 | Chicken/Avian | Indonesia | A/chicken/Bangli/BBVD-562/2007 | Pass | 2.1.3
4 | HA | CY091802 | H5N1 | 09/18/2007 | Chicken/Avian | Indonesia | A/chicken/Bangli/BBVD-563/2007 | Flag-NCR | 2.1.3.2
4 | HA | CY091803 | H5N1 | 09/25/2007 | Chicken/Avian | Indonesia | A/chicken/Bangli/BBVD-575/2007 | Pass | 2.1.3
4 | HA | GU183450 | H5N1 | 10/2004 | Chicken/Avian | Indonesia | A/chicken/Banten/Pdgl-Kas/2004 | Pass | 2.1.3.1
4 | HA | GU183461 | H5N1 | 01/2008 | Chicken/Avian | Indonesia | A/chicken/Banten/Srg-Fadh/2008 | Pass | 2.1.3.2
4 | HA | CY091876 | H5N1 | 07/12/2007 | Chicken/Avian | Indonesia | A/chicken/Bantul/BBVW-446-24452/2007 | Pass | 2.1.3.2
4 | HA | CY091877 | H5N1 | 07/12/2007 | Chicken/Avian | Indonesia | A/chicken/Bantul/BBVW-446-24453/2007 | Pass | 2.1.3.2
4 | HA | CY091874 | H5N1 | 07/06/2007 | Chicken/Avian | Indonesia | A/chicken/Bantul/BBVW-446-24454/2007 | Pass | 2.1.3.2
4 | HA | CY091878 | H5N1 | 07/12/2007 | Chicken/Avian | Indonesia | A/chicken/Bantul/BBVW-446-24456/2007 | Pass | 2.1.3.2
4 | HA | CY091887 | H5N1 | 07/18/2007 | Chicken/Avian | Indonesia | A/chicken/Bantul/BBVW-482-22233/2007 | Pass | 2.1.3.2
4 | HA | CY091886 | H5N1 | 07/10/2007 | Chicken/Avian | Indonesia | A/chicken/Bantul/BBVW-482-22234/2007 | Pass | 2.1.3.2
4 | HA | CY091888 | H5N1 | 07/18/2007 | Chicken/Avian | Indonesia | A/chicken/Bantul/BBVW-482-22235/2007 | Pass | 2.1.3.2
4 | HA | CY091904 | H5N1 | 08/28/2007 | Chicken/Avian | Indonesia | A/chicken/Bantul/BBVW-627-23296/2007 | Pass | 2.1.3.2
4 | HA | CY091919 | H5N1 | 09/21/2007 | Chicken/Avian | Indonesia | A/chicken/Bantul/BBVW-678-441/2007 | Pass | 2.1.3.2
4 | HA | CY091920 | H5N1 | 09/21/2007 | Chicken/Avian | Indonesia | A/chicken/Bantul/BBVW-678-443/2007 | Pass | 2.1.3.2
4 | HA | CY091789 | H5N1 | 09/05/2007 | Chicken/Avian | Indonesia | A/chicken/Buleleng/BBVD-545b/2007 | Pass | 2.1.3
4 | HA | KJ842552 | H5N1 | 2009 | *Chicken/Avian | Indonesia | A/chicken/Central Java/51/2009 | Pass | 2.1.3.2
4 | HA | KJ842553 | H5N1 | 2009 | *Chicken/Avian | Indonesia | A/chicken/Central Java/59/2009 | Pass | 2.1.3.2
4 | HA | KR078244 | H5N1 | 2009 | Chicken/Avian | Indonesia | A/chicken/Central Java/M18/2009 | Pass | 2.1.3.2
4 | HA | KR078247 | H5N1 | 2010 | Chicken/Avian | Indonesia | A/chicken/Central Java/M31/2010 | Pass | 2.1.3.2
4 | HA | GQ122395 | H5N1 | 2005 | Chicken/Avian | Indonesia | *A/chicken/Central Java/UT3091/2005(H5N1) | Pass | 2.1.3
4 | HA | KR078219 | H5N1 | 2010 | Chicken/Avian | Indonesia | A/chicken/Central Sulawesi/M32/2010 | Pass | 2.1.3.1
4 | HA | AB685453 | H5N1 | 08/2010 | *Chicken/Avian | Indonesia | A/chicken/CentralJava/UT561/2010 | Pass | 2.1.3.2
4 | HA | EU167546 | H5N1 | 2006 | Chicken/Avian | Indonesia | A/chicken/Cicurug/IPB24-RS/2006 | Flag-CDS | 2.1.3.2
4 | HA | DQ497667 | H5N1 | 2005 | Chicken/Avian | Indonesia | *A/chicken/Dairi/BPPVI/2005(H5N1) | Pass | 2.1.2
4 | HA | EU124091 | H5N1 | 09/23/2005 | Chicken/Avian | Indonesia | A/Chicken/Deli Derdang/BBPVI/2005 | Pass | 2.1.2
4 | HA | EU124108 | H5N1 | 2005 | Chicken/Avian | Indonesia | A/Chicken/Deli Serdang/BPPV1/2005 | Pass | 2.1.2
4 | HA | DQ497668 | H5N1 | 2005 | Chicken/Avian | Indonesia | *A/chicken/Deli Serdang/BPPVI/2005(H5N1) | Pass | 2.1.2
4 | HA | CY091774 | H5N1 | 03/27/2007 | Chicken/Avian | Indonesia | A/chicken/Denpasar/BBVD-145/2007 | Pass | 2.1.3.2

* On the IRD page, mousing over an asterisk shows additional information for that field.
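The H5 clade names in the search results are hierarchical: 2.1.3.2 is nested inside 2.1.3, which is nested inside 2.1. A short sketch of how that nesting could be used to group results follows. The helper name `is_within` is hypothetical, and the comparison handles purely numeric dotted names only; names with letter suffixes or "-like" labels would need extra handling.

```python
from collections import Counter


def is_within(clade: str, ancestor: str) -> bool:
    """Return True if `clade` equals `ancestor` or is nested inside it."""
    clade_parts = clade.split(".")
    ancestor_parts = ancestor.split(".")
    # Compare whole dotted components so that "2.1.30" is not inside "2.1.3".
    return clade_parts[: len(ancestor_parts)] == ancestor_parts


# Clade values copied from the first four rows of the search results
clades = ["2.1.3.2", "2.1.3", "2.1.2", "2.1.3.2"]

per_clade = Counter(clades)                                  # exact clade tallies
within_2_1_3 = sum(is_within(c, "2.1.3") for c in clades)    # 2.1.3 and its subclades
```

Here `within_2_1_3` is 3, since the 2.1.2 row falls outside the 2.1.3 lineage.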

Release Date: Oct 21, 2017

https://www.fludb.org/brc/influenza_sequence_search_segment_display.spg

Rotavirus A Annotation

• Missing metadata
• Incomplete, inconsistent segment annotations
• No strain level linkage
• Community genotyping tool is slow and unstable

Annotating a Genome with PATRIC
Marcus Nguyen

Genera listed: Bartonella, Borrelia, Burkholderia, Campylobacter, Chlamydophila, Coxiella, Ehrlichia, Escherichia, Francisella, Helicobacter, Mycobacterium, Rickettsia, Salmonella, Shigella, Staphylococcus, Streptococcus, Vibrio, Yersinia

PATRIC has ALL Bacterial Genomes, not just pathogens
• Comprehensive Data Collection
  ◦ Unified Database, including RefSeq, GenBank, other sources
• Uniform Annotation Across all Genomes
  ◦ RAST annotation, EC, GO, plus RefSeq annotations
  ◦ Uniform projection of Protein Families, AMR related genes and Virulence factors
• User Workspace for analysis of User data
  ◦ “Virtual Integration” of your data in the context of all the public datasets

• Calling rRNAs (16S, 23S, 5S)
• Calling tRNAs with tRNAscan-SE
  ◦ (Lowe & Eddy 1997)
• Searching for repeat regions
• Finding special proteins
  ◦ Selenoproteins
  ◦ Pyrrolysylproteins
• Calling CRISPRs
  ◦ clustered regularly interspaced short palindromic repeats
• Calling protein-encoding genes
  ◦ Prodigal (Hyatt et al. 2010)
  ◦ Glimmer3 (Delcher et al. 2007)
• Assigning function
  ◦ First attempt: annotates against CoreSEED
  ◦ Second attempt: annotates against FIGFams
  ◦ Third attempt: BLAST against close relatives
• Overlapping genes are resolved
• Annotates matches to:
  ◦ ARDB (Liu & Pop 2009)
  ◦ CARD (McArthur et al. 2013)
  ◦ VFDB (Chen et al. 2012)
  ◦ Victors (Xiang et al. 2007)
  ◦ PATRIC virulence factors (Mao et al. 2014)
  ◦ DrugBank (Law et al. 2014)
  ◦ TTD (Qin et al. 2014)
  ◦ Human homologs
• Assigns proteins to families
• Finds closest neighbors
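The three-attempt function assignment listed above follows a fallback pattern: try the first source, and move to the next only if no function was found. The sketch below shows that pattern only. It is not PATRIC or RAST code: `assign_function`, the lookup callables, the toy protein identifiers, and the "hypothetical protein" default are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# A lookup takes a protein identifier and returns a function name, or None.
Lookup = Callable[[str], Optional[str]]


def assign_function(protein: str, lookups: Sequence[Tuple[str, Lookup]]) -> Tuple[str, str]:
    """Try each annotation source in order; return (function, source)."""
    for source, lookup in lookups:
        function = lookup(protein)
        if function is not None:
            return function, source        # first source with a hit wins
    return "hypothetical protein", "none"  # assumed default when nothing matches


# Toy stand-ins for the three attempts (real lookups would search databases).
core_seed = {"prot1": "DNA gyrase subunit A"}
fig_fams = {"prot2": "Chaperone protein DnaK"}
close_relatives = {"prot3": "Putative transporter"}

attempts = [
    ("CoreSEED", core_seed.get),
    ("FIGFams", fig_fams.get),
    ("BLAST against close relatives", close_relatives.get),
]

first_hit = assign_function("prot1", attempts)
second_hit = assign_function("prot2", attempts)
no_hit = assign_function("prot9", attempts)
```

Keeping the sources as an ordered list makes the priority explicit and lets the result record which attempt produced each annotation.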