Annotation: Curation, Tools, Ontologies, Databases Genomics
Annotation: curation, tools, ontologies, databases
Mike Cherry
Genomics Lecture 7
Genetics 211 - Winter 2014

Database of the Past
Staats, Joan. The Classified Bibliography of Inbred Strains of Mice. Science 119(3087): 295-296 (1954-02-26).

What’s this guy talking about?
"So you want to make a nnn, there are a couple of steps. You acquire data through partners. You do a bunch of engineering on that data to get it into the right format and conflate it with other sources of data, and then you do a bunch of operations, which is what this tool is about, to hand massage the data. And out the other end pops something that is higher quality than the sum of its parts."
Michael Weiss-Malik, The Atlantic 9/2012

Why curate publications?
• Standardize vocabulary – full text is easy to search, but it is difficult to know whether the search was complete
• Integrate results – why don’t publishers mandate submission of standardized data?
• For decades, crystallographic data and coordinates, and GenBank accession numbers, have been required
• GEO and SRA accessions should be required, but the requirement is not enforced

Manual Curation
• Read published literature or use tools for analysis of results to make the best annotation
• Identify the experimental methods used
• Connect associated IDs from ontologies/vocabularies, sequences and their IDs, and connections to other databases (pathway, chemical, orthology, interactions, disease, expression, etc.)

Examples of Curated Databases
• Protein Database, UniProtKB
• Genome, model organism database
• Chemical, ChEBI or PubChem
• Human Genetic Disease, OMIM
• Gene Function, Gene Ontology Consortium
• Gene Models, GENCODE & UCSC
• Sequence Variants, LSVD & HGMD
• Personal Genomics, Ingenuity & OMICIA

How would you curate this paper?

Drosophila Hedgehog
The gene hedgehog is referred to in FlyBase by the symbol Dmel\hh (CG4637, FBgn0004644).
It is a protein_coding_gene from Drosophila melanogaster. Its sequence location is 3R:18953425..18967881. It has the cytological map location 94E1. There is experimental evidence that it has the molecular function: cell surface binding; protein binding. There is experimental evidence for 24 unique biological process terms, many of which group under: anatomical structure development; cellular component movement; organ morphogenesis; regulation of biological process; sensory organ development; reproductive process in a multicellular organism; organ development; regulation of protein transport; cell projection organization; open tracheal system development; central nervous system development. 169 alleles are reported. The phenotypes of these alleles are annotated with: organ system; adult segment; adult mesothoracic segment; peripheral nervous system; primordium; nervous system; embryonic nervous system; late extended germ band embryo; ventral nerve cord primordium; abdominal ventral denticle belt. It has 2 annotated transcripts and 2 annotated polypeptides.

Saccharomyces SGS1
SGS1 encodes a helicase with similarity to E. coli RecQ and human BLM and WRN helicases (6). Mutations in BLM are implicated in the cancer-prone Bloom's Syndrome, and mutations in WRN cause the premature-aging Werner's Syndrome (6). SGS1 was identified in a screen for suppressors of the slow growth phenotype of top3 mutants, and Sgs1p has been shown to interact with the topoisomerase Top3p (7). Sgs1p appears to be involved in the maintenance of genome stability and the suppression of illegitimate recombination; sgs1 null mutants show mitotic hyperrecombination and elevated levels of chromosome missegregation (8, 9, 10). Sgs1p has been localized to the nucleolus, and is needed to maintain the integrity of rDNA repeats (2). Sgs1p shows ATPase activity and unwinds duplex DNA; it preferentially binds to branched DNA substrates and has a 3' to 5' polarity of unwinding (11, 12).
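
The hedgehog summary above reads as prose, but each clause corresponds to a structured field that a curator filled in. As a rough illustration only, the sketch below holds a few of those fields in a record and parses the sequence location string. The class names, field names and the parse_location helper are hypothetical and are not FlyBase's schema; the length calculation assumes the coordinates are 1-based and inclusive, which the slide does not state.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceLocation:
    """A chromosome arm plus start/end coordinates (assumed inclusive)."""
    arm: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumption: both endpoints are part of the feature.
        return self.end - self.start + 1


# Matches strings shaped like "3R:18953425..18967881".
_LOCATION = re.compile(r"^(?P<arm>[^:]+):(?P<start>\d+)\.\.(?P<end>\d+)$")


def parse_location(text: str) -> SequenceLocation:
    """Parse an 'arm:start..end' string (hypothetical helper)."""
    match = _LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized location: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return SequenceLocation(match["arm"], start, end)


@dataclass(frozen=True)
class GeneRecord:
    """A few curated fields taken from the hedgehog summary."""
    symbol: str
    identifiers: tuple
    organism: str
    location: SequenceLocation
    cytological_band: str
    allele_count: int


# Values copied from the summary text; nothing here is looked up.
hh = GeneRecord(
    symbol="Dmel\\hh",
    identifiers=("CG4637", "FBgn0004644"),
    organism="Drosophila melanogaster",
    location=parse_location("3R:18953425..18967881"),
    cytological_band="94E1",
    allele_count=169,
)

print(hh.symbol, hh.location.arm, hh.location.length)
```

Keeping the location as typed fields rather than a string is what makes questions such as "which genes overlap this interval?" answerable by a query instead of by reading.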
www.pharmgkb.org
Knowledge Base of Pharmacogenetic and Pharmacogenomic Information

Imaging expression patterns during Drosophila embryogenesis. Tomancak et al. Genome Biology 2002 3:research0088.1

The Gene Wiki article about Cyclin-dependent kinase. Good B M et al. Nucl. Acids Res. 2011;nar.gkr925

Trust distributions of Gene Wiki revisions versus general (non-Gene Wiki) Wikipedia revisions. Good B M et al. Nucl. Acids Res. 2011;nar.gkr925

Directed Acyclic Graph

How can ontologies help organize information?
• Describe material entities of nature
• Represent events, actions, procedures & relationships as immaterial entities
• Specify connections between entities
• Standardize nomenclature
• Enhance computability of information
• Provide a framework to communicate information

An ontology is a set of terms…
… with different types of relationships to each other.
All relationships must be true, because inferences can be made based on these relationships.
[Figure: graph of the terms cell (parent term), mitochondrion, chromosome, nucleus (child terms) and mitochondrial chromosome, connected by part_of and is_a relationships]
www.geneontology.org/GO.ontology.relations.shtml

Example Gene Ontology Term
id: GO:0000016
name: lactase activity
namespace: molecular_function
def: "Catalysis of the reaction: lactose + H2O = D-glucose + D-galactose." [EC:3.2.1.108]
synonym: "lactase-phlorizin hydrolase activity" BROAD [EC:3.2.1.108]
synonym: "lactose galactohydrolase activity" EXACT [EC:3.2.1.108]
xref: EC:3.2.1.108
xref: MetaCyc:LACTASE-RXN
xref: Reactome:20536
is_a: GO:0004553 ! hydrolase activity, hydrolyzing O-glycosyl compounds

Subset of Evidence Code Ontology
[Term]
id: ECO:0000067
name: inferred from electronic annotation
def: "Used for annotations that depend directly on computation or automated transfer of annotations from a database, particularly when the analysis is performed internally and not published. A key feature that distinguishes this evidence code from others is that it is not made by a curator; use IEA when no curator has checked the specific annotation to verify its accuracy. The actual method used (BLAST search, Swiss-Prot keyword mapping, etc.) doesn't matter." [GO:IEA]
synonym: "IEA" RELATED []
is_a: ECO:0000043 ! inferred from in-silico analysis

[Term]
id: ECO:0000007
name: inferred from immunofluorescence
def: "Used when an annotation is made based on methods that detect the presence of macromolecules, proteins, and compounds by the use of a fluorescent-labeled antibody." [TAIR:TED]
synonym: "IDA: immunofluorescence" RELATED []
is_a: ECO:0000040 ! inferred from immunological assay

Evidence Codes for Gene Product Annotations
• Direct assay (IDA). Enzyme assays; in vitro reconstitution (e.g. transcription); immunofluorescence (for cellular component)
• Expression pattern (IEP). Transcript levels (e.g. Northerns, microarray data)
• Genetic interaction (IGI). "Traditional" genetic interactions such as suppressors, synthetic lethals, etc.
• Mutant phenotype (IMP). Any gene mutation/knockout
• Physical interaction (IPI). 2-hybrid interactions; co-purification
• Sequence or structural similarity (ISS). Sequence similarity (homologue of/most closely related to); recognized domains
• Sequence Alignment (ISA). Pairwise or multiple sequence alignment to experimentally characterized proteins.
• Sequence Orthology (ISO). Orthologous genes share a common ancestor and have arisen due to a speciation event. Phylogenetic analysis with maximum likelihood or nearest neighbor joining.
• Sequence Model (ISM). A statistical modeling tool determined protein membership in a functional family. HMM, tRNAscan and TMHMM are examples of this type of algorithm.
• Genomic context (IGC). Annotations based on synteny; in these cases sequence similarity is not enough. For example, presence within an operon in bacterial systems.

[Figure: Number of Annotated Genes per Organism by Evidence Type, December 2013 (using gene name as unique ID). Bars for the P, F and C aspects of each of EcoCyc, PomBase, SGD, dictyBase, WB, FB, Chicken, Cow, ZFIN, Pig, Dog, RGD, MGI, TAIR and Human; y-axis from 0 to 30000.]

Biological process GO enrichment of 88 human peroxisome proteins before the focused manual peroxisome protein annotation effort. Mutowo-Meullenet P et al. Database 2013;2013:bas062 © The Author(s) 2013. Published by Oxford University Press.

Biological process GO enrichment of 88 human peroxisome proteins after the focused manual peroxisome protein annotation effort. Mutowo-Meullenet P et al. Database 2013;2013:bas062

Annotations are unconnected
Genes: pap1, sty1
• positive regulation of transcription from pol II promoter in response to oxidative stress [GO:0036091]
• protein localization to nucleus [GO:0034504]
• cellular response to oxidative stress [GO:0034599]

DB        Object   Term         Ev    Ref             ..
PomBase   sty1     GO:0034504   IMP   PMID:9585505    ..
..        ..       ..
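
An annotation row like the one above attaches a gene product to a single term; the ontology relationships are what let a query at a more general term still find it. The sketch below illustrates that inference on the toy cellular ontology from the earlier slide. The edge list is an assumption read from the slide's diagram, the function names are hypothetical, and the composition rule used (a chain of is_a links stays is_a, any chain containing part_of becomes part_of) is a simplified reading of the GO relations documentation rather than a full reasoner.

```python
from collections import defaultdict

# (child, relation, parent) edges, assumed from the slide's diagram.
EDGES = [
    ("mitochondrion", "part_of", "cell"),
    ("chromosome", "part_of", "cell"),
    ("nucleus", "part_of", "cell"),
    ("mitochondrial chromosome", "is_a", "chromosome"),
    ("mitochondrial chromosome", "part_of", "mitochondrion"),
]


def compose(first: str, second: str) -> str:
    """Relation implied by following `first` and then `second`."""
    # is_a then is_a is still is_a; any chain with part_of is part_of.
    return "is_a" if first == second == "is_a" else "part_of"


def inferred_relations(edges):
    """Return every asserted or implied (child, relation, ancestor)."""
    parents = defaultdict(list)
    for child, relation, parent in edges:
        parents[child].append((relation, parent))

    inferred = set()
    for term in list(parents):
        stack = list(parents[term])
        while stack:
            relation, ancestor = stack.pop()
            if (term, relation, ancestor) in inferred:
                continue  # already reached this ancestor this way
            inferred.add((term, relation, ancestor))
            for next_relation, next_parent in parents.get(ancestor, []):
                stack.append((compose(relation, next_relation), next_parent))
    return inferred


def implied_terms(term: str, edges) -> set:
    """Terms an annotation to `term` also counts toward."""
    ancestors = {anc for child, _, anc in inferred_relations(edges)
                 if child == term}
    return {term} | ancestors


# Hypothetical gene product "geneX" annotated to the most specific term.
for implied in sorted(implied_terms("mitochondrial chromosome", EDGES)):
    print("geneX", "->", implied)
```

This is why every relationship in the ontology must be true: a single wrong part_of or is_a edge would silently attach every annotation below it to the wrong ancestors.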