Educational Materials ©2004–2006 R. Gentleman

Metadata in Bioconductor

Overview

Integrative biology: fuse information from biological research databases of many different types.

In this lecture we focus attention on resources that can help to make use of metadata in different analyses.

Alternative Strategies

The biomaRt package provides a set of tools that allow you to access online databases such as Ensembl, Vega, Uniprot, MSD, Wormbase.

We have begun development of a new set of metadata packages that make use of a database (SQLite) rather than R’s environments. This will result in smaller size, slightly slower access, but much more general queries.

Make use of the chip manufacturer’s resources (for Affymetrix this is the NetAffx site).

Per chip annotation

An early design decision was that we should provide metadata on a per chip-type basis.

> library("hgu133a")
> ls("package:hgu133a")
 [1] "hgu133a"             "hgu133aACCNUM"
 [3] "hgu133aCHR"          "hgu133aCHRLENGTHS"
 [5] "hgu133aCHRLOC"       "hgu133aENZYME"
 [7] "hgu133aENZYME2PROBE" "hgu133aGENENAME"
 [9] "hgu133aGO"           "hgu133aGO2ALLPROBES"
[11] "hgu133aGO2PROBE"     "hgu133aLOCUSID"
[13] "hgu133aMAP"          "hgu133aMAPCOUNTS"
[15] "hgu133aOMIM"         "hgu133aORGANISM"
[17] "hgu133aPATH"         "hgu133aPATH2PROBE"
[19] "hgu133aPFAM"         "hgu133aPMID"
[21] "hgu133aPMID2PROBE"   "hgu133aPROSITE"
[23] "hgu133aQC"           "hgu133aQCDATA"
[25] "hgu133aREFSEQ"       "hgu133aSUMFUNC"
[27] "hgu133aSYMBOL"       "hgu133aUNIGENE"

A brief description

These packages contain R environments, which are used as hash tables.

For each package, data provenance information is provided, e.g. ?hgu133a.

Quality control information is available, e.g. hgu133a(). This reports how many of each of the different types of mappings were found.

Accessing annotation packages

You can access the data directly using any of the standard subsetting or extraction tools for environments: get, mget, $ and [[.

> hgu133aSYMBOL$"201473_at"
[1] "JUNB"
> hgu133aSYMBOL[["201473_at"]]
[1] "JUNB"
> get("201473_at", hgu133aSYMBOL)
[1] "JUNB"
> mget(c("201473_at", "201476_s_at"), hgu133aSYMBOL)
$`201473_at`
[1] "JUNB"

$`201476_s_at`
[1] "RRM1"
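The four access idioms can be tried without installing an annotation package. The following is a minimal sketch that uses a small base R environment as a stand-in for hgu133aSYMBOL; the name symbolEnv is hypothetical, and only the two probe-to-symbol pairs are taken from the slide.

```r
# Toy stand-in for an annotation environment such as hgu133aSYMBOL.
# symbolEnv is a hypothetical name; the two key/value pairs are from the slide.
symbolEnv <- new.env(hash = TRUE)
assign("201473_at", "JUNB", envir = symbolEnv)
assign("201476_s_at", "RRM1", envir = symbolEnv)

# Three equivalent ways to look up a single key.
viaGet    <- get("201473_at", envir = symbolEnv)
viaDollar <- symbolEnv$"201473_at"
viaBrack  <- symbolEnv[["201473_at"]]

# mget looks up several keys at once and returns a named list.
viaMget <- mget(c("201473_at", "201476_s_at"), envir = symbolEnv)
print(viaMget)

# exists() tests for a key without raising an error on a miss.
hasKey <- exists("no_such_probe", envir = symbolEnv, inherits = FALSE)
```

Because get() stops with an error on an unknown key, checking with exists() first (or using mget with its ifnotfound argument) is one way to handle probes that may be absent.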

Metadata I

EntrezGene is a catalog of genetic loci that connects curated sequence information to official nomenclature. It replaced LocusLink.

UniGene defines sequence clusters. UniGene focuses on protein-coding genes of the nuclear genome (excluding rRNA and mitochondrial sequences).

RefSeq is a non-redundant set of transcripts and proteins of known genes for many species, including human, mouse and rat.

Enzyme Commission (EC) numbers are assigned to different enzymes and linked to genes through EntrezGene.

Metadata II

Gene Ontology (GO) is a structured vocabulary of terms describing gene products according to molecular function, biological process, or cellular component.

PubMed is a service of the U.S. National Library of Medicine. PubMed provides a rich resource of data and tools for papers in journals related to medicine and health. While large, the data source is not comprehensive, and not all papers have been abstracted.

LITDB is curated by the Protein Research Foundation, and covers all articles dealing with peptides from journals accessible in Japan.

Metadata III

OMIM: Online Mendelian Inheritance in Man is a catalog of human genes and genetic disorders.

NetAffx: Affymetrix’ NetAffx Analysis Center provides annotation resources for Affymetrix GeneChip technology.

KEGG: Kyoto Encyclopedia of Genes and Genomes; a collection of data resources including a rich collection of pathway data.

cMAP: Pathway data from both KEGG and BioCarta, in a computable form.

Metadata IV

Data Archives: The NCBI coordinates the Gene Expression Omnibus (GEO); TIGR provides the Resourcerer database, and the EBI supports ArrayExpress.

Chromosomal Location: Genes are identified with chromosomes, and where appropriate with strand.

Working with Metadata

Suppose we are interested in the gene BAD.

> gsyms <- unlist(as.list(hgu95av2SYMBOL))
> whBAD <- grep("^BAD$", gsyms)
> gsyms[whBAD]
1861_at
  "BAD"
> hgu95av2GENENAME$"1861_at"
[1] "BCL2-antagonist of cell death"
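The reverse lookup (from a gene symbol back to its probe) works on any environment once it is flattened to a named vector. A small sketch, assuming a made-up environment symEnv in which only the pair "1861_at" and "BAD" comes from the slide; the other two entries are invented to show why the pattern is anchored and how missing symbols behave.

```r
# Hypothetical probe -> symbol environment; only "1861_at" -> "BAD" is from the slide.
symEnv <- new.env()
assign("1861_at", "BAD", envir = symEnv)
assign("0001_at", "BADX", envir = symEnv)          # invented: symbol that merely contains "BAD"
assign("0002_at", NA_character_, envir = symEnv)   # invented: probe without a symbol

# Flatten the environment to a named character vector (names are the probe IDs).
gsyms <- unlist(as.list(symEnv))

# Anchored pattern: matches the symbol "BAD" exactly, not "BADX".
exactHits <- names(gsyms)[grep("^BAD$", gsyms)]

# Unanchored pattern: also picks up "BADX".
looseHits <- names(gsyms)[grep("BAD", gsyms)]

# Plain equality works too, but NA entries have to be excluded explicitly.
eqHits <- names(gsyms)[!is.na(gsyms) & gsyms == "BAD"]
```

The order of elements returned by as.list on an environment is not guaranteed, so results are best compared as sets or sorted first.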

BAD Pathways

Find the pathways that BAD is associated with.

> BADpath <- hgu95av2PATH$"1861_at"
> mget(BADpath, KEGGPATHID2NAME)
$`01510`
[1] "Neurodegenerative Disorders"

$`04210`
[1] "Apoptosis"

$`04510`
[1] "Focal adhesion"

$`04910`
[1] "Insulin signaling pathway"

$`05030`
[1] "Amyotrophic lateral sclerosis (ALS)"

BAD Pathways

We can get the GeneChip probes and the unique EntrezGene loci in each of these pathways.

> allProbes <- mget(BADpath, hgu95av2PATH2PROBE)
> str(allProbes)
List of 5
 $ 01510: chr [1:63] "38974_at" "33831_at" "39334_s_at" "40489_at" ...
 $ 04210: chr [1:151] "40781_at" "32477_at" "31647_at" "35018_at" ...
 $ 04510: chr [1:324] "40781_at" "33814_at" "32477_at" "34042_at" ...
 $ 04910: chr [1:192] "40781_at" "40635_at" "40636_at" "37136_at" ...
 $ 05030: chr [1:29] "37033_s_at" "34336_at" "32512_at" "32513_at" ...
> getEG = function(x) unique(unlist(mget(x, hgu95av2LOCUSID)))
> allEG = sapply(allProbes, getEG)
> sapply(allEG, length)
01510 04210 04510 04910 05030
   33    87   189   127    16
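The difference between the probe counts and the locus counts arises because several probes can map to the same gene. The sketch below reproduces the getEG idea on invented data; path2probe, probe2gene and all identifiers in them are hypothetical, and unlike the slide's version this getEG also drops NA entries, which is an added assumption about probes without a gene ID.

```r
# Hypothetical data: pathway -> probes, and probe -> gene ID.
path2probe <- list("04210" = c("p1_at", "p2_at", "p3_at"),
                   "04510" = c("p1_at", "p4_at"))
probe2gene <- new.env()
assign("p1_at", "g1", envir = probe2gene)
assign("p2_at", "g1", envir = probe2gene)            # second probe for the same gene
assign("p3_at", "g2", envir = probe2gene)
assign("p4_at", NA_character_, envir = probe2gene)   # probe with no gene ID

# Map probes to gene IDs, drop missing values, and remove duplicates.
getEG <- function(x) {
  ids <- unlist(mget(x, envir = probe2gene))
  unique(ids[!is.na(ids)])
}

nProbes <- sapply(path2probe, length)
nGenes  <- sapply(lapply(path2probe, getEG), length)
print(rbind(probes = nProbes, genes = nGenes))
```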

Annotating a Genome

Bioconductor also provides some comprehensive annotations for whole genomes (e.g. S. cerevisiae).

These packages are like the chip annotation packages, except a different set of primary keys is used (e.g. for yeast we use the systematic names such as YBL088C).

> library("YEAST")
> ls("package:YEAST")[1:12]
 [1] "YEAST"                  "YEASTALIAS"
 [3] "YEASTCHR"               "YEASTCHRLENGTHS"
 [5] "YEASTCHRLOC"            "YEASTCOMMON2SYSTEMATIC"
 [7] "YEASTDESCRIPTION"       "YEASTENZYME"
 [9] "YEASTENZYME2PROBE"      "YEASTGENENAME"
[11] "YEASTGO"                "YEASTGO2ALLPROBES"

The annotate package

Functions for harvesting of curated persistent data sources.

Functions for simple HTTP queries to web service providers.

Interface code that provides common calling sequences for the assay based metadata packages.

Functions such as getGI and getSEQ perform web queries to NCBI to extract the GI or nucleotide sequence corresponding to a GenBank accession number.

> ggi <- getGI("M22490")
> gsq <- getSEQ("M22490")
> ggi
[1] "179503"
> substring(gsq, 1, 40)
[1] "GGCAGAGGAGGAGGGAGGGAGGGAAGGAGCGCGGAGCCCG"

The annotate package

Other interface functions include getLL, getSYMBOL, getPMID, and getGO.

Functions whose names start with pm work with lists of PubMed identifiers for journal articles.

> hgu95av2SYMBOL$"37809_at"
[1] "HOXA9"
> pm.getabst("37809_at", "hgu95av2")
$`37809_at`
$`37809_at`[[1]]
An object of class 'pubMedAbst':
Title: Vertebrate homeobox gene nomenclature.
PMID: 1358459
Authors: MP Scott
Journal: Cell
Date: Nov 1992

Working with GO

An ontology is a structured vocabulary that characterizes some conceptual domain.

The Gene Ontology (GO) Consortium defines three ontologies characterizing aspects of knowledge about genes and gene products.

These ontologies are molecular function (MF), biological process (BP), cellular component (CC).

GO

The molecular function of a gene product is what it does at the biochemical level. This describes what the gene product can do, but without reference to where or when this activity actually occurs. Examples of functional terms include “enzyme,” “transporter,” or “ligand.”

A biological process is a biological objective to which the gene product contributes. There is often a temporal aspect to a biological process. Biological processes usually involve the transformation of a physical thing. The terms “DNA replication” or “signal transduction” describe general biological processes.

A cellular component is a part of a cell that is a component of some larger object or structure. Examples of cellular components include “chromosome”, “nucleus” and “ribosome”.

GO Characteristics

Ontology   Number of Terms
BP         10765
CC         1733
MF         7686

Table 1: Number of GO terms per ontology.

GO hierarchy

GO terms can be linked by two relationships:

is a: class-subclass relationship, for example, nuclear chromosome is a chromosome.

part of: C part of D means that when C is present, it is a part of D, but C does not always have to be present. For example, nucleus is part of cell.

The ontologies are structured as directed acyclic graphs. DAGs are similar to hierarchies but a child term can have multiple parent terms. For example, the biological process term hexose biosynthesis has two parents, hexose metabolism and monosaccharide biosynthesis.

Working with GO

Three basic tasks that are commonly performed in conjunction with GO are:

navigating the hierarchy, determining parents and children of selected terms, and deriving subgraphs of the overall DAG constituting GO;

resolving the mapping from GO tag to natural language characterizations of function, location, or process;

resolving the mapping between GO tags or terms and elements of catalogs of genes or gene products.

For precision and conciseness, all indexing of GO resources employs 7-digit tags with prefix GO:, for example GO:0008094.

Navigating the hierarchy

Finding parents and children of different terms is handled by using the PARENT and CHILDREN mappings.

To find the children of "GO:0008094" we use:

> get("GO:0008094", GOMFCHILDREN)
[1] "GO:0003689" "GO:0015616" "GO:0043142" "GO:0004003"

We use the term offspring to refer to all descendants (children, grandchildren, and so on) of a node.

Similarly we use the term ancestor to refer to the parents, grandparents, and so on, of a node.

> get("GO:0008094", GOMFOFFSPRING)
[1] "GO:0003689" "GO:0015616" "GO:0043142" "GO:0004003"
[5] "GO:0017116" "GO:0008722" "GO:0043140" "GO:0043141"
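The relation between a CHILDREN mapping and an OFFSPRING mapping can be illustrated with a short traversal. This is a sketch on an invented four-node DAG, not the way the GO package builds GOMFOFFSPRING; the environment children, the term names A to D, and the convention that a leaf is stored as NA are all assumptions made for the example.

```r
# Hypothetical CHILDREN-style environment: term -> direct children.
children <- new.env()
assign("A", c("B", "C"), envir = children)
assign("B", "D", envir = children)
assign("C", "D", envir = children)             # D has two parents: a DAG, not a tree
assign("D", NA_character_, envir = children)   # assumed convention: NA marks a leaf

# Collect all descendants of a term by repeatedly following child links.
offspring <- function(term, env) {
  seen <- character(0)
  todo <- term
  while (length(todo) > 0) {
    kids <- get(todo[1], envir = env)
    todo <- todo[-1]
    kids <- kids[!is.na(kids)]
    fresh <- setdiff(kids, seen)   # skip terms already reached via another parent
    seen <- c(seen, fresh)
    todo <- c(todo, fresh)
  }
  seen
}

offA <- offspring("A", children)
offD <- offspring("D", children)
print(offA)
```

Ancestors can be obtained with the same loop by swapping the children environment for one that maps each term to its direct parents.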

GO terms

All GO terms are provided in the GOTERM environment.

> GOTERM$"GO:0002021"
GOID = GO:0002021
Term = response to dietary excess
Definition = The physiological process by which dietary excess is sensed by the central nervous system and results in a reduction in food intake and increased energy expenditure.
Ontology = BP

Searching for terms

Let’s search for terms containing the word chromosome using eapply and grep.

> terms = eapply(GOTERM, Term)
> terms[[18]]
[1] "killing of cells of another organism"
> uterms = unlist(terms)
> re = regexpr("chromosome", uterms)
> chrTerms = uterms[re > 0]
> length(chrTerms)
[1] 75
> chrTerms[1]
                                               GO:0009047
"dosage compensation, by hyperactivation of X chromosome"
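The same search can be tried on a small named vector. In this sketch the vector uterms and its identifiers id1 to id4 are placeholders rather than real GO tags; it shows that regexpr returns -1 for non-matches, that grepl gives an equivalent logical index, and that the match is case-sensitive unless asked otherwise.

```r
# Hypothetical term vector; in the lecture this comes from unlist(eapply(GOTERM, Term)).
uterms <- c(id1 = "chromosome segregation",
            id2 = "response to dietary excess",
            id3 = "nuclear chromosome",
            id4 = "Chromosome organization")

# regexpr gives the match position, or -1 where the pattern does not occur.
re <- regexpr("chromosome", uterms)
chrTerms <- uterms[re > 0]

# grepl returns the logical index directly; fixed = TRUE treats the pattern literally.
chrTerms2 <- uterms[grepl("chromosome", uterms, fixed = TRUE)]

# A case-insensitive search also finds "Chromosome organization".
anyCase <- uterms[grepl("chromosome", uterms, ignore.case = TRUE)]
print(names(anyCase))
```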

Evidence Codes

The mapping of genes to GO terms is carried out by GOA, a project run by the European Bioinformatics Institute that aims to provide assignments of gene products to GO terms.

Four environments in the GO package address the association between EntrezGene sequence entries and GO terms:

GOLOCUSID         GO term → EntrezGene ID (non-redundant)
GOALLOCUSID       GO term → EntrezGene ID (incl. implied)
GOLOCUSID2GO      EntrezGene → GO
GOLOCUSID2ALLGO   EntrezGene → GO

GO Evidence Codes

      Abbreviation Definition
 [1,] "IMP"        "inferred from mutant phenotype"
 [2,] "IGI"        "inferred from genetic interaction"
 [3,] "IPI"        "inferred from physical interaction"
 [4,] "ISS"        "inferred from sequence similarity"
 [5,] "IDA"        "inferred from direct assay"
 [6,] "IEP"        "inferred from expression pattern"
 [7,] "IEA"        "inferred from electronic annotation"
 [8,] "TAS"        "traceable author statement"
 [9,] "NAS"        "non-traceable author statement"
[10,] "ND"         "no biological data available"
[11,] "IC"         "inferred by curator"

Find the GO identifier for “transcription factor binding” and use that to get EntrezGene ID with that annotation.

> tfb <- names(which(uterms == "transcription factor binding"))
> gg1 <- get(tfb, GOLOCUSID)
> table(names(gg1))

IDA IEA IMP IPI ISS NAS  NR TAS
 17  51   1  28  30   4   1  30

Ontology

Consider the gene with EntrezGene ID 7355, SLC35A2;

> z <- get("7355", GOLOCUSID2GO)
> length(z)
[1] 11
> sapply(z, "[[", "Ontology")
GO:0006012 GO:0008643 GO:0015780 GO:0015785
      "BP"       "BP"       "BP"       "BP"
GO:0000139 GO:0005795 GO:0016020 GO:0016021
      "CC"       "CC"       "CC"       "CC"
GO:0005338 GO:0005351 GO:0005459
      "MF"       "MF"       "MF"

there are 11 different GO terms. We get those from the BP ontology by using the helper function getOntology.

> getOntology(z, "BP")
[1] "GO:0006012" "GO:0008643" "GO:0015780" "GO:0015785"

Evidence Codes

We get the evidence codes using getEvidence and can drop codes using dropECode:

> getEvidence(z)
GO:0006012 GO:0008643 GO:0015780 GO:0015785
     "TAS"      "IEA"      "IEA"      "TAS"
GO:0000139 GO:0005795 GO:0016020 GO:0016021
     "IEA"      "IEA"      "IEA"      "IEA"
GO:0005338 GO:0005351 GO:0005459
     "IEA"      "IEA"      "TAS"
> zz <- dropECode(z, code = "IEA")
> getEvidence(zz)
GO:0006012 GO:0015785 GO:0005459
     "TAS"      "TAS"      "TAS"
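Filtering annotations by evidence code or ontology amounts to filtering a list of records. The sketch below is an illustrative stand-in written in base R, not the implementation used by the annotate package; the helper names dropCode and keepOntology are hypothetical, and the record layout (fields GOID, Evidence and Ontology) is an assumption about what an entry of GOLOCUSID2GO looks like. The three records reuse values shown on the slides.

```r
# Assumed record layout for GO annotations of one gene (three of the eleven shown).
z <- list(
  "GO:0006012" = list(GOID = "GO:0006012", Evidence = "TAS", Ontology = "BP"),
  "GO:0008643" = list(GOID = "GO:0008643", Evidence = "IEA", Ontology = "BP"),
  "GO:0005459" = list(GOID = "GO:0005459", Evidence = "TAS", Ontology = "MF"))

# Hypothetical helper: remove annotations carrying any of the given evidence codes.
dropCode <- function(x, code) {
  Filter(function(a) !(a$Evidence %in% code), x)
}

# Hypothetical helper: GO identifiers of the annotations in one ontology.
keepOntology <- function(x, ont) {
  names(Filter(function(a) a$Ontology == ont, x))
}

zz <- dropCode(z, "IEA")
evid <- sapply(zz, "[[", "Evidence")
bp <- keepOntology(z, "BP")
print(evid)
```

Dropping IEA first is a common choice when only annotations backed by curated statements are wanted, since IEA marks electronically inferred annotations.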

GO graphs

For any set of selected genes, and any of the three GO ontologies the induced GO graph is the set of GO terms that the genes are associated with, together with all less specific terms.

The term “transcription factor activity” is in the molecular function (MF) ontology and has the GO label GO:0003700.

> library("GO")
> library("GOstats")
> GOTERM$"GO:0003700"
GOID = GO:0003700
Term = transcription factor activity
Secondary = GO:0000130
Definition = Any activity required to initiate or regulate transcription; includes the actions of both gene regulatory proteins as well as the general transcription factors.
Ontology = MF

Induced GO graph

The induced graph, based on the MF hierarchy, can be produced using the GOGraph function of the GOstats package.

> tfG <- GOGraph("GO:0003700", GOMFPARENTS)

We can plot the induced GO graph using Rgraphviz and the code below.

> library("Rgraphviz")
> tfG = removeNode("all", tfG)
> mt = match(nodes(tfG), names(terms))
> stopifnot(!any(is.na(mt)))
> nattr <- makeNodeAttrs(tfG, label = terms[mt], shape = "ellipse",
+     fillcolor = "#f2f2f2", fixedsize = FALSE)
> plot(tfG, nodeAttrs = nattr)

GO graph

[Figure: GO relationships for term “transcription factor activity”. Nodes: molecular_function, binding, nucleic acid binding, DNA binding, transcription regulator activity, transcription factor activity.]
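To make the idea of an induced graph concrete, the following sketch walks a PARENTS-style mapping upward from a starting term and records every node and edge it meets. It is a base R illustration, not what GOGraph does internally; the function inducedGraph is hypothetical, and the parent links between the six node labels are assumptions chosen for the example rather than values read from GO.

```r
# Hypothetical PARENTS-style environment: term -> direct parents (edges are assumed).
parents <- new.env()
assign("transcription factor activity",
       c("DNA binding", "transcription regulator activity"), envir = parents)
assign("DNA binding", "nucleic acid binding", envir = parents)
assign("nucleic acid binding", "binding", envir = parents)
assign("binding", "molecular_function", envir = parents)
assign("transcription regulator activity", "molecular_function", envir = parents)
assign("molecular_function", character(0), envir = parents)   # root has no parents

# Collect the starting terms plus all less specific terms, and the child -> parent edges.
inducedGraph <- function(terms, env) {
  edges <- matrix(character(0), ncol = 2,
                  dimnames = list(NULL, c("child", "parent")))
  todo <- unique(terms)
  done <- character(0)
  while (length(todo) > 0) {
    node <- todo[1]
    todo <- todo[-1]
    done <- c(done, node)
    ps <- get(node, envir = env)
    if (length(ps) > 0) {
      edges <- rbind(edges, cbind(child = node, parent = ps))
    }
    # Queue parents that have not been visited and are not already queued.
    todo <- union(todo, setdiff(ps, done))
  }
  list(nodes = done, edges = edges)
}

g <- inducedGraph("transcription factor activity", parents)
print(g$edges)
```

Passing several starting terms gives the union of their ancestor sets, which matches the definition of the induced graph for a set of genes once their GO terms have been collected.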

Induced GO graphs

> tfch <- GOMFCHILDREN$"GO:0003700"
> tfch
[1] "GO:0003705"
> tfchild <- mget(tfch, GOTERM)
> tfchild
$`GO:0003705`
GOID = GO:0003705
Term = RNA polymerase II transcription factor activity, enhancer binding
Definition = Functions to initiate or regulate RNA polymerase II transcription by binding a promoter or enhancer region of DNA.
Ontology = MF

KEGG

KEGG provides mappings from genes to pathways.

We provide these in the package KEGG; you can also query the site directly using KEGGSOAP or other software.

One problem with KEGG is that the data is not in a form that is amenable to computation. The cMAP project provides data that is somewhat more useful for constructing networks.

Data in KEGG package

KEGGPATHID2NAME provides mapping from KEGG pathway ID to a textual description of the pathway. Only the numeric part of the KEGG pathway identifiers is used (not the three letter species codes).

KEGGEXTID2PATHID provides mapping from either EntrezGene (for human, mouse and rat) or Open Reading Frame (yeast) to KEGG pathway ID.

KEGGPATHID2EXTID contains the mapping in the other direction.

Counts per species

Species   ath   dme   eat   hsa   mmu   rno   sce
Counts    111   118    84   172   168   160   100

Table 2: Pathway Counts Per Species

Exploring KEGG

Consider pathway 00362.

> KEGGPATHID2NAME$"00362"
[1] "Benzoate degradation via hydroxylation"

Species specific mapping from pathway to genes is indicated by glueing together three letter species code, e.g. hsa, and numeric pathway code.

> KEGGPATHID2EXTID$hsa00362
[1] "10449"  "30"     "3032"   "347381" "59344"  ...
[7] "1891"   "83875"
> KEGGPATHID2EXTID$sce00362
[1] "YIL160C" "YKR009C"
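Building and taking apart these species-specific keys is plain string handling. A minimal sketch, assuming (as in the examples on the slides) that a key is a three letter lower-case species code followed by a five digit pathway number; the variable names are hypothetical.

```r
# Glue species codes onto a numeric pathway code to form lookup keys.
species <- c("hsa", "sce")
pathway <- "00362"
keys <- paste0(species, pathway)
print(keys)

# Split a species-specific key back into its two parts.
# Assumed layout: three lower-case letters, then five digits.
key <- "hsa04010"
speciesPart <- sub("^([a-z]{3})([0-9]{5})$", "\\1", key)
numericPart <- sub("^([a-z]{3})([0-9]{5})$", "\\2", key)

# A key that does not follow the assumed layout can be detected before use.
isValid <- grepl("^[a-z]{3}[0-9]{5}$", c(key, "04010", "human04010"))
```

The numeric part is the form expected as a key for KEGGPATHID2NAME, while the full glued key is the form used with KEGGPATHID2EXTID.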

Exploring KEGG

PAK1 has EntrezGene ID 5058 in humans.

> KEGGEXTID2PATHID$"5058"
[1] "hsa04010" "hsa04360" "hsa04510" "hsa04650" "hsa04660"
[6] "hsa04810" "hsa05120"
> KEGGPATHID2NAME$"04010"
[1] "MAPK signaling pathway"

We find that it is involved in 7 pathways. For mice, the MAPK signaling pathway is stated to contain 323 genes; the mapping used here lists 266 identifiers.

> mm <- KEGGPATHID2EXTID$mmu04010
> str(mm)
 chr [1:266] "102626" "109689" "109880" "109905" ...

cMAP

The cancer Molecular Analysis Project (cMAP) provides software and molecular data relevant to cancer.

cMAP provides pathway data in a format that is amenable to computational manipulation.

> keggproc <- eapply(cMAPKEGGINTERACTION, "[[", "process")
> table(unlist(keggproc))

reaction
    4207
> cartaproc <- eapply(cMAPCARTAINTERACTION, "[[", "process")
> z = table(unlist(cartaproc))
> length(z)
[1] 121
> z[order(-z)[1:6]]

          modification          translocation          transcription
                  2123                    266                    236
           degradation              apoptosis protein ubiquitination
                    60                     41                     37

Homology

Two genes are said to be homologous if they have descended from a common ancestral DNA sequence.

There is one homology package for each species; its name is a three letter species name (e.g. hsa) and the suffix homology.

The current system is going to be changed and improved. For the time being, one possible alternative will be described in the biomaRt lecture.