Introduction to Genome Informatics
Genome Informatics
Systems Biology and the Omics Cascade (Course 2143)
Day 3, June 11th, 2008
Kiyoko F. Aoki-Kinoshita

Introduction
Genome informatics covers the computer-based modeling and data processing of genome-based data.
This includes databases and resources for genomic analysis.
You were introduced to KEGG on Day 2.
Some other useful databases and resources will be covered today.

But first! Data formats
It is usually not enough to simply look at the data provided by databases.
To actually use the data for analysis, one often needs to save the retrieved data.
This requires knowledge about the data formats used by each database.
So we will cover the major data formats used in bioinformatics.

Data Formats
Major data formats:
– GenBank
– EMBL/UniProt
– FASTA
– PDB
Formats suited for programming:
– ASN.1 (Abstract Syntax Notation One)
– XML (eXtensible Markup Language)

GenBank format
Each line starts with a keyword in capital letters.
Each keyword is followed by whitespace and the information corresponding to it.
Some keywords are hierarchical: subkeywords such as ORGANISM appear indented under a parent keyword such as SOURCE.

EMBL format
Similar to GenBank, except that keywords are two-letter IDs.
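As a rough illustration of the two-letter line codes, the sketch below groups the lines of an EMBL-style record by their code. The function name `parse_embl_lines` and the toy record are invented for illustration (the record is not a real database entry). The handling of `XX` spacer lines and the `//` terminator is an assumption based on the usual flat-file layout, not something stated on the slides.

```python
from collections import defaultdict

# Toy record, invented for illustration only (not a real database entry).
SAMPLE = """\
ID   EXAMPLE1
XX
AC   X00000;
XX
DE   Made-up example record
DE   for illustration only
XX
SQ   Sequence 20 BP;
     ggtggttact aaccgtaaga        20
//
"""


def parse_embl_lines(text):
    """Group the lines of one EMBL-style record by their two-letter code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            # Assumed record terminator: stop reading this record.
            break
        code = line[:2]
        # Skip blank lines, XX spacer lines, and lines that do not begin
        # with a two-letter uppercase code (e.g. indented sequence data).
        if len(code) < 2 or code == "XX" or not (code.isalpha() and code.isupper()):
            continue
        fields[code].append(line[2:].strip())
    return dict(fields)


if __name__ == "__main__":
    for code, values in parse_embl_lines(SAMPLE).items():
        print(code, values)
```

Because a code such as `DE` may repeat, each code maps to a list of values rather than a single string.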
UniProt's format is similar to the EMBL format.

FASTA sequence format
>Randseq1 first randomly generated seq
GGTGGTTACTAACCGTAAGAGATGATGTCGCCGTGGTCGCGTGGC
GCCGCGGACCCAGATTGTACTTCTCTGAGTCGTTCTAGATCGACC
AGTCTTCTAGCTTGCCCGTGAGGTATGGGG
AGCCGCATATTGCCCACAAT
>Randseq2 second randomly generated seq
GCGACGCGTCTCTACACCAGACGCTTCTGTTGAGGAAGAGTGCCT
GAGTGCAGGTCCTCGAGAACCCACTGGAACTTGAAGGGCGCGTCT
CACTGGTCGTGAGAAGGCTCCGTCGATACG
AAAGTCCATGCCAAGGACAT
>Randseq3 third randomly generated seq
GGCGAGTCTGAACTCACAAATATTGCACGAGAGTTTAGTGTATGT
TCCTCTTAGGCTGATAACAATAGTTTAGTGAGCGGAAATGCAACC
GCGAGGCGGTCCCCTGCGCTTGTAATGGCC
ACCTGTTGCCCGTCGGATAT

Nucleic acid code for FASTA
A → adenosine              M → A C (amino)
C → cytidine               S → G C (strong)
G → guanine                W → A T (weak)
T → thymidine              B → G T C
U → uridine                D → G A T
R → G A (purine)           H → A C T
Y → T C (pyrimidine)       V → G C A
K → G T (keto)             N → A G C T (any)
- → gap of indeterminate length

Amino acid code for FASTA
A  alanine                   P  proline
B  aspartate or asparagine   Q  glutamine
C  cysteine                  R  arginine
D  aspartate                 S  serine
E  glutamate                 T  threonine
F  phenylalanine             U  selenocysteine
G  glycine                   V  valine
H  histidine                 W  tryptophan
I  isoleucine                Y  tyrosine
K  lysine                    Z  glutamate or glutamine
L  leucine                   X  any
M  methionine                *  translation stop
N  asparagine                -  gap of indeterminate length

PDB format
Similar to GenBank, using different keywords.
Includes 3-dimensional coordinates of amino acids.

ASN.1 format
Hierarchical data format.
Groups are delineated by curly brackets.
Data type names precede the brackets.
Data within a group are separated by commas.

XML format
Hierarchical data format.
Tags define the type of data:
– Opening tag: <name>
– Closing tag: </name>
Data are delineated by opening and closing tags.

Data format converter
READSEQ
– URL: http://thr.cit.nih.gov/molbio/readseq/

Types of databases
Data resources of multiple types of data
– EBI (European Bioinformatics Institute)
– NCBI (National Center for Biotechnology Information)
– KEGG (Kyoto Encyclopedia of Genes and Genomes)
Gene and protein information
– GenBank, UniProt, and PDB
– Species specific: FlyBase, dictyBase, etc.
Ontological data
– Gene Ontology
Pathway data
– KEGG PATHWAY, Reactome, BRENDA, etc.
Protein-protein interaction data
– IntAct, BioGRID, etc.

EBI
http://www.ebi.ac.uk/
European base of molecular biology information, including genomic, gene expression, and literature information.

NCBI
http://www.ncbi.nlm.nih.gov/
Contains public databases of molecular biology information including genomes, microarray gene expression, protein sequence domains, etc.
Develops software for analyzing genome data, including BLAST.
Provides PubMed, a database of citations and abstracts from biomedical and life science journals.

Gene and protein databases
UniProt
– Universal Protein Resource
– http://beta.uniprot.org/
– Consists of three components:
  The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-reference.
  The UniProt Reference Clusters (UniRef) databases combine closely related sequences into a single record to speed searches.
  The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.

The UniRef databases provide clustered sets of sequences from UniProt (including splice variants and isoforms) and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view.

The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues (from any organism) into a single UniRef entry, displaying the sequence of a representative protein, the accession numbers of all the merged entries, and links to the corresponding UniProtKB and UniParc records.

Types of databases
Data resources of multiple types of data
– EBI (European Bioinformatics Institute)
– NCBI (National Center for Biotechnology Information)
– KEGG (Kyoto Encyclopedia of Genes and Genomes)
Gene and protein information
– GenBank, UniProt, and PDB
– Species specific: FlyBase, dictyBase, etc.
Ontological data
– Gene Ontology
Pathway data
– KEGG PATHWAY, Reactome, BRENDA, etc.
Protein-protein interaction data
– IntAct, BioGRID, etc.

Gene and protein databases
GenBank
– A part of NCBI
– http://www.ncbi.nlm.nih.gov/
– Search can be performed through the Entrez interface, which searches for the query in all databases available at the NCBI

Gene and protein databases
PDB: Protein Data Bank
– Contains 3-dimensional protein structures
– http://www.rcsb.org
– Data submitted by individual researchers

Databases of Model Organisms
Model organisms are those whose genomes have been studied extensively, so that the workings of biological phenomena in more complex organisms can be inferred.
For example, the mouse has been the most studied organism for understanding other mammalian systems.
For plant species, Arabidopsis thaliana is most often used as a model organism.

Databases of model organisms
Arabidopsis (mustard plant)
– TAIR: The Arabidopsis Information Resource http://www.arabidopsis.org/
– The Carnegie Institution of Washington, the National Center for Genome Resources
C. elegans
– WormBase http://www.wormbase.org/
– Cold Spring Harbor Laboratory
Dictyostelium (slime mold)
– dictyBase http://dictybase.org/
– Northwestern University

Databases of model organisms
Drosophila (fruit fly)
– FlyBase http://www.flybase.org/
– Indiana University
Mouse