Bioinformatics Explained

Bioinformatics explained: Biological Databases February 12, 2008

CLC bio
Gustav Wieds Vej 10
8000 Aarhus C
Denmark
Telephone: +45 70 22 55 09
Fax: +45 70 22 55 19
www.clcbio.com
[email protected]

Biological databases

Large scale research projects are becoming part of both technical and scientific research, and the amount of data gained from research projects in laboratories all over the world is growing faster and faster. As the output increases heavily, so does the need for the ability to store and organize these data efficiently. Databases are needed for handling the information from the biological data obtained in the lab, in order to find and keep the data in efficient and optimal ways.

You can find databases related to different research subjects and data types, e.g. nucleotides, proteins, and taxonomy. Within each field, several databases present their data collections with different file formats and for different purposes. Some databases are large and global, and their content is maintained and kept up to date by a responsible organization, while others are small and local, maybe contain data from a specific project only, and are maintained for a limited period of time while the project is going on.

This paper concentrates on describing some well-known public databases on the Internet. The purpose is to give a short overview of a few large databases and of the possibilities to be aware of when searching for data.

Introduction

Starting out with any research project, it is often required to gain information on the subject investigated. This can be through books, fellow researchers, or online resources. Many researchers struggle to find relevant information on the Internet. The problem is the vast amount of information: it is, in varying degrees, tedious and demanding to find the relevant information. Depending on the nature of the research project, information can be found in several databases, and it is time-saving to get help to navigate through the tens of website pages to collect data.

The Internet has become a very important resource for a research project, but at the same time basic knowledge is required to filter and interpret the information, which may be very demanding in terms of time. Entire issues of journals have been dedicated to web resources and databases. Each January, Nucleic Acid Research publishes an entire issue on publicly available databases, and a July issue is dedicated to webservers. Check the section below for additional information.

What is biological data?

Biological data comes in many different flavors and formats. Often researchers work with a number of different data types, even though they may not realize this at first hand. Depending on the research project, you have to look at PubMed and at sequence data, interpret gel analyses, and finally write everything together in a paper or a report.

Below is a brief list of some of the data which can be found when researching any biological question. Most of these data can be found in databases on related or partly overlapping topics, and some of the databases refer to each other.

• Text. Examples of text databases are PubMed and OMIM, containing textual information and references related to biological data.
• Sequence data. GenBank and UniProt exemplify biological databases containing DNA and protein sequences, respectively.
• Protein structure. You can also find databases specifically related to protein structure files, e.g. the PDB, SCOP and CATH databases.
• Biological matter. Frozen bacterial strains, vectors etc. are also to be found in databases collecting information on each of these specific biological matters, e.g. the UniVec database hosted by NCBI.
• Numerical data. Gene expression data as well as other microarray data are also accessible from a number of databases. An example is the ArrayExpress database of the European Bioinformatics Institute, EBI.
• Images. In the field of 2D gel and microscopic images you can also find various databases containing e.g. reference data on identified gel images.
• Links. Most databases contain information on sequence data within a specific field or subject. A different type of database is e.g. the InterPro database, consisting of a collection of links from protein domains and families to other databases providing related resources.

The following sections give a more detailed description of some of the databases mentioned above.

Sequence databases

With the current speed of sequence projects, a lot of work is required to store and organize sequence data. Most databases store additional information along with the sequence. That could be annotated regions, conflicting sequence residues, information on species, references to the original research papers where the work was published, and much more. So far, however, a common standard for handling all this information has not been created. Thus every database has its own standard on how to store information.

Figure 1: Growth of the GenBank database (source: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt).

Most data is stored in a plain text format (flatfile) and can thus be opened by standard software like Word, Notepad etc. However, large amounts of information stored as plain text may not be easy to comprehend. Another problem with storing in flatfile format is the size of the database. Databases with thousands of sequence entries may become too large to handle on a normal PC for most users.

An alternative approach used by most websites with large databases is to store all the information in a relational database. Relational databases have tables with connections or pointers to additional data. Thus one can easily and very quickly retrieve a large amount of information on one particular sequence.
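The difference between keeping records in a flat file and keeping them in a relational database can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how GenBank or any other public database is implemented: the FASTA-like record layout, the table names `sequence` and `annotation`, and the two sample records are all hypothetical and were made up for this example.

```python
import sqlite3

# Hypothetical flat-file content in a FASTA-like layout (an assumption made
# for illustration; real database flat files carry many more fields).
FLATFILE = """\
>SEQ1 Example organism A
ATGGCCATTG
TAATGGGCC
>SEQ2 Example organism B
GGGTTTAAAC
"""


def parse_flatfile(text):
    """Yield (accession, description, residues) for each record in the text."""
    accession, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if accession is not None:
                yield accession, description, "".join(chunks)
            header = line[1:].split(maxsplit=1)
            accession = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        else:
            chunks.append(line)
    if accession is not None:
        yield accession, description, "".join(chunks)


def load_into_relational(records):
    """Store the parsed records in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequence ("
        "accession TEXT PRIMARY KEY, description TEXT, residues TEXT)"
    )
    # The annotation table points back to a sequence through its accession.
    conn.execute(
        "CREATE TABLE annotation ("
        "accession TEXT REFERENCES sequence(accession), "
        "feature TEXT, start INTEGER, stop INTEGER)"
    )
    conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)", records)
    return conn


conn = load_into_relational(parse_flatfile(FLATFILE))
conn.execute("INSERT INTO annotation VALUES ('SEQ1', 'start_codon', 1, 3)")

# One query gathers the sequence and the data linked to it.
rows = conn.execute(
    "SELECT s.accession, length(s.residues), a.feature "
    "FROM sequence AS s JOIN annotation AS a ON a.accession = s.accession"
).fetchall()
print(rows)
```

In the flat file, everything about a record has to be found by reading through the text. In the relational form, the `annotation` table points at a `sequence` row, so a single join returns the linked information.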
There are many other sequence databases, but here we will only focus on what we think are the four major databases. One of the characteristics of these four databases is that they are maintained and kept up to date on a regular basis. The four databases are described below.

• GenBank. A US-based comprehensive collection of various biological data.
• EMBL. The main European resource of nucleotide sequence data.
• DDBJ. The DNA Data Bank of Japan.
• UniProt. The universal protein resource.

GenBank at NCBI

Hosted at http://www.ncbi.nlm.nih.gov/, the National Institutes of Health has achieved a dominant position in collecting and storing biological data. In addition to sequence data, NCBI stores almost all kinds of biological data [Wheeler et al., 2006]. PubMed is probably the most used service that NCBI offers, together with BLAST, an option for searching for homologous sequences in the entire database [Altschul et al., 1990]. Moreover, the NCBI staff provides users with software tools for handling sequence data.

EMBL

The EMBL Nucleotide Sequence Database at http://www.ebi.ac.uk/embl/ is hosted by EBI, the European Bioinformatics Institute, at the European Molecular Biology Laboratory (EMBL). DNA and RNA sequences are submitted directly to the database by individual researchers as well as by sequencing projects and from patent applications. The database is produced and maintained by EMBL, collaborating with both GenBank and the DNA Data Bank of Japan (DDBJ). The international collection of sequence data is exchanged between the three on a daily basis, and DNA sequence information can be retrieved from any of the three global databases.

DNA Data Bank of Japan

DDBJ (DNA Data Bank of Japan) is a nucleotide database hosted in Japan, accepting DNA submissions mainly from Japanese researchers. They work in close collaboration with GenBank and EMBL, and the three databases store almost identical data. DDBJ also provides various search and analysis tools through the website http://www.ddbj.nig.ac.jp/.
UniProt

UniProt is the universal protein resource, and the database intends to be both comprehensive and of high quality, as stated on its website: "The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information." At the UniProt website, http://www.uniprot.org/, data has been divided into three classifications (core data, supporting data, and information), and you can search in these three categories. You can also do BLAST searches and create alignments, and a couple of other services are also provided. The UniProt Knowledgebase, UniProtKB, contains all publicly available protein sequences and also contains the translations of the coding sequences submitted to EMBL, GenBank and DDBJ [Consortium, 2007].

Other valuable databases and resources

PubMed

PubMed gives you biological data in text format and links to more than 17 million resources from different journals within the field of life science. This service is provided by the U.S. National Library of Medicine. A relatively new functionality at the NCBI website is My NCBI, which is a service offering the possibility to sign up for a customized and automated PubMed update. After registration for an account at My NCBI you can save your searches and set up automated searches alerting you by e-mail. You can also customize e.g. the filtering options on the searches. PubMed can be accessed at http://www.ncbi.nlm.nih.gov/pubmed/.

Ensembl

Ensembl is a project developing software for automatic annotation of eukaryotic genomes. EMBL - EBI and the Sanger Institute are behind the project, and at the website, http://www.ensembl.org/index.html, you can search within all the data from the Ensembl project, divided into species.

One of the larger European bioinformatic centers, the European Bioinformatics Institute (EBI), hosts a number of databases and a lot of methods to help analyze all this data. The EBI website, http://www.ebi.ac.uk/, also stores the EMBL nucleotide sequence database (http://www.ebi.ac.uk/embl/) [Kulikova et al., 2007].

InterPro

Most of the databases mentioned above also try to provide links to related information in other databases. Nevertheless, the InterPro database links to a larger extent from one protein domain or family to a number of different databases which individually contain a lot of relevant information [Mulder et al., 2007]. The InterPro database is largely a mesh of hyperlinks to various other resources but does not contain any sequence information. Link to InterPro: http://www.ebi.ac.uk/interpro/
Pfam

A very useful database for finding protein domains is the Pfam database [Finn et al., 2006]. Pfam currently stores information on more than 9000 protein families. When working on an unknown protein sequence it is often very valuable to retrieve information on functional domains within the protein, as protein family identification can benefit your knowledge about the role and function of the actual protein. You can access the Pfam database at http://pfam.janelia.org/.

Structure databases

Information about protein structure is not developing as fast as sequence data, due to the slower pace in solving 3 dimensional structures of proteins.

The RCSB Protein Data Bank, hosted at http://www.pdb.org, holds slightly more than 48000 structures. At the website you can download structure files, and you are provided with a number of tools for structure studies.

SCOP: Structural Classification of Proteins is accessible at http://scop.berkeley.edu/. The SCOP database describes structural and evolutionary relationships between all known protein structures and also provides a number of links to other on-line resources related to protein structure and to sequence databases in general.

CATH Protein Structure Classification. The CATH database, hosted at http://www.cathdb.info/, classifies protein structures from the PDB according to a four-level hierarchy.

Species-specific databases

In addition to all of the databases mentioned above, it is possible to find a number of species-specific databases. Such databases usually hold very detailed information about only one particular species. A few examples are:

• Colibase. An E. coli database. http://colibase.bham.ac.uk/
• Flybase. A drosophila database. http://flybase.bio.indiana.edu/
• Wormbase. A database for C. elegans and other nematodes. http://www.wormbase.org/
• TAIR. The Arabidopsis Information Resource. http://www.arabidopsis.org/

Retrieving information

Unfortunately, there are no shortcuts for retrieving all information on one sequence in only one step. The vast abundance of data on the Internet has made it impossible to cope with all information related to one specific DNA or protein sequence. Nevertheless, the authors of one web page are trying to retrieve a lot of related information simply by displaying information from various databases on one single web page: http://harvester.embl.de/.

Another retrieval system is the well-known SRS system from Lion Bioscience. Using SRS, one can put together advanced text queries across a number of different databases. Many research groups are also trying to develop methods and algorithms to be able to find relations between genes or proteins. Much information is available in the research papers found in PubMed, and advanced text-mining tools can be used to filter out the information.

Erroneous data

With an exploding amount of data submitted to the databases, there is an increased possibility of finding erroneous data. One emerging problem is that many computational prediction methods are trained on data extracted from the same databases whose sequences they are later used to annotate. In this way, they end up predicting based on their own predictions. Unfortunately, such errors are very hard to find and require a lot of labor and manual work.

Acknowledgements

We want to excuse the exclusion of many highly useful databases in this paper. Many researchers have used weeks, months or even years to build and create databases for specific usages. We cannot mention all of them but simply acknowledge that they exist, and more are coming as new methods are created which may yield new types of data. Many of these databases can be found in the semi annual database issue in the journal Nucleic Acid Research.

Other useful resources

Wikipedia on Biological Databases
http://en.wikipedia.org/wiki/Biological_database

Issues of Nucleic Acid Research

• Year 2008 Database issue: http://nar.oxfordjournals.org/content/vol36/suppl_1/index.dtl
• Year 2007 Database issue: http://nar.oxfordjournals.org/content/vol35/suppl_1/index.dtl
• Webserver issue: http://nar.oxfordjournals.org/content/vol34/suppl_2/index.dtl
Creative Commons License

All CLC bio's scientific articles are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License. You are free to copy, distribute, display, and use the work for educational purposes, under the following conditions: You must attribute the work in its original form, and "CLC bio" has to be clearly labeled as author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, nor build upon this work.

See http://creativecommons.org/licenses/by-nc-nd/2.5/ for more information on how to use the contents.

References

[Altschul et al., 1990] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215(3):403-410.

[Consortium, 2007] Consortium, T. U. (2007). The universal protein resource (UniProt). Nucleic Acids Res, 35(Database issue):D193-D197.

[Finn et al., 2006] Finn, R. D., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S. R., Sonnhammer, E. L. L., and Bateman, A. (2006). Pfam: clans, web tools and services. Nucleic Acids Res, 34(Database issue):D247-D251.

[Kulikova et al., 2007] Kulikova, T., Akhtar, R., Aldebert, P., Althorpe, N., Andersson, M., Baldwin, A., Bates, K., Bhattacharyya, S., Bower, L., Browne, P., Castro, M., Cochrane, G., Duggan, K., Eberhardt, R., Faruque, N., Hoad, G., Kanz, C., Lee, C., Leinonen, R., Lin, Q., Lombard, V., Lopez, R., Lorenc, D., McWilliam, H., Mukherjee, G., Nardone, F., Pastor, M. P. G., Plaister, S., Sobhany, S., Stoehr, P., Vaughan, R., Wu, D., Zhu, W., and Apweiler, R. (2007). EMBL nucleotide sequence database in 2006. Nucleic Acids Res, 35(Database issue):D16-D20.

[Mulder et al., 2007] Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Buillard, V., Cerutti, L., Copley, R., Courcelle, E., Das, U., Daugherty, L., Dibley, M., Finn, R., Fleischmann, W., Gough, J., Haft, D., Hulo, N., Hunter, S., Kahn, D., Kanapin, A., Kejariwal, A., Labarga, A., Langendijk-Genevaux, P. S., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Nikolskaya, A. N., Orchard, S., Orengo, C., Petryszak, R., Selengut, J. D., Sigrist, C. J. A., Thomas, P. D., Valentin, F., Wilson, D., Wu, C. H., and Yeats, C. (2007). New developments in the InterPro database. Nucleic Acids Res, 35(Database issue):D224-D228.

[Wheeler et al., 2006] Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Chetvernin, V., Church, D. M., DiCuccio, M., Edgar, R., Federhen, S., Geer, L. Y., Helmberg, W., Kapustin, Y., Kenton, D. L., Khovayko, O., Lipman, D. J., Madden, T. L., Maglott, D. R., Ostell, J., Pruitt, K. D., Schuler, G. D., Schriml, L. M., Sequeira, E., Sherry, S. T., Sirotkin, K., Souvorov, A., Starchenko, G., Suzek, T. O., Tatusov, R., Tatusova, T. A., Wagner, L., and Yaschenko, E. (2006). Database resources of the national center for biotechnology information. Nucleic Acids Res, 34(Database issue):D173-D180.