Identify Audiences, Especially in the Fast-Moving Field of Bioinformatics

An Introduction to NCBI's Bioinformatics Resources

Dr. Medha Devare
[email protected]
Life Sciences/Bioinformatics Specialist
Albert R. Mann Library
Cornell University, Ithaca, NY 14853

USAIN 2006: Delivering Information for the New Life Sciences
October 7, 2006

Part I: Introduction to DNA Sequencing
Part II: Data Mining in Bioinformatics

CENTRAL DOGMA OF BIOLOGY
Courtesy: National Human Genome Research Institute

NUCLEOTIDES
Nucleotide = phosphate + pentose sugar + base
http://www.web-books.com/MoBio/Free/Ch3A.htm

PENTOSE SUGARS
http://www.web-books.com/MoBio/Free/Ch3A.htm

NITROGENOUS BASES
Purines: Adenine, Guanine
Pyrimidines: Cytosine, Thymine, Uracil (RNA only)
http://dl.clackamas.cc.or.us/ch106-09/nucleoti.htm

STRUCTURE OF DNA
Courtesy: National Human Genome Research Institute

DNA REPLICATION
http://www.ncc.gmu.edu/dna/repanim.htm

DNA SEQUENCING
http://www.dnalc.org/ddnalc/resources/cycseq.html

CLONING – PLASMID VECTOR
http://www.accessexcellence.org/RC/VL/GG/inserting.html

CLONING – identifying transformed cells
[Figure labels: DNA insert, AmpR, origin of replication]

VECTORS
Vector                  Form                              Host     Carrying capacity  Major uses
Plasmid                 Double-stranded circular DNA      E. coli  Up to 15 kb        cDNA libraries; subcloning
Bacteriophage lambda    Virus – linear DNA                E. coli  Up to 25 kb        Genomic and cDNA libraries
Cosmid                  Double-stranded circular DNA      E. coli  30 – 45 kb         Genomic libraries
Bacteriophage P1        Virus – circular DNA              E. coli  70 – 90 kb         Genomic libraries
BAC                     Bacterial artificial chromosome   E. coli  100 – 500 kb       Genomic libraries
YAC                     Yeast artificial chromosome       Yeast    250 – 2000 kb      Genomic libraries

GENOME SEQUENCING
Genome sequencing: http://www.pbs.org/wgbh/nova/genome/sequencer.html#
Whole genome shotgun sequencing: http://smcg.cifn.unam.mx/enp-unam/03-EstructuraDelGenoma/animaciones/humanShot.swf

What is bioinformatics?
Research, development or application of computational tools and approaches to expand the use, acquisition, visualization, analysis, organization and archiving of biological, medical, behavioral or health data. [Bioinformatics at the NIH, 2001]
http://grants.nih.gov/grants/bistic/bistic.cfm

Important databases in the public domain
• National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov)
• European Bioinformatics Institute (http://www.ebi.ac.uk/)
• European Molecular Biology Laboratory (http://www.embl.org)
• DNA Data Bank of Japan (http://www.ddbj.nig.ac.jp/Welcome.html)
• TIGR (http://www.tigr.org)

The National Center for Biotechnology Information (NCBI)
Bethesda
Created in 1988 (National Library of Medicine at NIH)
– Establish public databases
– Conduct research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
NCBI FieldGuide

NCBI database types – Bibliographic
Citations for biomedical articles: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
Free archive of life sciences journals: http://www.pubmedcentral.nih.gov/
Books that can be searched online: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books&itool=toolbar
Human genes/genetic disorders: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

NCBI database types
– Sequence (nucleotide; protein)
– Taxonomy
– Genome
– Gene
– Expression
– Structure
http://www.ncbi.nlm.nih.gov

Types of Sequence Databases
Primary Databases
– Contain raw and redundant data: original experimental sequences, submitted and "owned" by experimentalists
– Database staff review and organize the data: they don't add, modify or update the records
  Examples: GenBank, SNP, GEO

Types of Sequence Databases
Derivative Databases
– Human-curated (data compilation and correction)
  Examples: LocusLink, OMIM & Literature databases
– Computationally derived (auto-partitioning GenBank seqs)
  Example: UniGene
– Combination
  Examples: RefSeq, Genome Assembly

1° Sequence Database: GenBank
• Nucleotide-only sequence database
• Archival (>292,000 organisms)
Submission of GenBank Data to NCBI:
– Direct submissions of individual records via Web (BankIt, Sequin)
– Batch submissions of bulk sequences via e-mail (EST, dbGSS, dbSTS)
– FTP accounts for sequencing centers

The International Sequence Database Collaboration
[Diagram: GenBank (NCBI, NIH; searched via Entrez), EMBL (EBI; searched via SRS) and DDBJ (CIB, NIG; searched via getentry). Each database receives submissions and updates.]

Check for cross-functionality of accession numbers
Accession no. AB062786
EBI: http://www.ebi.ac.uk
DDBJ: http://www.ddbj.nig.ac.jp/

Organization of GenBank: GenBank Divisions (gbdiv)
Records are divided into 18 divisions:
– 1 Patent
– 5 High Throughput
– 12 Traditional

High Throughput (bulk) divisions:
EST   Expressed Sequence Tag
GSS   Genome Survey Sequence
HTG   High Throughput Genomic
STS   Sequence Tagged Site
HTC   High Throughput cDNA

Traditional divisions:
PRI   Primate
PLN   Plant and Fungal
BCT   Bacterial and Archaeal
INV   Invertebrate
ROD   Rodent
VRL   Viral
VRT   Other Vertebrate
MAM   Mammalian (ex. ROD and PRI)
PHG   Phage
SYN   Synthetic (cloning vectors)
UNA   Unannotated
ENV   Environmental

Traditional Divisions:
• Direct Submissions (Sequin and BankIt)
• Accurate
• Well characterized

Bulk Divisions:
• Batch Submission (Email and FTP)
• Inaccurate
• Poorly characterized

[Annotated GenBank record. Labels: Length; molecule type (mRNA = cDNA, DNA = genomic); Division; Accession Number; Accession.Version; NCBI's Taxonomy; Feature Table; GenPept Protein ID]

Database searching: http://www.ncbi.nlm.nih.gov/
e.g. pharmacogenetics
• Identifying novel targets for new drugs
  – mapping and identifying genes associated with disease
  – characterizing protein targets for new drugs
• Identifying genetic variants associated with adverse drug reactions
  – e.g., cytochrome P450s = multigene family of enzymes (liver)
  – genetically variable expression = variation in drug efficacy
Adapted from: Wolf et al., British Medical J., 320: 987-990

Potential consequences of polymorphic drug metabolism
• Extended pharmacological effect
• Adverse drug reactions
• Lack of pro-drug activation (e.g., codeine)
• Drug toxicity
• Increased effective dose
• Metabolism by alternate, deleterious pathways
• Exacerbated drug – drug interactions
Adapted from: Wolf et al., British Medical J., 320: 987-990

Common pharmacogenetic polymorphisms in human drug metabolizing enzymes (Weber, W.W. Pharmacogenetics. Oxford, 1997)
Gene    Metaboliser phenotype  Frequency                       # of drugs  Examples
CYP2D6  Poor                   White 6%, African American 2%   >100        codeine, dextromethorphan
        Ultra-rapid            Ethiopian 20%, Spanish 7%
CYP2C9  Reduced                                                >60         ibuprofen, warfarin
TPMT    Poor                   low in all populations          <10         6-mercaptopurine, 6-thioguanine

Example: Cytochrome P450 gene - CYP2D6
• CYP2D6 is highly polymorphic (inactive in ~6% of Caucasians)
  – codes for debrisoquine hydroxylase
Adapted from: Wolf et al., British Medical J., 320: 987-990
http://www.ncbi.nlm.nih.gov/

Sequence/structure searching tools
Starting from a sequence:
– Simple sequence search (BLAST): results
– Profile-sequence search (HMMER): results
– Structure-sequence search (threading): results
– Homology modeling (MODELLER): structure
Starting from a structure:
– Structure-structure search (CE)
Slide courtesy of Pillardy, Ripoll, and Sun (CBSU)

Tool comparison
Sensitivity: least sensitive (BLAST) to most sensitive (threading)
Speed: seconds (BLAST), minutes (HMM), hours (threading)
DB size: 1 x 10^6 (BLAST), 1 x 10^6 (HMM), 18,000 (threading; PDB)
Result interpretation: relatively easy (BLAST); some expertise required
Slide courtesy of Pillardy, Ripoll, and Sun (CBSU)

Sequence similarity searching
Why do it?
• identify and annotate sequences with no, incomplete, or incorrect annotations (GenBank)
• infer functionality for genes/proteins
• find conserved domains
• assemble genomes; clean up sequences (e.g., suspected cloning vector sequences)
• explore evolutionary relationships
NOTE: Similar sequences may
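
To make "sequence similarity" concrete, here is a minimal sketch in Python. It is not how BLAST works: BLAST uses heuristics and scored, gapped alignments, while this toy version only slides a short query along a longer sequence and reports the ungapped window with the highest percent identity. The function names and the example sequences are hypothetical and are not taken from the slides.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of positions that match in two equal-length sequences (no gaps)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (gaps are not handled)")
    if not a:
        return 0.0
    # Compare position by position, ignoring case.
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def best_ungapped_hit(query: str, subject: str) -> tuple[int, float]:
    """Slide query along subject; return (offset, percent identity) of the best window.

    Illustrative brute-force scan only; real search tools are far faster
    and use substitution scores and gaps.
    """
    if len(query) > len(subject):
        raise ValueError("query must not be longer than subject")
    best_offset, best_score = 0, -1.0
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        score = percent_identity(query, window)
        if score > best_score:  # keep the first window with the highest score
            best_offset, best_score = offset, score
    return best_offset, best_score


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    print(percent_identity("ACGT", "ACGA"))
    print(best_ungapped_hit("GATT", "CCGATTACA"))
```

A scan like this takes time proportional to query length times subject length, which is impractical against databases holding on the order of 10^6 sequences, so a dedicated search tool such as BLAST is used in practice.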
