An Introduction to NCBI's Bioinformatics Resources
Dr. Medha Devare
Life Sciences/Bioinformatics Specialist
Albert R. Mann Library
Cornell University, Ithaca, NY 14853
USAIN 2006: Delivering Information for the New Life Sciences
October 7, 2006

Part I: Introduction to DNA Sequencing
Part II: Data Mining in Bioinformatics

CENTRAL DOGMA OF BIOLOGY
Courtesy: National Human Genome Research Institute

NUCLEOTIDES
Nucleotide = phosphate + pentose sugar + base
http://www.web-books.com/MoBio/Free/Ch3A.htm

PENTOSE SUGARS
http://www.web-books.com/MoBio/Free/Ch3A.htm

NITROGENOUS BASES
Purines: Adenine, Guanine
Pyrimidines: Cytosine, Thymine, Uracil (RNA only)
http://dl.clackamas.cc.or.us/ch106-09/nucleoti.htm

STRUCTURE OF DNA
Courtesy: National Human Genome Research Institute

DNA REPLICATION
http://www.ncc.gmu.edu/dna/repanim.htm

DNA SEQUENCING
http://www.dnalc.org/ddnalc/resources/cycseq.html

CLONING – PLASMID VECTOR
http://www.accessexcellence.org/RC/VL/GG/inserting.html

CLONING – identifying transformed cells
[Figure: plasmid vector map. Labels: DNA insert, AmpR, origin of replication]

VECTORS
Vector                Form                             Host     Carrying Capacity  Major Uses
Plasmid               Double-stranded circular DNA     E. coli  Up to 15 kb        cDNA libraries; subcloning
Bacteriophage lambda  Virus – linear DNA               E. coli  Up to 25 kb        Genomic and cDNA libraries
Cosmid                Double-stranded circular DNA     E. coli  30 – 45 kb         Genomic libraries
Bacteriophage P1      Virus – circular DNA             E. coli  70 – 90 kb         Genomic libraries
BAC                   Bacterial artificial chromosome  E. coli  100 – 500 kb       Genomic libraries
YAC                   Yeast artificial chromosome      Yeast    250 – 2000 kb      Genomic libraries

GENOME SEQUENCING
Genome sequencing:
http://www.pbs.org/wgbh/nova/genome/sequencer.html#
Whole genome shotgun sequencing: http://smcg.cifn.unam.mx/enp-unam/03-EstructuraDelGenoma/animaciones/humanShot.swf

What is bioinformatics?
Research, development or application of computational tools and approaches to expand the use, acquisition, visualization, analysis, organization and archiving of biological, medical, behavioral or health data. [Bioinformatics at the NIH, 2001]
http://grants.nih.gov/grants/bistic/bistic.cfm

Important databases in the public domain
• National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov)
• European Bioinformatics Institute (http://www.ebi.ac.uk/)
• European Molecular Biology Laboratory (http://www.embl.org)
• DNA Data Bank of Japan (http://www.ddbj.nig.ac.jp/Welcome.html)
• TIGR (http://www.tigr.org)

The National Center for Biotechnology Information (NCBI)
Bethesda
Created in 1988 (National Library of Medicine at NIH)
– Establish public databases
– Conduct research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
NCBI FieldGuide

NCBI database types
– Bibliographic
Citations for biomedical articles http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
Free archive of life sci. journals http://www.pubmedcentral.nih.gov/
From NCBI FieldGuide

NCBI database types
– Bibliographic
Books that can be searched online http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books&itool=toolbar
Human genes/genetic disorders http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
From NCBI FieldGuide

NCBI database types
– Sequence (nucleotide; protein)
– Taxonomy
– Genome
– Gene
– Expression
– Structure
http://www.ncbi.nlm.nih.gov
NCBI FieldGuide

Types of Sequence Databases
Primary Databases
– Contain raw and redundant data: original experimental sequences, submitted and "owned" by experimentalists
– Database staff review and organize the data but do not add to, modify or update the records
➢ Examples: GenBank, SNP, GEO
NCBI FieldGuide

Types of Sequence Databases
Derivative Databases
– Human-curated (data compilation and correction)
  ➢ Examples: LocusLink, OMIM & Literature databases
– Computationally-Derived (auto-partitioning GenBank seqs)
  ➢ Example: UniGene
– Combination
  ➢ Examples: RefSeq, Genome Assembly
NCBI FieldGuide

1° Sequence Database: GenBank
• Nucleotide-only sequence database
• Archival (>292,000 organisms)
Submission of GenBank Data to NCBI:
➢ Direct submissions of individual records via Web (BankIt, Sequin)
➢ Batch submissions of bulk sequences via e-mail (EST, dbGSS, dbSTS)
➢ FTP accounts for sequencing centers
NCBI FieldGuide

The International Sequence Database Collaboration
[Diagram: three partner databases, each receiving submissions and updates. Labels: GenBank (NCBI, NIH; Entrez), EMBL (EBI; SRS), DDBJ (CIB, NIG; getentry)]
NCBI FieldGuide

Check for cross-functionality of accession numbers
Accession no. AB062786
EBI: http://www.ebi.ac.uk
DDBJ: http://www.ddbj.nig.ac.jp/

Organization of GenBank: GenBank Divisions (gbdiv)
Records are divided into 18 divisions:
- 1 Patent
- 5 High Throughput
- 12 Traditional

High Throughput (bulk) divisions:
EST  Expressed Sequence Tag
GSS  Genome Survey Sequence
HTG  High Throughput Genomic
STS  Sequence Tagged Site
HTC  High Throughput cDNA

Traditional divisions:
PRI  Primate
PLN  Plant and Fungal
BCT  Bacterial and Archaeal
INV  Invertebrate
ROD  Rodent
VRL  Viral
VRT  Other Vertebrate
MAM  Mammalian (ex. ROD and PRI)
PHG  Phage
SYN  Synthetic (cloning vectors)
UNA  Unannotated
ENV  Environmental

Traditional Divisions:
• Direct Submissions (Sequin and BankIt)
• Accurate
• Well characterized
Bulk Divisions:
• Batch Submission (Email and FTP)
• Inaccurate
• Poorly characterized
NCBI FieldGuide

[Figure: annotated GenBank record. Callouts: Length; Division; mRNA = cDNA, DNA = genomic; Accession Number; Accession.Version; NCBI's Taxonomy; Feature Table; GenPept Protein ID]

Database searching: http://www.ncbi.nlm.nih.gov/
e.g. - pharmacogenetics
• Identifying novel targets for new drugs
  ➢ mapping and identifying genes associated w/ disease
  ➢ characterizing protein targets for new drugs
• Identifying genetic variants associated w/ adverse drug reactions
  ➢ e.g., cytochrome P450s = multigene family of enzymes (liver)
  ➢ genetically variable expression = variation in drug efficacy
Adapted from: Wolf et al., British Medical J., 320: 987-990

Potential consequences of polymorphic drug metabolism
• Extended pharmacological effect
• Adverse drug reactions
• Lack of pro-drug activation (e.g., codeine)
• Drug toxicity
• Increased effective dose
• Metabolism by alternate, deleterious pathways
• Exacerbated drug – drug interactions
Adapted from: Wolf et al., British Medical J., 320: 987-990

Common pharmacogenetic polymorphisms in human drug metabolizing enzymes (Weber, W.W. Pharmacogenetics. Oxford, 1997)
Gene    Metaboliser Phenotype  Frequency                      # of drugs  Examples
CYP2D6  Poor                   White 6%, African American 2%  >100        codeine, dextromethorphan
        Ultra-rapid            Ethiopian 20%, Spanish 7%
CYP2C9  Reduced                                               >60         Ibuprofen, warfarin
TPMT    Poor                   low in all populations         <10         6-mercaptopurine, 6-thioguanine

Example: Cytochrome P450 gene - CYP2D6
• CYP2D6 is highly polymorphic (inactive in ~6% of Caucasians)
  ➢ codes for debrisoquine hydroxylase
Adapted from: Wolf et al., British Medical J., 320: 987-990
http://www.ncbi.nlm.nih.gov/

Sequence/structure searching tools
[Diagram relating sequence and structure. Labels: Simple sequence search (BLAST), results; Profile-sequence search (HMMER), results; Structure-sequence search (threading), results; Homology modeling (MODELLER); Structure-structure search (CE)]
Slide courtesy of Pillardy, Ripoll, and Sun (CBSU)

Tool comparison
                        BLAST            HMM       Threading
Sensitivity:            Least sensitive            Most sensitive
Speed:                  Seconds          Minutes   Hours
DB size:                1 x 10^6         1 x 10^6  18000 (PDB)
Result interpretation:  Relatively easy            Some expertise required
Slide courtesy of Pillardy, Ripoll, and Sun (CBSU)

Sequence similarity searching
Why do it?
• identify and annotate sequences with no, incomplete, or incorrect annotations (GenBank)
• infer functionality for genes/proteins
• find conserved domains
• assemble genomes; clean up sequences (e.g., suspected cloning vector sequences)
• explore evolutionary relationships
NOTE: Similar sequences may NOT be homologous!

Basic Local Alignment Search Tool (BLAST)
Calculates similarity for biological sequences
Finds best local alignments
Searches for matching "words" rather than individual residues
Uses statistical theory to determine if a match might have occurred by chance
NCBI FieldGuide

Sequence Alignment
Global alignment: compare sequences over entire length (dynamic programming, e.g., Needleman-Wunsch)
-- identify long insertions/deletions
-- check data quality
Local alignment: compare segments of sequences (heuristic: BLAST, FASTA; dynamic programming: Smith-Waterman)
-- speed
-- high quality alignments
Dot plot: exploration of two entire sequences for similarity
-- repeat discovery
-- identify long insertions/deletions

Basic Local Alignment Search Tool (BLAST)
http://www.ncbi.nlm.nih.gov/BLAST/

Sample BLAST 1
Sample BLAST 2: "cytochrome AND Archaea"

What is BLAST?
AATTGGCTAGCTAA | || ||||||| ...AAAAATGCAAAATGCGGGTAGCTTATTCTAGAAGATT...
Matches: 10 Mismatches: 4
Similarity score based on matches, mismatches, gaps

BLAST: Substitution Matrix and Gap Cost
Query Length  Substitution Matrix  Gap cost
<35           PAM-30*              (9, 1)
35-50         PAM-70               (10, 1)
50-85         BLOSUM-80            (10, 1)
>85           BLOSUM-62            (11, 1)
*PAM = Percent Accepted Mutation; 1 PAM unit = 1% of aa in protein changed
• BLOSUM-62 generally performs better than PAM
• PAM better if looking for distant relationships
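As a rough illustration, the query-length table on this slide can be written as a small lookup. This is only a sketch: the function name suggest_scoring is hypothetical, the handling of the shared boundaries (50 and 85 each appear in two rows) is an assumption, and reading each gap cost pair as (existence, extension) is also an assumption that the slide does not state.

```python
def suggest_scoring(query_length):
    """Return (matrix name, gap cost pair) for a protein query length.

    Mirrors the table on the slide. Boundary handling is an assumption:
    a length of exactly 50 is treated as PAM-70 and exactly 85 as
    BLOSUM-80, because the last row is written as ">85".
    """
    if query_length < 35:
        return "PAM-30", (9, 1)
    if query_length <= 50:
        return "PAM-70", (10, 1)
    if query_length <= 85:
        return "BLOSUM-80", (10, 1)
    return "BLOSUM-62", (11, 1)


if __name__ == "__main__":
    # Hypothetical query lengths, chosen to hit each row of the table.
    for length in (22, 40, 70, 300):
        matrix, gap_cost = suggest_scoring(length)
        print(length, matrix, gap_cost)
```

A short peptide therefore falls in the PAM-30 row, while a typical full-length protein falls in the BLOSUM-62 row.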
Modified from Pillardy, Ripoll, and Sun (CBSU)

NCBI BLAST
• matrix used to create look-up tables of neighborhood words
• seeks pairs of similar segments whose score exceeds threshold (HSPs)
  ➢ T < 13 not reported
  ➢ locates "seeds" of similarity along query
  ➢ extends seeds in both directions until max. possible score reached

Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Word size = 3
Word size 2 or 3 (default = 3)
W = 2; T = 16
W = 3; T = 32
Words: GTQ, TQI, QIT, ITV, TVE, VED, EDL…
Neighborhood Words (for ITV): LTV, MTV, ISV, LSV, etc.
Make a lookup table of words
NCBI FieldGuide

Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAA
Word size = 11
Minimum word size = 7
blastn default = 11
megablast default = 28
Words: GTACTGGACAT, TACTGGACATG, ACTGGACATGG, CTGGACATGGA, TGGACATGGAC, GGACATGGACC, GACATGGACCC…
Make a lookup table of words
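The word lists on the Protein Words and Nucleotide Words slides can be reproduced with a few lines of Python. This is a minimal sketch, not BLAST's actual implementation: the function names are hypothetical, only exact word matches are looked up, and neighborhood words (which need a substitution matrix and the threshold T) are left out.

```python
from collections import defaultdict


def words(seq, w):
    """All overlapping words of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def build_lookup(query, w):
    """Map each query word to the list of query positions where it starts."""
    table = defaultdict(list)
    for pos, word in enumerate(words(query, w)):
        table[word].append(pos)
    return dict(table)


def find_seeds(table, subject, w):
    """Scan the subject and return (query_pos, subject_pos) for exact word hits."""
    hits = []
    for s_pos in range(len(subject) - w + 1):
        for q_pos in table.get(subject[s_pos:s_pos + w], []):
            hits.append((q_pos, s_pos))
    return hits


# Queries taken from the slides.
nuc_query = "GTACTGGACATGGACCCTACAGGAA"
prot_query = "GTQITVEDLFYNIATRRKALKN"

if __name__ == "__main__":
    print(words(nuc_query, 11)[:7])   # nucleotide words, word size 11
    print(words(prot_query, 3)[:7])   # protein words, word size 3
    table = build_lookup(nuc_query, 11)
    # Hypothetical subject: the query with two extra bases in front.
    print(find_seeds(table, "CC" + nuc_query, 11)[:3])
```

Each hit returned by find_seeds is a candidate "seed" that a BLAST-like program would then try to extend in both directions.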
NCBI FieldGuide

BLAST: Bit Score
Bit Score (S'): normalized raw score (S); allows direct comparison of searches from diverse dbs

S' = (λS - ln K)/ln 2

S = raw score (sum of scores in substitution)
K = variable; value dependent on matrix used
λ = parameter used as natural scale for scoring system

BLAST Statistics: E-value
E-value (E): measure of statistical significance; the number of matches scoring at least S that are expected by chance
e.g., E = 0.01: roughly a 1% chance that the match is due to a random event; dependent on db size

E = K m n e^(-λS)
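The bit score and E-value formulas can be checked numerically. The sketch below assumes illustrative values for λ and K (roughly those often quoted for BLOSUM-62 with gap costs (11, 1)) and a made-up raw score, query length and db size; none of these numbers come from the slides. It also uses the equivalent form E = m n 2^(-S'), which follows from substituting the bit score definition into the E-value formula.

```python
import math

# Illustrative parameters only (assumptions, not taken from the slides).
LAM = 0.267   # λ: natural scale of the scoring system
K = 0.041     # K: depends on the matrix and gap costs used


def bit_score(raw_score, lam=LAM, k=K):
    """S' = (λS - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam=LAM, k=K):
    """E = K m n e^(-λS), with m = query length and n = db size."""
    return k * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """Equivalent form in terms of the bit score: E = m n 2^(-S')."""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    raw, m, n = 100, 250, 1_000_000   # hypothetical score, query length, db size
    print("bit score:", bit_score(raw))
    print("E-value:", e_value(raw, m, n))
    print("E-value from bits:", e_value_from_bits(bit_score(raw), m, n))
    # Doubling the db size doubles E for the same raw score.
    print("E-value, db twice as large:", e_value(raw, m, 2 * n))
```

Because E scales with m and n, the same alignment becomes less significant as the database grows, which is the sense in which the E-value depends on db size.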
K = variable; value dependent on matrix used
m = length of query (nucleotide or aa)
n = size of db
λ = parameter used as natural scale for scoring system
S = raw score (sum of scores in substitution)

Tools for 3-D Structure Display and Searching
Cn3D: 3-D structure and sequence alignment viewer
-- NCBI "Structure" db
Domain Architecture Retrieval Tool (DART):
-- displays functional domains that make up a protein
-- lists proteins with similar domain architectures
Vector Alignment Search Tool (VAST):
-- structure-structure similarity search program
Threading: algorithms for recognition of protein folding
http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi?itool=toolbar

Thank you!
Dr. Medha Devare
Life Sciences/Bioinformatics Specialist
Albert R. Mann Library
Cornell University, Ithaca, NY 14853

Exercise: BLASTp
Exercise: BLASTp continued
Top hit:
Multiple sequence alignments for pfam05724.4
Related Structure:

Please take a 5-min. survey before you leave!
http://www.mannlib.cornell.edu/
Reference and Instruction > Library Instruction Workshops > Web survey

BLAST: all
BLAST - bacteria