An Introduction to NCBI’s Bioinformatics Resources

Dr. Medha Devare [email protected]

Life Sciences/Bioinformatics Specialist Albert R. Mann Library Cornell University, Ithaca, NY 14853

USAIN 2006: Delivering Information for the New Life Sciences
October 7, 2006

Part I: Introduction to DNA Sequencing

Part II: Data Mining in

CENTRAL DOGMA OF BIOLOGY

Courtesy: National Human Genome Research Institute

NUCLEOTIDES

Nucleotide = phosphate + pentose sugar + base

http://www.web-books.com/MoBio/Free/Ch3A.htm

PENTOSE SUGARS

http://www.web-books.com/MoBio/Free/Ch3A.htm

NITROGENOUS BASES

Purines: Adenine, Guanine

Pyrimidines: Cytosine, Thymine, Uracil (RNA only)

http://dl.clackamas.cc.or.us/ch106-09/nucleoti.htm

STRUCTURE OF DNA

Courtesy: National Human Genome Research Institute

DNA REPLICATION

http://www.ncc.gmu.edu/dna/repanim.htm

DNA SEQUENCING

http://www.dnalc.org/ddnalc/resources/cycseq.html

CLONING – PLASMID VECTOR

http://www.accessexcellence.org/RC/VL/GG/inserting.html

CLONING – identifying transformed cells

[Figure labels: DNA insert; AmpR; origin of replication]

VECTORS

Vector                Form                             Host     Carrying capacity  Major uses
Plasmid               Double-stranded circular DNA     E. coli  Up to 15 kb        cDNA libraries; subcloning
Bacteriophage lambda  Virus – linear DNA               E. coli  Up to 25 kb        Genomic and cDNA libraries
Cosmid                Double-stranded circular DNA     E. coli  30 – 45 kb         Genomic libraries
Bacteriophage P1      Virus – circular DNA             E. coli  70 – 90 kb         Genomic libraries
BAC                   Bacterial artificial chromosome  E. coli  100 – 500 kb       Genomic libraries
YAC                   Yeast artificial chromosome      Yeast    250 – 2000 kb      Genomic libraries
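As a rough illustration of how the carrying capacities in the table constrain vector choice, the sketch below looks up which vectors could hold an insert of a given size. The function name `vectors_for_insert` and the dictionary are hypothetical, and "Up to N kb" is assumed to mean a range starting at 0 kb; real cloning decisions depend on much more than insert size.

```python
# Carrying capacities in kb, copied from the vector table.
# Assumption: "Up to N kb" is treated as the range 0..N.
VECTOR_CAPACITY_KB = {
    "Plasmid": (0, 15),
    "Bacteriophage lambda": (0, 25),
    "Cosmid": (30, 45),
    "Bacteriophage P1": (70, 90),
    "BAC": (100, 500),
    "YAC": (250, 2000),
}


def vectors_for_insert(insert_kb):
    """Return the vectors whose listed capacity range includes insert_kb."""
    return [
        name
        for name, (low, high) in VECTOR_CAPACITY_KB.items()
        if low <= insert_kb <= high
    ]


if __name__ == "__main__":
    for size in (10, 40, 300):
        print(size, "kb ->", vectors_for_insert(size))
```

Sizes that fall between the listed ranges (for example 60 kb) return an empty list, which simply reflects the gaps in the table.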

GENOME SEQUENCING

Genome sequencing:

http://www.pbs.org/wgbh/nova/genome/sequencer.html#

Whole genome shotgun sequencing: http://smcg.cifn.unam.mx/enp-unam/03-EstructuraDelGenoma/animaciones/humanShot.swf

What is bioinformatics?

Research, development or application of computational tools and approaches to expand the use, acquisition, visualization, analysis, organization and archiving of biological, medical, behavioral or health data. [Bioinformatics at the NIH, 2001] http://grants.nih.gov/grants/bistic/bistic.cfm

Important databases in the public domain

• National Center for Biotechnology Information (NCBI) http://www.ncbi.nlm.nih.gov

• European Bioinformatics Institute (http://www.ebi.ac.uk/)

• European Molecular Biology Laboratory (http://www.embl.org)

• DNA Data Bank of Japan (http://www.ddbj.nig.ac.jp/Welcome.html)

• TIGR (http://www.tigr.org)

The National Center for Biotechnology Information (NCBI)

Bethesda

Created in 1988 (National Library of Medicine at NIH)

– Establish public databases
– Conduct research in
– Develop software tools for sequence analysis
– Disseminate biomedical information

NCBI FieldGuide

NCBI database types

– Bibliographic

Citations for biomedical articles http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed

Free archive of life sci. journals http://www.pubmedcentral.nih.gov/


Books that can be searched online http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books&itool=toolbar

Human genes/genetic disorders http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

– Sequence (nucleotide; protein)
– Taxonomy
– Genome
– Gene
– Expression
– Structure

http://www.ncbi.nlm.nih.gov

Types of Sequence Databases

Primary Databases

– Contain raw and redundant data: original experimental sequences, submitted and “owned” by experimentalists

– Database staff review and organize the data: don’t add, modify or update the records

   Examples: GenBank, SNP, GEO

Derivative Databases

– Human-curated (data compilation and correction)
   Examples: LocusLink, OMIM & Literature databases

– Computationally-Derived (auto-partitioning GenBank seqs)
   Example: UniGene

– Combination
   Examples: RefSeq, Genome Assembly

1° Sequence Database: GenBank

• Nucleotide-only sequence database

• Archival (>292,000 organisms)

Submission of GenBank Data to NCBI:
   – Direct submissions of individual records via Web (BankIt, Sequin)
   – Batch submissions of bulk sequences via e-mail (EST, dbGSS, dbSTS)
   – FTP accounts for sequencing centers

The International Sequence Database Collaboration

[Diagram: three partner databases, each accepting submissions and updates: GenBank (NCBI, NIH; searched with Entrez), EMBL (EBI; searched with SRS) and DDBJ (CIB, NIG; searched with getentry)]

Check for cross-functionality of accession numbers

Accession no. AB062786
EBI: http://www.ebi.ac.uk
DDBJ: http://www.ddbj.nig.ac.jp/

Organization of GenBank: GenBank Divisions (gbdiv)

Records are divided into 18 divisions:
- 1 Patent
- 5 High Throughput
- 12 Traditional

Traditional Divisions:
• Direct Submissions (Sequin and BankIt)
• Accurate
• Well characterized

PRI  Primate
PLN  Plant and Fungal
BCT  Bacterial and Archaeal
INV  Invertebrate
ROD  Rodent
VRL  Viral
VRT  Other Vertebrate
MAM  Mammalian (ex. ROD and PRI)
PHG  Phage
SYN  Synthetic (cloning vectors)
UNA  Unannotated
ENV  Environmental

Bulk Divisions:
• Batch Submission (Email and FTP)
• Inaccurate
• Poorly characterized

EST  Expressed Sequence Tag
GSS  Genome Survey Sequence
HTG  High Throughput Genomic
STS  Sequence Tagged Site
HTC  High Throughput cDNA

[GenBank record screenshot; labeled fields: Length; Division; mRNA = cDNA, DNA = genomic; Accession Number; Accession.Version; NCBI’s Taxonomy; Feature Table; GenPept Protein ID]

Database searching: http://www.ncbi.nlm.nih.gov/

e.g. - pharmacogenetics

• Identifying novel targets for new drugs
   – mapping and identifying genes associated w/ disease
   – characterizing protein targets for new drugs

• Identifying genetic variants associated w/ adverse drug reactions
   – e.g., cytochrome P450s = multigene family of enzymes (liver)
   – genetically variable expression = variation in drug efficacy

Adapted from: Wolf et al., British Medical J., 320: 987-990

Potential consequences of polymorphic drug metabolism

• Extended pharmacological effect

• Adverse drug reactions

• Lack of pro-drug activation (e.g., codeine)

• Drug toxicity

• Increased effective dose

• Metabolism by alternate, deleterious pathways

• Exacerbated drug – drug interactions

Adapted from: Wolf et al., British Medical J., 320: 987-990

Common pharmacogenetic polymorphisms in human drug metabolizing enzymes (Weber, W.W. Pharmacogenetics. Oxford, 1997)

Gene    Metaboliser phenotype  Frequency                      # of drugs  Examples
CYP2D6  Poor                   White 6%, African American 2%  >100        codeine, dextromethorphan
        Ultra-rapid            Ethiopian 20%, Spanish 7%
CYP2C9  Reduced                                               >60         Ibuprofen, warfarin
TPMT    Poor                   low in all populations         <10         6-mercaptopurine, 6-thioguanine

Example: Cytochrome P450 gene - CYP2D6

• CYP2D6 is highly polymorphic (inactive in ~ 6% of Caucasians)
   – codes for debrisoquine hydroxylase

Adapted from: Wolf et al., British Medical J., 320: 987-990

http://www.ncbi.nlm.nih.gov/

Sequence/structure searching tools

[Diagram: search tools arranged from sequence-based to structure-based]
– Simple sequence search (BLAST)
– Profile-sequence search (HMMER)
– Structure-sequence search (threading)
– Homology modeling (MODELLER)
– Structure-structure search (CE)

Slide courtesy of Pillardy, Ripoll, and Sun (CBSU)

Tool comparison

                        BLAST            HMM       Threading
Sensitivity:            Least sensitive            Most sensitive
Speed:                  Seconds          Minutes   Hours
DB size:                1 x 10^6         1 x 10^6  18000 (PDB)
Result interpretation:  Relatively easy            Some expertise required

Slide courtesy of Pillardy, Ripoll, and Sun (CBSU)

Sequence similarity searching

Why do it?

• identify and annotate sequences with no, incomplete, or incorrect annotations (GenBank)

• infer functionality for genes/proteins

• find conserved domains

• assemble ; clean up sequences (e.g., suspected cloning vector sequences)

• explore evolutionary relationships

NOTE: Similar sequences may NOT be homologous!

Basic Local Alignment Search Tool (BLAST)

• Calculates similarity for biological sequences

• Finds best local alignments

• Searches for matching “words” rather than individual residues

• Uses statistical theory to determine if a match might have occurred by chance
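The bullets above can be sketched as a toy seed-and-extend search: exact word matches act as seeds, and each seed is extended without gaps while the score keeps improving. This is a simplified sketch, not NCBI's implementation; the function names, the +1/-3 match/mismatch scores and the drop-off cutoff `x_drop` are assumptions chosen for illustration, and no statistics are computed.

```python
def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where an exact word matches."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size,
                match=1, mismatch=-3, x_drop=5):
    """Extend one seed in both directions without gaps.

    Returns (best_score, query_start, query_end), query_end exclusive.
    Extension stops once the running score falls x_drop below the best
    (an assumed stopping rule for this sketch).
    """
    score = best = word_size * match
    # Extend to the right of the seed.
    q, s = i + word_size, j + word_size
    best_end = q
    while q < len(query) and s < len(subject) and best - score < x_drop:
        score += match if query[q] == subject[s] else mismatch
        q += 1
        s += 1
        if score > best:
            best, best_end = score, q
    # Extend to the left, starting from the best right-hand extension.
    score = best
    q, s = i, j
    best_start = q
    while q > 0 and s > 0 and best - score < x_drop:
        q -= 1
        s -= 1
        score += match if query[q] == subject[s] else mismatch
        if score > best:
            best, best_start = score, q
    return best, best_start, best_end


def best_hsp(query, subject, word_size=4):
    """Highest-scoring ungapped segment found from any seed, or None."""
    hits = [extend_seed(query, subject, i, j, word_size)
            for i, j in find_seeds(query, subject, word_size)]
    return max(hits) if hits else None


if __name__ == "__main__":
    print(best_hsp("GATTACAGG", "CCCGATTACATTT"))
```

Searching for words first means only diagonals that already share an exact match are examined, which is where the speed of this family of methods comes from.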

Sequence Alignment

Global alignment: compare sequences over entire length (dynamic programming – e.g., Needleman-Wunsch)
   -- identify long insertions/deletions
   -- check data quality

Local alignment: compare segments of sequences (heuristic -- BLAST, FASTA; dynamic programming -- Smith-Waterman)
   -- speed (heuristic methods)
   -- high quality alignments
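A minimal dynamic-programming sketch of the two alignment modes: with `local=False` it fills a Needleman-Wunsch table and reports the score of the end-to-end alignment; with `local=True` it applies the Smith-Waterman rule (scores never drop below zero) and reports the best-scoring segment pair. The scoring values (+1 match, -1 mismatch, -2 per gap position) are assumptions for illustration only.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of sequences a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch).
    local=True:  local alignment (Smith-Waterman).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                       score[i - 1][j] + gap,        # gap in b
                       score[i][j - 1] + gap)        # gap in a
            if local:
                cell = max(cell, 0)                  # a local alignment may restart
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


if __name__ == "__main__":
    subject, query = "CCCGATTACATTT", "GATTACAGG"
    print("global:", alignment_score(subject, query))
    print("local: ", alignment_score(subject, query, local=True))
```

For sequences of different lengths that share only a segment, the local score stays high while the global score is pulled down by the unavoidable gaps.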

Dot plot: exploration of two entire sequences for similarity
   -- repeat discovery
   -- identify long insertions/deletions

Basic Local Alignment Search Tool (BLAST)

http://www.ncbi.nlm.nih.gov/BLAST/

Sample BLAST 1

Sample BLAST 2: “cytochrome AND Archaea”

What is BLAST?

AATTGGCTAGCTAA | || ||||||| ...AAAAATGCAAAATGCGGGTAGCTTATTCTAGAAGATT...

Matches: 10 Mismatches: 4

Similarity score based on matches, mis-matches, gaps

BLAST: Substitution Matrix and Gap Cost

Query Length  Substitution Matrix  Gap cost
<35           PAM-30*              (9, 1)
35-50         PAM-70               (10, 1)
50-85         BLOSUM-80            (10, 1)
>85           BLOSUM-62            (11, 1)

*PAM = Percent Accepted Mutation; 1 PAM unit = 1% of aa in protein changed

• BLOSUM-62 generally performs better than PAM

• PAM better if looking for distant relationships

Modified from Pillardy, Ripoll, and Sun (CBSU)

NCBI BLAST

• matrix used to create look-up tables of neighborhood words

• seeks pairs of similar segments whose score exceeds threshold (HSPs)
   – T < 13 not reported
   – locates “seeds” of similarity along query
   – extends seeds in both directions until max. possible score reached

Protein Words

Query: GTQITVEDLFYNIATRRKALKN

Word size = 3: GTQ, TQI, QIT, ITV, TVE, VED, EDL…
Word size 2 or 3 (default = 3)
W = 2; T = 16
W = 3; T = 32
Neighborhood words (for ITV): LTV, MTV, ISV, LSV, etc.
Make a lookup table of words

Nucleotide Words

Query: GTACTGGACATGGACCCTACAGGAA

Word size = 11: GTACTGGACAT, TACTGGACATG, ACTGGACATGG, CTGGACATGGA, TGGACATGGAC, GGACATGGACC, GACATGGACCC…
Minimum word size = 7
blastn default = 11
megablast default = 28
Make a lookup table of words
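The word lists on the protein and nucleotide slides can be reproduced with a few lines of Python. This sketch only enumerates the exact words of the query and records where each one starts; generating neighborhood words for proteins would additionally need a substitution matrix and threshold, which are not shown. The function names are hypothetical.

```python
from collections import defaultdict


def words(sequence, word_size):
    """List every overlapping word of length word_size in sequence."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


def lookup_table(sequence, word_size):
    """Map each word to the 0-based query positions where it starts."""
    table = defaultdict(list)
    for position, word in enumerate(words(sequence, word_size)):
        table[word].append(position)
    return dict(table)


if __name__ == "__main__":
    protein = "GTQITVEDLFYNIATRRKALKN"
    nucleotide = "GTACTGGACATGGACCCTACAGGAA"
    # ['GTQ', 'TQI', 'QIT', 'ITV', 'TVE', 'VED', 'EDL']
    print(words(protein, 3)[:7])
    # ['GTACTGGACAT', 'TACTGGACATG', 'ACTGGACATGG']
    print(words(nucleotide, 11)[:3])
    print(lookup_table(nucleotide, 11)["GTACTGGACAT"])
```

A query of length L yields L - W + 1 words of size W, so the 25-base nucleotide query above gives 15 words of size 11.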

BLAST: Bit Score

Bit Score (S'): normalized raw score (S), allows direct comparison of searches from diverse dbs

$S' = (\lambda S - \ln K) / \ln 2$

S = raw score (sum of scores in substitution)
K = variable; value dependent on matrix used
λ = parameter used as natural scale for scoring system

BLAST Statistics: E-value

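E-value (E): measure of statistical significance; the number of matches scoring at least S that are expected by chance in a search of this size, so it depends on db size. E.g., E = 0.01 corresponds to roughly a 1% chance that a match this good is due to a random event.

$E = K m n\, e^{-\lambda S}$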
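To make the two statistics concrete, the sketch below computes the bit score S' = (λS − ln K)/ln 2 and the E-value E = K·m·n·e^(−λS), and checks the identity E = m·n·2^(−S') that follows from combining them. The values λ = 0.267 and K = 0.041 are illustrative assumptions (of the size typically quoted for BLOSUM-62 with gap costs (11, 1)), as are the raw score, query length and database size.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def e_value_from_bits(bits, query_len, db_len):
    """Same E-value written in terms of the bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    lam, k = 0.267, 0.041        # assumed scoring-system parameters
    m, n = 250, 1_000_000_000    # assumed query length and database size
    for raw in (50, 100, 150):
        bits = bit_score(raw, lam, k)
        print(raw, round(bits, 1), e_value(raw, m, n, lam, k))
```

Because the bit score absorbs λ and K, two searches run with different matrices can be compared on S' directly, while E still has to be recomputed for each query length and database size.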

K = variable; value dependent on matrix used
m = length of query (nucleotide or aa)
n = size of db
λ = parameter used as natural scale for scoring system
S = raw score (sum of scores in substitution)

Tools for 3-D Structure Display and Searching

Cn3D: 3-D structure and sequence alignment viewer
   -- NCBI “Structure” db

Domain Architecture Retrieval Tool (DART):
   -- displays functional domains that make up a protein
   -- lists proteins with similar domain architectures

Vector Alignment Search Tool (VAST):
   -- structure-structure similarity search program

Threading: algorithms for recognition of protein folding

http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi?itool=toolbar

Thank you!

Dr. Medha Devare [email protected]

Life Sciences/Bioinformatics Specialist Albert R. Mann Library Cornell University, Ithaca, NY 14853

Exercise: BLASTp

Exercise: BLASTp continued

Top hit:

Multiple sequence alignments for pfam05724.4

Related Structure:

Please take a 5-min. survey before you leave!

http://www.mannlib.cornell.edu/

Reference and Instruction > Library Instruction > Workshops > Web survey

BLAST: all

BLAST - bacteria