生物信息学：导论与方法 Bioinformatics: Introduction and Methods

https://www.coursera.org/course/pkubioinfo
Ge Gao & Liping Wei, Center for Bioinformatics, Peking University

Sequence Database Search
Unit 1: Sequence Databases
Ge Gao, Ph.D., Center for Bioinformatics, Peking University

Recap: pairwise alignment
[Figure: example alignment (Russ Altman, BMI214)]

Global alignment (Needleman-Wunsch):
\[
F(0,0) = 0, \qquad
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) + d \\
F(i,\,j-1) + d
\end{cases}
\]

Local alignment (Smith-Waterman):
\[
F(0,0) = 0, \qquad
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) + d \\
F(i,\,j-1) + d \\
0
\end{cases}
\]

Here $s(x_i, y_j)$ is the substitution score for aligning $x_i$ with $y_j$, and $d$ is the gap score (negative when gaps are penalized).

Copyright © Peking University

Sequence Alignment
"How can we determine the similarity between two sequences?"
Why is it important?
• Similar sequence → Similar structure → Similar function (the "Sequence-to-Structure-to-Function Paradigm")
• Similar sequence → Common ancestor ("Homology")

Sequence Database Searching
• Rather than aligning sequences pair by pair, it is more common to search a sequence database in a high-throughput style.
• That is, to identify similarities between
  – a novel query sequence whose structure and function are usually unknown and/or uncharacterized
  – sequences in (public) databases whose structures and functions have been elucidated and annotated.
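As a refresher on the two recurrences recapped at the start of this unit, the following is a minimal sketch of how the dynamic programming matrix F can be filled. The function name `align_score`, the scoring values (match = 2, mismatch = -2, gap = -5), and the linear-gap boundary F(i,0) = i·d used for the global case are illustrative assumptions, not values taken from the slides.

```python
def align_score(x, y, match=2, mismatch=-2, gap=-5, local=False):
    """Fill the dynamic programming matrix F and return the alignment score.

    local=False: global alignment (Needleman-Wunsch style recurrence).
    local=True:  local alignment (Smith-Waterman style recurrence, floor at 0).
    `gap` is a gap *score*, so it is added and should be negative.
    """
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]

    if not local:
        # Assumed boundary for the global case: leading gaps cost i * gap.
        for i in range(1, rows):
            F[i][0] = i * gap
        for j in range(1, cols):
            F[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                F[i - 1][j - 1] + s,  # x_i aligned to y_j
                F[i - 1][j] + gap,    # x_i aligned to a gap
                F[i][j - 1] + gap,    # y_j aligned to a gap
            ]
            if local:
                candidates.append(0)  # a local alignment may restart anywhere
            F[i][j] = max(candidates)
            best = max(best, F[i][j])

    # Local: best cell anywhere in the matrix; global: bottom-right cell.
    return best if local else F[-1][-1]


local_score = align_score("AAG", "AGC", local=True)
global_score = align_score("AAG", "AGC")
print(local_score, global_score)
```

With these assumed scores, the best local alignment of AAG and AGC is AG/AG with score 4, while the global score of the same pair is -2. Only the extra 0 in the maximum and the choice of the returned cell distinguish the two variants.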

Sequence Database Searching
• The query sequence is compared/aligned with every sequence in the database
• Statistically significant hits are assumed to be related to the query sequence
  – Similar function/structure
  – Common evolutionary ancestor

A (naïve) algorithm for database searching
• Get the input query
• Get a database sequence
• Run pair-wise global/local alignment between the query and the database sequence

\[
F(0,0) = 0, \qquad
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & x_i \text{ aligned to } y_j \\
F(i-1,\,j) + d & x_i \text{ aligned to a gap} \\
F(i,\,j-1) + d & y_j \text{ aligned to a gap}
\end{cases}
\]
[Figure: cell F(i,j) of the X-by-Y matrix is computed from F(i-1,j-1) via s(x_i, y_j), and from F(i-1,j) and F(i,j-1) via d]

Dynamic programming matrix
• There are n·m entries in the matrix (sequence X of length m, sequence Y of length n).
• Each entry requires a constant number c of operation(s).
• c·m·n operations are needed in total for one pair-wise alignment.

• Say your query sequence (HBA_HUMAN) has 142 amino acids
• The most recent release of the human-curated Swiss-Prot protein database contains 540,958 sequences with 192,206,270 amino acids (Sept 18th, 2013);
  – On average, the sequence length is 192,206,270/540,958 ≈ 355.3 aa
• And assume your super-fast computer can run one operation in 1 μs (= 0.000001 s)
• Then you will need about 7.8 hr to search ONE query against the database!

Source: http://web.expasy.org/docs/relnotes/relstat.html
Source: http://www.ebi.ac.uk/uniprot/TrEMBLstats

[Figure: growth of GenBank and SRA in base pairs (bp) against year y]
• GenBank: $\log_2(\mathrm{bp}) = -1.2 \times 10^{3} + 0.59\,y$; $R^2 = 0.97$, p-value $< 2.2 \times 10^{-16}$; doubling every 20 months
• SRA: $\log_2(\mathrm{bp}) = -4.6 \times 10^{3} + 2.3\,y$; $R^2 = 0.97$, p-value $< 2.2 \times 10^{-16}$; doubling every 5 months!
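The back-of-the-envelope estimate for the naive search, and the doubling times implied by the fitted growth slopes, can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: the constant c is not given in the slides and is taken as 1 here, and the function names are hypothetical.

```python
QUERY_LEN = 142              # HBA_HUMAN, amino acids
DB_SEQUENCES = 540_958       # Swiss-Prot sequences (Sept 18th, 2013)
DB_RESIDUES = 192_206_270    # Swiss-Prot amino acids (Sept 18th, 2013)


def naive_search_hours(query_len, db_residues, ops_per_cell=1, seconds_per_op=1e-6):
    """Hours needed to fill one DP matrix per database sequence.

    Summing c*m*n over all database sequences equals
    c * query_len * (total residues in the database).
    ops_per_cell (the constant c) is an assumption; the slides do not state it.
    """
    cells = query_len * db_residues
    return cells * ops_per_cell * seconds_per_op / 3600


def doubling_months(slope_log2_per_year):
    """Doubling time in months for a fit log2(bp) = a + slope * year."""
    return 12 / slope_log2_per_year


mean_length = DB_RESIDUES / DB_SEQUENCES
hours = naive_search_hours(QUERY_LEN, DB_RESIDUES)

print(f"mean sequence length: {mean_length:.1f} aa")
print(f"naive search time:    {hours:.1f} h")
print(f"GenBank doubling:     {doubling_months(0.59):.0f} months")
print(f"SRA doubling:         {doubling_months(2.3):.0f} months")
```

With c = 1 the estimate comes out at roughly 7.6 hours, the same order as the figure of about 7.8 hr quoted above; the exact value depends on what is assumed for c. The slopes 0.59 and 2.3 per year correspond to doubling times of about 20 and 5 months.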

[Figure: worked example of filling the dynamic programming matrix; each cell F(i,j) is computed from F(i-1,j-1), F(i-1,j) and F(i,j-1). Columns: AAG; rows: AGC; the highest-scoring cell (4) corresponds to the alignment AG / AG]

         A   A   G
     0   0   0   0
 A   0   2   2   0
 G   0   0   0   4
 C   0   0   0   0

BLAST: Intro
• To make database searching efficient, a heuristic algorithm, BLAST (Basic Local Alignment Search Tool), was proposed by Altschul et al. in 1990.
• BLAST finds the highest-scoring locally optimal alignments between a query sequence and a database.
  – Very fast algorithm
  – Can be used to search extremely large databases
  – Sufficiently sensitive and selective for most purposes
  – Robust: the default parameters just work for most cases

[Figure: from data to knowledge: protein sequence → functional information]
(Modified from http://education.expasy.org/cours/Turin/UniProtKB_Turin.ppt)

UniProtKB/Swiss-Prot (www.uniprot.org)
One protein sequence; one gene; one species
• Protein and gene names
• Taxonomic information
• References
• Manual annotation: function, subcellular location, catalytic activity, disease, tissue specificity, pathway…
• Manual annotation: post-translational modifications, variants, transmembrane domains, signal peptide…
• Alternative products: protein sequences produced by alternative splicing, alternative promoter usage, alternative initiation…
• Manual annotation: keywords and Gene Ontology
• Cross-references to over 125 databases

Example protein sequence:
MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAAR
AFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVK
NMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFL
NKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE
GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRG
TVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGR
AGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKD
EGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV
VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG

UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links!
• Sequence: EMBL, IPI, PIR, RefSeq, UniGene
• Genome annotation: Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
• Polymorphism: dbSNP
• Proteomic: PeptideAtlas, PRIDE, ProMEX
• Family and domain: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPFAM, TIGRFAMs
• Gene expression: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline
• Protein family/group: Allergome, CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
• Ontologies: GO
• Organism-specific: AGD, ArachnoServer, CGD, ConoServer, CTD, CYGD, dictyBase, EchoBASE, EcoGene, euHCVdb, EuPathDB, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, GenoList, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, MaizeGDB, MGI, MIM, neXtProt, Orphanet, PharmGKB, PseudoCAP, RGD, SGD, TAIR, TubercuList, WormBase, Xenbase, ZFIN
• 2D gel: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, UCD-2DPAGE, World-2DPAGE
• 3D structure: DisProt, HSSP, PDB, PDBsum, ProteinModelPortal, SMR
• Phylogenomic dbs: eggNOG, GeneTree, HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB, ProtClustDB
• PPI: DIP, IntAct, MINT, STRING
• PTM: GlycoSuiteDB, PhosphoSite, PhosSite
• Enzyme and pathway: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
• Other: BindingDB, DrugBank, NextBio, PMAP-CutDB
(Modified from http://education.expasy.org/cours/Turin/UniProtKB_Turin.ppt)

[Figure: annotated screenshot; labeled regions: Summary, Domain(s), Hits summary]

[Figure: HBA_HUMAN and HBB_HUMAN]

Algorithm           Alignment type
Needleman-Wunsch    Global alignment
Smith-Waterman      Local alignment
BLAST               Local alignment

Summary Questions
• Why do we need to perform a database search?
• What is the major challenge/obstacle when searching a sequence database?
