Substitution Matrices
Introduction to bioinformatics 2007
Lecture 8
Substitution Matrices
Centre for Integrative Bioinformatics VU
Substitution matrices – Sequence analysis 2006

Sequence Analysis
Finding relationships between genes and gene products of different species, including those at large evolutionary distances.

Archaea
Domain Archaea is mostly composed of cells that live in extreme environments. While they are able to live elsewhere, they are usually not found there because outside of extreme environments they are competitively excluded by other organisms.
Species of the domain Archaea
• are not inhibited by antibiotics,
• lack peptidoglycan in their cell wall (unlike bacteria, which have this sugar/polypeptide compound),
• and can have branched carbon chains in their membrane lipids of the phospholipid bilayer.

Archaea (cont.)
• It is believed that Archaea are very similar to prokaryotes (e.g. bacteria) that inhabited the earth billions of years ago. It is also believed that eukaryotes evolved from Archaea, because they share many mRNA sequences, have similar RNA polymerases, and have introns.
• Therefore, it is generally assumed that the domains Archaea and Bacteria branched from each other very early in history, after which membrane infolding* produced eukaryotic cells in the archaean branch approximately 1.7 billion years ago.
There are three main groups of Archaea:
1. extreme halophiles (salt),
2. methanogens (methane producing anaerobes),
3. and hyperthermophiles (e.g. living at temperatures >100 °C!).
*Membrane infolding is believed to have led to the nucleus of eukaryotic cells, which is a membrane-enveloped cell organelle that holds the cellular DNA. Prokaryotic cells are more primitive and do not have a nucleus.

Example of nucleotide sequence database entry for GenBank

LOCUS       DRODPPC      4001 bp            INV       15-MAR-1990
DEFINITION  D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
ACCESSION   M30116
KEYWORDS    .
SOURCE      D.melanogaster, cDNA to mRNA.
  ORGANISM  Drosophila melanogaster
            Eukaryota; mitochondrial eukaryotes; Metazoa; Arthropoda;
            Tracheata; Insecta; Pterygota; Diptera; Brachycera;
            Muscomorpha; Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1 (bases 1 to 4001)
  AUTHORS   Padgett, R.W., St Johnston, R.D. and Gelbart, W.M.
  TITLE     A transcript from a Drosophila pattern gene predicts a protein
            homologous to the transforming growth factor-beta family
  JOURNAL   Nature 325, 81-84 (1987)
  MEDLINE   87090408
COMMENT     The initiation codon could be at either 1188-1190 or 1587-1589
FEATURES             Location/Qualifiers
     source          1..4001
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
     mRNA            <1..3918
                     /gene="dpp"
                     /note="decapentaplegic protein mRNA"
                     /db_xref="FlyBase:FBgn0000490"
     gene            1..4001
                     /note="decapentaplegic"
                     /gene="dpp"
                     /allele=""
                     /db_xref="FlyBase:FBgn0000490"
     CDS             1188..2954
                     /gene="dpp"
                     /note="decapentaplegic protein (1188 could be 1587)"
                     /codon_start=1
                     /db_xref="FlyBase:FBgn0000490"
                     /db_xref="PID:g157292"
                     /translation="MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLA
                     SASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR
                     ……………………
                     LGYDAYYCHGKCPFPLADHFNSTNAVVQTLVNNMNPGKVPKACCVPTQLDSVAMLYL
                     NDQSTBVVLKNYQEMTBBGCGCR"
BASE COUNT     1170 a   1078 c    956 g    797 t
ORIGIN
        1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
       61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
      121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc
      181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
      241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
      301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
      361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
      421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
      481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
      541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
      601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
      ………………………….
     3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc
     3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta
     3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//

Example of protein sequence database entry for SWISS-PROT (now UniProt)

ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DT   01-APR-1988 (REL. 07, CREATED)
DT   01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN).
GN   DPP.
OS   DROSOPHILA MELANOGASTER (FRUIT FLY).
OC   EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   87090408
RA   PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.;
RL   NATURE 325:81-84 (1987)
RN   [2]
RP   CHARACTERIZATION, AND SEQUENCE OF 457-476.
RM   90258853
RA   PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.;
RL   MOL. CELL. BIOL. 10:2669-2677(1990).
CC   -!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE
CC       EMBRYONIC DORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL
CC       VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY.
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A26158.
DR   HSSP; P08112; 1TFG.
DR   FLYBASE; FBGN0000490; DPP.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
FT   SIGNAL        1      ?       POTENTIAL.
FT   PROPEP        ?    456
FT   CHAIN       457    588       DECAPENTAPLEGIC PROTEIN.
FT   DISULFID    487    553       BY SIMILARITY.
FT   DISULFID    516    585       BY SIMILARITY.
FT   DISULFID    520    587       BY SIMILARITY.
FT   DISULFID    552    552       INTERCHAIN (BY SIMILARITY).
FT   CARBOHYD    120    120       POTENTIAL.
FT   CARBOHYD    342    342       POTENTIAL.
FT   CARBOHYD    377    377       POTENTIAL.
FT   CARBOHYD    529    529       POTENTIAL.
SQ   SEQUENCE   588 AA;  65850 MW;  1768420 CN;
     MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG
     ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN
     KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV
     LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP
     KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH
     HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG
     QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH
     VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR
     KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV
     NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR

Definition of substitution matrix
• Two-dimensional matrix with score values describing the probability of one amino acid or nucleotide being replaced by another during sequence evolution.

Scoring matrices for nucleotide sequences
• Can be simple:
  • e.g. positive value for match and zero for mismatch.
  • frequencies of mutation are equal for all bases.
• Can be more complicated:
  • taking into account transitions and transversions (e.g. Kimura model)
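
The two kinds of nucleotide scoring scheme described above can be sketched in a few lines of Python. This is only an illustration: the function names and every numeric score are assumptions chosen for the example, and the second matrix merely separates transitions from transversions. It is not the Kimura model itself, which describes substitution rates rather than alignment scores.

```python
# Illustrative nucleotide scoring schemes. All numeric scores are assumed
# example values, not values taken from the lecture.
BASES = "ACTG"
PURINES = {"A", "G"}  # the remaining bases, C and T, are pyrimidines


def simple_matrix(match=1, mismatch=0):
    """Positive score for a match, zero for any mismatch."""
    return {(a, b): match if a == b else mismatch for a in BASES for b in BASES}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return a != b and (a in PURINES) == (b in PURINES)


def ts_tv_matrix(match=1.0, transition=-0.5, transversion=-1.0):
    """Penalise transversions more heavily than transitions."""
    matrix = {}
    for a in BASES:
        for b in BASES:
            if a == b:
                matrix[(a, b)] = match
            elif is_transition(a, b):
                matrix[(a, b)] = transition
            else:
                matrix[(a, b)] = transversion
    return matrix


def score_ungapped(seq1, seq2, matrix):
    """Sum the pairwise scores of two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(score_ungapped("ACGT", "ATGC", simple_matrix()))
    print(score_ungapped("ACGT", "ATGC", ts_tv_matrix()))
```

With the simple matrix only the two identical positions contribute to the score, whereas the second matrix also charges the two C/T transitions, at a lower cost than it would charge a transversion such as A/T.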

Scoring matrices for nucleotide sequences
• Simple model

      A  C  T  G
   A  1  0  0  0
   C  0  1  0  0
   T  0  0  1  0
   G  0  0  0  1

• Kimura
[Figure: Kimura model; labels: purines, pyrimidines]

What is better to align? DNA or protein sequences?
1. Many mutations within DNA are synonymous ⇒ divergence overestimation
2. Evolutionary relationships can be more accurately expressed using a 20 × 20 amino acid exchange table
3. DNA sequences contain non-coding regions, which should be avoided in homology searches.
4. Still an issue when translating into (six) protein sequences through a codon table.
5. Searching at protein level: frameshifts can occur, leading to stretches of incorrect amino acids and possibly elongation. However, frameshifts normally result in stretches of highly unlikely amino acids.

So?
Rule of thumb: ⇒ if ORF exists, then align at protein level

Scoring matrices for amino acid sequences
• Are complicated, scoring has to reflect:
  • Physico-chemical properties of aa’s
  • Likelihood of residues being substituted among truly homologous sequences
• Certain aa with similar properties can be more easily substituted: preserve structure/function
• “Disruptive” substitution is less likely to be selected in evolution (e.g.