Alternative Splicing and Protein Structure Evolution

Alternative Splicing and Protein Structure Evolution Fabian Birzele Munchen¨ 2008 Alternative Splicing and Protein Structure Evolution Fabian Birzele Dissertation an der Fakultat¨ fur¨ Mathematik, Informatik und Statistik der Ludwig–Maximilians–Universitat¨ Munchen¨ vorgelegt von Fabian Birzele aus Wurzburg¨ Munchen,¨ den 09.12.2008 Erstgutachter: Prof. Dr. Ralf Zimmer Zweitgutachter: Prof. Dr. Dimtrij Frishman Tag der mundlichen¨ Prufung:¨ 27.01.2009 Contents Abstract xiii Zusammenfassung xv 1 Introduction 1 I Protein Structure Comparison 7 2 Introduction to Protein Structure Analysis 9 2.1 Goals and open questions in protein structure analysis . 9 2.2 Existing approaches to protein structure comparison . 10 2.2.1 SCOP and CATH - the standard of truth . 10 2.2.2 Automated protein structure alignment and comparison . 11 3 A Systematic Comparison of SCOP and CATH 13 3.1 Methods . 14 3.1.1 Mapping of domain assignments . 14 3.1.2 Mapping inner nodes of the hierarchies . 15 3.2 Results . 15 3.2.1 Datasets . 15 3.2.2 Detailed Comparison of SCOP and CATH . 16 3.3 Applications of the SCOP-CATH mapping . 20 3.3.1 Benchmarking Structure-Comparison methods . 20 3.3.2 Inter-Fold Similarities revealed by Consistency Checks . 24 3.4 Conclusion . 25 4 Protein Structure Alignment considering Phenotypic Plasticity 27 4.1 Outline of the method . 29 4.2 Conclusion . 30 5 Vorolign - fast protein structure alignment using Voronoi contacts 33 5.1 Methods . 34 5.1.1 Voronoi tessellation of protein structures . 34 vi CONTENTS 5.1.2 Properties of Voronoi cells . 35 5.1.3 Similarity of Voronoi cells . 36 5.1.4 Pairwise alignment of protein structures . 37 5.1.5 Multiple alignment of protein structures . 38 5.1.6 Fast scan for family members: VorolignScan . 39 5.1.7 Automatic detection of domains in multi-domain structures . 40 5.1.8 Parameter Optimization . 40 5.2 Results . 41 5.2.1 Family recognition on a testset of 979 query proteins . 41 5.2.2 Detailed Evaluation on the SCOP-CATH set . 43 5.2.3 Alignment quality . 46 5.2.4 Specific examples for multiple alignment of protein structures . 47 5.2.5 Large scale evaluation of multiple alignment quality . 49 5.3 The AutoPSI database . 52 5.3.1 The current database content . 52 5.3.2 Database access . 53 5.4 Voronoi contact patterns around functionally important sites . 55 5.4.1 Outline of the method . 56 5.4.2 Results: Case study of the Trypsin-like serine protease family . 58 5.4.3 Discussion . 60 5.5 Conclusions . 61 II Alternative Splicing in the Context of Protein Structure 63 6 Introduction to the Analysis of Alternative Splicing 65 6.1 Fundamentals of alternative splicing . 66 6.1.1 The splicing process in the cell . 66 6.1.2 Different types of alternative splicing events . 66 6.1.3 Regulation of alternative splicing . 67 6.2 Biological functions of alternative splicing events . 68 6.3 Alternative splicing and disease . 69 7 Alternative Splicing and Protein Structure Evolution 71 7.1 Hypotheses and major concepts . 72 7.2 Methods . 72 7.2.1 Alternative splicing and literature data . 73 7.2.2 Protein structure assignment . 73 7.2.3 Assignment of protein structures to families . 74 7.2.4 Multiple structure alignments and evolutionary ”isoforms” . 74 7.2.5 Alternative splicing and alternative structural models . 74 7.3 Results . 75 7.3.1 Structural complexity of functional isoforms . 75 CONTENTS vii 7.3.2 Interpretation of splicing events using evolutionary isoforms . 79 7.3.3 Supporting hypotheses on fold evolution via alternative splicing data . 80 7.3.4 Alternative splicing: a mechanism to explore the protein fold space? . 81 7.4 Discussion . 84 8 Alternative Splicing and Protein Repeats 87 8.1 Methods . 88 8.1.1 Genomic data and annotated splice variants . 88 8.1.2 Splicing events annotated in Swissprot . 88 8.1.3 Repeat assignment and data collection . 89 8.1.4 Assessing the statistical significance of the results . 89 8.2 Results . 89 8.2.1 Frequency of repeats in Swissprot and Ensembl . 89 8.2.2 Alternative splicing of proteins containing repeats . 90 8.2.3 Alternative splicing of C2H2-Zinc-finger motifs . 94 8.2.4 Sculpting protein binding domains - Ankyrin repeats . 95 8.2.5 Alternative splicing of β-propellers . 97 8.2.6 Importance of splicing for complex tissue organizations . 97 8.3 Discussion . 98 III Genome-wide Detection and Analysis of Alternative Splicing 101 9 Detection and prediction of Alternative Splicing Events 103 10 The ProSAS - Database 107 10.1 The ProSAS pipeline . 107 10.1.1 Genome and alternative splicing data . 108 10.1.2 Protein structure prediction . 108 10.1.3 Characterization of splicing events . 109 10.1.4 Affymetrix data mapping . 110 10.1.5 Further data sources . 110 10.2 The ProSAS database content . 110 10.3 The ProSAS database web interface . 112 10.4 Conclusion . 115 11 PASS - Identifying splicing events in Affymetrix Exon Arrays 117 11.1 Outline of the method . 117 11.2 Results and Discussion . 119 12 Alternative splicing and proteome complexity in mass spectrometry data 123 12.1 Methods . 124 12.1.1 Identification of isoforms in precompiled MS peptide datasets . 124 12.1.2 Expected versus observed unique isoform peptides . 124 viii Contents 12.1.3 Sequence-based analysis of alternative splicing . 125 12.1.4 Protein structure prediction pipeline . 125 12.2 Results . 125 12.2.1 Presence of isoforms in mass spectrometry data . 126 12.2.2 Changing signal peptides, localization and domain composition via alternative splicing . 128 12.2.3 Structural analysis of isoforms identified in mass spectrometry data . 130 12.3 Discussion . 134 12.4 Outlook: Identification of novel splice variants in mass spectrometry data . 135 IV Conclusions and Outlook 139 13 Conclusion and Outlook 141 Bibliography 147 Acknowledgements 163 Curriculum Vitae 165 List of Figures 3.1 Examples of interfold similarities detected by the systematic comparison of SCOP and CATH . 21 3.2 Evaluation of the performance of TM-align on SCOP and the novel benchmark set of consistently classified pairs of structures . 22 4.1 Phenotypic plasticity observed in the pheromone binding domain family (a.39.2.1) 28 4.2 Schematic overview on the single steps of the PPM method . 29 5.1 Anomaly in a structural alignment due to structural divergence . 34 5.2 Construction of a two-dimensional Voronoi tessellation . 35 5.3 Basic principle of the Vorolign scoring function to score the similarity of two Voronoi cells . 38 5.4 Detailed evaluation of the performance of PPM and Vorolign on the SCOP- CATH consensus set . 45 5.5 Structural quality of Vorolign alignments across different sequence identities in more than 8000 pairs of structures from the same SCOP family . 46 5.6 Flexible structural alignment of members of the Calmodulin family computed by Vorolign . 47 5.7 Pairwise Vorolign alignment quality induced from multiple structure alignments of more than 800 SCOP families . 49 5.8 Conservation rate of biologically relevant residues (CSA patterns and cystein bridges) in multiple structure alignments . 51 5.9 Conservation rate of biologically relevant residues (PROSITE patterns and Swis- sprot annotated functional residues) in multiple structure alignments . 51 5.10 Detail view in the AutoPSI database web interface . 54 5.11 Schematic example of the consistency check in the Vorolign Pattern identification method . 57 5.12 PROSITE pattern matches of the catalytic triad on the Trypsin structure (PDB: 1a0j) . 58 5.13 Initial Voronoi pattern in the Trypsin structure (PDB: 1a0j) . 59 5.14 Final consensus Voronoipattern for eukaryotic Tyrpsin-like serine proteases family (SCOP: b.47.1.2) . 60 x List of Figures 6.1 Important consensus splicing signals in intronic sequences . 66 6.2 Different types of alternative splicing events . 67 7.1 Schematic.

Alternative Splicing and Protein Structure Evolution

DFFA Monoclonal Antibody (M05), Degradation of the Chromosomal DNA Into Nucleosomal Clone 3A11 Units

DFFB (NM 004402) Human Tagged ORF Clone Product Data

ICAD Antibody Cat

Proteomic Analysis of Ubiquitin Ligase KEAP1 Reveals Associated Proteins That Inhibit NRF2 Ubiquitination

Produktinformation

Anti-ICAD Antibody (ARG54401)

DFFA Antibody Cat

Primepcr™Assay Validation Report

Proteome-Scale Prediction of Structure and Func- Tion

DFFA Conjugated Antibody

Anti-DFFA Polyclonal Antibody Cat: K001939P Summary

The History of the CATH Structural Classification of Protein Domains