Sequence Analysis Primer UWBC Biotechnical Resource Series

Richard R. Burgess, Series Editor University of Wisconsin Biotechnology Center Madison, Wisconsin

M. Gribskov and J. Devereux, eds. Sequence Analysis Primer

In Preparation: G. Grant, ed. Synthetic Peptides: Design and Use

B. Brownstein and D. Nelson, eds. YAC Libraries: Construction and Use

R. Garber, In Situ Localization: Analysis of RNA, DNA and Protein Molecules

Sequence Analysis Primer

Edited by

Michael Gribskov and John Devereux

Stockton Press
New York London Tokyo Melbourne Hong Kong

Cover illustration by Dr. Mark S. Boguski

©Stockton Press, 1991

Published in the United States and Canada by Stockton Press 15 E. 26th Street, New York, NY 10010

Library of Congress Cataloging-In-Publication Data
I. Gribskov, Michael, 1958- . II. Devereux, John, 1947- .
Sequence analysis primer / Michael Gribskov, John Devereux.
p. cm.
Includes bibliographical references.
Includes index.
ISBN 978-1-56159-007-0 (soft cover)
1. Nucleotide sequence--Methodology. 2. Amino acid sequence--Methodology.
QP620.S47 1990
574.87'322--dc20
for Library of Congress 90-10164 CIP

Published in the United Kingdom by MACMILLAN PUBLISHERS LTD (Journals Division), 1991

British Library Cataloging in Publication Data
Gribskov, Michael
Sequence analysis primer.
1. Organisms. DNA. Sequences. Analysis. Applications of computer systems
I. Title II. Devereux, John
574.873282
ISBN 978-0-333-55092-2
ISBN 978-1-349-21355-9 (eBook)
DOI 10.1007/978-1-349-21355-9

9 8 7 6 5 4 3 2 1

Contributors

Mark S. Boguski, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD

Lisa Caballero, Biocomputing Center, Salk Institute, San Diego, CA

David Eisenberg, Molecular Biology Institute and Department of Chemistry and Biochemistry, University of California, Los Angeles, CA

Keith Elliston, Department of Biological Data, Merck Sharp & Dohme Research Laboratory, Rahway, NJ

Roland Lüthy, Molecular Biology Institute and Department of Chemistry and Biochemistry, University of California, Los Angeles, CA

Peter M. Rice, Computing Group, European Molecular Biology Laboratory, Heidelberg, FRG

David J. States, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD

Contents

Foreword

1 DNA

Sequencing Project Management
Getting Started
Choosing a Software Package
Sequence Reading Hardware
Creating a Database
Project Standards
Sequence Entry and Quality Control
Sequence Entry
Sequence Checking
Removing Vector Sequence
Restriction Site Checking
Sequence Assembly
Finding Overlaps and Building Contigs
Confirming Overlaps
Shotgun Projects
Progressive Deletions
Walking Primers
The Last Few Gaps
Error Detection and Correction
Incorrect Overlaps
Progress Reports
Open Reading Frames and Other Traps
The Final Result
Identification of Simple Sites and Transcriptional Signals
Sequence Patterns
Pattern Matching Programs
Restriction Enzymes
Restriction Enzyme Databases
Searching for Fixed Patterns
Promoters and Enhancers
Bacterial Promoters
Eukaryotic Promoters
Enhancers and Transcription Factors
Terminators and Attenuators
Coding Region Identification
Identification of mRNA Signals
Ribosome Binding Sites
The Initiation Codon
The Stop Codon
Suppressor tRNAs
RNA Splice Sites
Polyadenylation Signals
Coding Region Sequence Patterns
Base Composition Bias
TestCode Analysis
Codon Bias
Correspondence Analysis
Codon Usage and Gene Expression Prediction
Assembly of Continuous Coding Sequences
RNA Structure
Optimal Folding
Suboptimal Folding
Pseudoknots
Interactive Folding
Representation of Folded Structures
DNA Structure
Identification of Stem and Loop Regions
DNA Conformation
Summary

2 Protein

Physical Properties
Molecular Weight and Amino Acid Composition
Isoelectric Point and Extinction Coefficient
Structural Properties
Secondary Structure Prediction
Hydrophobicity Patterns
Detection of the Outside Face of Helices
Detection of Motifs
Post-translational Modification Sites
Supersecondary Structure
Folding Domain Motifs
Sequence Families
Summary

3 Similarity and Homology

Similarity versus Homology
Dot Matrix Methods
Algorithms
The Simple Dot Plot
The Window Approach and Output Filtering
Estimation of Statistical Significance
Examples
Gene and Genome Structure and Evolution
RNA and DNA Structure and Folding
Protein Sequence Organization and Comparison
Analysis of Repetitive Sequence Proteins
Summary and Future Developments
Dynamic Programming Methods
Derivation of Dynamic Programming Alignment
Simple Example of Dynamic Programming Alignment
More Complicated Alignments
Other Derivations
Extensions
Scoring Systems
Fast Methods
Hashing and Neighborhood Algorithms
Finite State Machines
Statistical Approaches - BLAST
Multiple Sequence Alignment
Extension of Pairwise Alignment Techniques
Data Interdependence and Sequence Weighting
Examples of Multiple Sequence Alignment
Multiple Alignment for Database Searching
Similarity and Significance
Theoretical Foundations of Molecular Evolution
Statistical Analysis of Alignments
Examples
Summary

4 Practical Aspects: Analysis of Notch

Background
Strategy
cDNA Sequence Analysis
Sequencing Project Management
Restriction Map
Finding Open Reading Frames
Codon Preference
TestCode Statistic
Windows
Protein Analysis: Determining Structure and Function
Database Searches
FASTA
Scan for Motifs with PROSITE
BLAST
Aligning Notch and Xotch
The Algorithm
Looking for Internal Repeats
Dotplot
Aligning the Repeated Regions
Profile Analysis
Hydropathy Plot
Transmembrane Region?
Antigenic Determinants
Secondary Structure
Protein Degradation: PEST Sequences
Genomic DNA
Weight Matrices
Simple Regulatory Elements
RNA Stem and Loop Structures
Summary

References

Appendices

I. Nucleic Acid Codes
II. Amino Acid Codes and Properties
III. Amino Acid Composition of Proteins in PIR Release 26
IV. Log-odds Matrices
V. A Partial List of Software Suppliers
VI. Hardware
VII. Electronic Communication
VIII. Sequence and Structural Databases
IX. Data Submission
X. Glossary

Index

Foreword

This volume, Sequence Analysis Primer, is the first in the Biotechnical Resource Series being produced by the University of Wisconsin Biotechnology Center and published by Stockton Press. Books in this series will cover two broad areas that we refer to as current and emerging technologies. Books describing current technologies, such as this volume, will be directed at the beginning or intermediate user of the technology. Books on emerging technologies will cover topics at a more advanced level. As we see it, many emerging technologies will soon become current and widely used.

Rapid advances in research have paved the way for rapid changes in the ways we do research, leading to more research advances. Once we relied solely on our labs' own power (and possibly shared an ultracentrifuge with other groups). Now many of us have come to rely on core facilities or contract services to provide us with some of the more powerful research tools. We no longer need to be experts in all areas necessary for modern research, but we do need some familiarity with these services to use them most effectively and efficiently. We hope these books will help provide that familiarity.

We have four main goals we wish to achieve with these books. First, we want to provide information enabling you to understand the principles of the methods. Second, we want to address many of the "common" problems faced by beginners so that they do not repeat the same mistakes. Third, we want to describe relevant applications of the technology to help those searching for answers to their research problems. Most of all, we hope these books become helpful references for the labs using the technologies covered.

In this first volume, we hope to familiarize the reader with computer-aided sequence analysis. Our ability to obtain information on the sequences of protein, RNA, and DNA has developed only during the last 40 years.
The first sequence information was obtained from proteins in the laboratories of Edman and Sanger in the mid-1950s. Later, in the mid-1960s, the laboratories of Holley and Sanger developed the ability to sequence RNA. Methods for obtaining DNA sequence were not available until 1975, when they were developed by Maxam and Gilbert and, in an alternative approach, by Sanger's laboratory. Now the amount of sequence obtained from DNA vastly overshadows the sequence obtained for proteins and RNA. The databases that contain primary sequence information have grown exponentially over the last 10 years and now, in early 1991, the GenBank nucleic acid database holds over 50 million nucleotides of DNA sequence.

Until recently, a majority of DNA sequencing occurred in individual research groups that were studying a protein or enzyme of interest. They isolated the gene for that protein and then sequenced the DNA. They were therefore somewhat familiar with the product coded for by the region they had sequenced. This focus has now shifted, driven by the impetus of the human genome program, a very ambitious program to obtain extremely large amounts of sequence information, including the entire DNA sequence of the genomes of the human and a number of major research organisms. Much larger amounts of DNA sequence are now being generated, and vastly more will be generated as sequencing techniques continue to improve. But unlike the earlier sequence work, which focused on a particular gene, most of these data will be raw, uncharacterized sequence for which it is not known whether it codes for a protein or, if it does, what that protein is.
The growing availability of DNA sequence information, whether generated in individual labs, in centralized DNA sequencing facilities in universities or companies, or by large sequencing projects, opens up tremendous opportunities for the analysis of the coding regions and the proteins that are encoded. No matter how a sequence is accumulated, more and more people will need to analyze sequences. Fortunately, the growth of sequencing technology was paralleled by the increasing sophistication of the computer hardware and software that are essential to store and analyze sequence information. Whether researchers carry out computer-aided sequence analysis on a PC in their laboratory or office, use their PC as a terminal to connect to a larger centralized computer, or have access to a computer resource center that will carry out the analysis as needed, they still need to understand the procedures of sequence analysis: what kinds of analysis are available, what limitations or caveats are associated with each sort of analysis, and what can and cannot be concluded from the analysis.

This volume is designed to help researchers new to protein or DNA sequence analysis understand the sequence features that can be searched for and identified, understand the assumptions behind the analyses, and interpret more accurately what the analysis reveals about their particular sequence. In a lab that has obtained a DNA sequence, for example, the researcher may ask, "What can I learn from the DNA sequence now that I have it?" Once they have identified the protein coding region within that DNA sequence and have translated it into a protein sequence, they can ask, "What can I learn from the protein sequence about the structure or function or characteristics of that protein?"
Finally, they can ask, "Is my sequence similar or homologous to sequences that have already been obtained in other labs?" Significant similarity between two sequences implies some common ancestry and suggests both structural and functional characteristics of an uncharacterized protein.

This volume is divided into four chapters. The first chapter introduces the reader to the computer analyses that can be carried out on DNA or RNA sequences. The second chapter introduces the analyses that can be carried out on a protein sequence. The third chapter treats the question of identifying similar sequences and the inference of homology. Finally, in the fourth chapter, all of these forms of analysis are brought to bear on one particular gene, the Drosophila Notch gene, to show how the various analyses can be combined to learn a great deal about the structure and the function of a gene and the protein that it encodes. The references for all four chapters are found together in a reference section. This is followed by a series of appendices designed to provide in one place useful information about nucleotide base and amino acid designations and characteristics, software and hardware used in sequence analysis, and databases and bulletin boards of interest to persons involved in sequence analysis, as well as a glossary of specialized terms that are used throughout the book.

This book is not designed to be an advanced text on sequence analysis but rather to offer the reader an introduction to the terms, the kinds of computer analyses presently available, the limitations of these analyses, and examples of how these analyses are applied to specific DNA, RNA, and protein sequences. It is our hope that this volume will prove to be both useful and educational and that it will provide a stepping stone to the more advanced literature.

I would like to thank the volume editors, Michael Gribskov and John Devereux, for suggesting the overall organization of the volume and helping make the individual sections fit together, and the chapter authors for their thoughtful contributions. Finally, I want especially to thank Carolyn Stock, series developmental editor at the University of Wisconsin Biotechnology Center, for her massive efforts to actually assemble and refine the volume and take care of the innumerable details that were involved in producing Sequence Analysis Primer.

Richard Burgess, Series Editor
Madison, Wisconsin
March 9, 1991