Computational Molecular Biology Lecture Notes by A.P. Gultyaev
Total Page:16
File Type:pdf, Size:1020Kb
Computational Molecular Biology Lecture Notes by A.P. Gultyaev Leiden Institute of Applied Computer Science (LIACS) Leiden University February 2016 !1 Contents Introduction . 3 1. Sequence databases . 4 1.1. Nucleotide sequence databases . 4 1.2. Protein sequence databases . 4 2. Sequence alignment . 5 2.1. Pairwise alignment . 5 2.2. Sequence database similarity search . 10 2.3. Multiple sequence alignment . 13 3. Sequence motifs, patterns and profile analysis . 15 4. Gene finding and genome annotation . 18 4.1. Search by signal . 18 4.2. Search by content . 19 4.3. Probabilistic models . 20 4.4. Similarity-based search . 21 4.5. Genome annotation strategies . 21 5. Protein structure prediction . 22 5.1. Homology modeling . 23 5.2. Fold recognition . 23 5.3. Ab initio protein structure prediction . 25 5.4. Combinations of approaches . 26 5.5. Predictions of coiled coil domains and transmembrane segments . 26 5.6. Inverse folding and protein design . 27 6. RNA structure prediction . 28 6.1. Thermodynamics of RNA folding and RNA structure predictions . 28 6.2. Comparative RNA analysis . 29 6.3. Detecting conserved structures in related RNA sequences . 29 6.4. RNA motif search . 30 Recommended reading . 31 Exercises . .32 !2 Introduction The course Computational Molecular Biology aims to make participants familiar with a number of widely used bioinformatic approaches. The purpose of the course is not to make a guided tour along the available resources in the Internet such as databases, programs etc. but rather to give an idea about the main principles behind efficient algorithms for genomic data analysis. Of course, the text below is just some (lecture) notes that may assist in understanding the material given in lectures and should not be considered as independent comprehensive description. This material is accompanied by exercises with tasks to be accomplished using a number of bioinformatic resources, in order to get hands-on experience. !3 1. Sequence databases The databases of protein amino acid sequences have appeared before nucleotide databases. The Atlas of Protein Sequences and Structures was published in 1965. The development of efficient DNA sequencing techniques resulted in the explosion of the information on nucleotide sequences. In the early 1980’s three major databases have been created: European EMBL nucleotide sequence database, American GenBank and Japanese DDBJ. In 1988 an agreement of a common format has been achieved. Nowadays, the three databases exchange all sequences. However, any update of a sequence entry is performed by a database that owns the entry from the moment of direct submission, to prevent conflicts of updates. 1.1. Nucleotide sequence databases The main elementary unit of sequence databases is a flatfile representing a single sequence entry. Although the database management systems (DBMS) of modern sequence databases may invoke relational database management systems (DBMS), flatfile format remains a convenient format of exchange between GenBank, EMBL and DDBJ databases. Some small (mainly syntax) differences between flatfiles of the three databases do exist, for instance, line- type prefixes at every line of EMBL files. Below the structure of the GenBank flatfile is considered as an example (see also “Sample GenBank Record”, http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) . 1.1.2 The GenBank flatfile The main three parts of the GenBank entry are the header (description of the record), features (annotations on the record) and the nucletide sequence itself. These main parts contain various datafields. The header contains the description of sequence including data such as names of organism, gene name, sequence length, keywords, literature reference data. A very important part of the header is identifiers that are used for searching the entries. Two identifiers are used: “Accession” and “GI” numbers. An accession number is unique for a sequence record. It does not change, even if information in the record is changed (e.g. sequence correction), but version number is assigned (U12345.1; U12345.2). GI-numbers (GenInfo) identifiers are unique for every sequence: if a sequence changes, a new GI-number is assigned. A separate number is used to every ORF encoded by a nucleotide sequence. Features describe the biological information relevant to the sequence (annotation), such as encoded proteins, RNA sequences, various signals such as protein binding sites, mRNA, exons and introns etc. Many of the datafields in GenBank sequence entry provide links to the relevant entries of other databases such as literature (e.g. PubMed) or protein databases. These are so-called hard links (between entries from different databases that share the same datafields) of the ENTREZ retrieval system. ENTREZ is maintained by American National Center for Biotechnology Information (NCBI) and provides unified access to various databases and resources. In addition to the access to the primary database of nucleotide sequences (to which data is submitted by scientific community), ENTREZ provides the access to various curated databases such as RefSeq (non-redundant reference sequences), ENTREZ Gene (gene-centered information) etc. 1.2. Protein sequence databases One of the oldest and very popular protein sequence databases is SWISS-PROT. It was created in 1987 by collaboration of University of Geneva and EMBL. The database is accessible via both Entrez (NCBI) and SRS (European Bioinformatics Institute, EBI) systems. It can be called a secondary (curated) database: it adds value to what is already present in a primary database (nucleotide sequence database). A very small proportion of amino acid sequences is directly submitted to SWISS-PROT. The data in SWISS-PROT sequence file may be divided into the core data and annotation. The core contains a sequence, bibliography and taxonomic information. The annotation includes function(s) of the protein, domains, specific sites etc. In addition to SWISS-PROT, the protein entries in the Entrez retrieval system have been compiled from several databases, and also include translations from annotated coding regions in GenBank and RefSeq. ENTREZ Protein database entries provide hard links to relevant entries from other databases. Furthermore, ENTREZ system provides the links to the protein sequences similar to the one contained in the accessed entry (a concept of neighboring entries in ENTREZ). It should be noted that some protein-coding sequences (open reading frames, ORFs) may remain unnoticed (unannotated). A popular resource of ENTREZ is the ORF Finder with relatively simple algorithm searching for potential ORFs in a given nucleotide sequence in all 6 reading frames (including complementary strand). !4 2. Sequence alignment In general, alignment of sequences can be defined as mapping between the residues of two or more sequences, that is, finding the residues that are related to each other due to phylogenetic or functional reasons. One of the main goals of sequence alignment is to determine whether sequences display sufficient similarity to conclude about their homology. However, similarity is not equivalent of homology. Similarity (of sequences) is a quantity that might be measured in e.g. percent identity. Homology means a common evolutionary history. In principle, all alignment methods try to follow the molecular mechanisms of sequence evolution, namely substitutions, deletions and insertions of monomers: AGTCA AGTCA AGT-CA ! ! ! AGACA AG-CA AGTACA (substitution) (deletion) (insertion) An alignment that spans complete sequences is called global alignment, an alignment of sequence fragments - local alignment. Depending on a biological problem, either global or local alignment is more efficient to detect a similarity between sequences. For instance, global alignments are advisable in case of relatively closely related sequences. Local alignments are a powerful tool to localize the most important regions of similarity, e.g. functional motifs, that may be flanked by variable regions that are less important for functioning. Obviously, local alignments should be applied in cases of mosaic arrangement of protein domains or for comparisons between spliced mRNAs and genomic sequences. Alignment tasks can be subdivided into pairwise alignment (two sequences), multiple alignment (more than two) and database similarity search (searching for the sequences in the database that are similar to a given sequence). 2.1. Pairwise alignment For two sequences, many different alignments are possible (for instance, about 10179 different alignments for two sequences of length 300). Obviously, an alignment with relatively high number of matching residues in two sequences and relatively small number of deletions/insertions is the most likely to have biological meaning, because it assumes smaller number of evolutionary events. The problem to find such an alignment is equivalent to finding the “best” path in a dot matrix, where two sequences are plotted at the coordinates of a two-dimensional graph and dots indicate matching residues. Qualitatively, the best global alignment would be a path visiting as many dots as possible with a preference for diagonal moves as compared to horizontal or vertical moves (because the latter represent deletions/insertions in the sequences). Dot matrices are also very demonstrative for visualizing alignments, especially after “filtering” local regions of low similarity, for instance, leaving only diagonals of consecutive similar residues. Sequence 2 Sequence 2 A G C T A G G A G A G C T A G G A G A . A . G .