Introduction to Sequence analysis

Etienne de Villiers

BecA-ILRI Hub, Nairobi, Kenya

Outline

1. Molecular sequences
2. Nucleic acid sequence analysis
3. Protein sequence analysis
4. Searching

Sequence analysis: overview

[Figure: flowchart of a sequence analysis project. A sequencing project, sequence database browsing or manual sequence entry yields a nucleotide sequence file. From there: search databases for similar sequences, search for protein coding regions, restriction mapping, PCR planning, sequence comparison and design of further experiments. Non-coding regions: search for known motifs, RNA structure prediction. Coding regions: translate into protein, giving a protein sequence file. Protein sequence: search databases for similar sequences, search for known motifs, sequence comparison, predict secondary structure, predict tertiary structure. Multiple sequence analysis: create a multiple alignment, edit the alignment, format the alignment for publication, molecular phylogeny, protein family analysis.]

Sequence entry

Sequences for analysis can be obtained from two main sources:

• Generated by yourself or

• Obtained from databases

Sequence formats

• Why different formats?
• Organise sequence information
• Database integration
• It is important to ensure that sequence files do not contain special characters. Plain ASCII files are suitable for most sequence programs.
• However, independent databases and some widely used programs developed slightly different formats for sequences.
• Using the different formats correctly is critical, as is the ability to recognise a sequence, file or entry and convert it from one format to another.

Main file formats used in Bioinformatics

• ASN.1
• EMBL
• Swiss Prot
• FASTA
• GenBank
• Phylip
• PIR
• Nexus
• GCG

Sequence formats

• There are many different (> 20) sequence formats, including GenBank, EMBL, SwissProt, FASTA and several others.

1. FASTA/Pearson format

>seq1
agctagct actgg
>seq2
aactaact attcg

2. GenBank format

LOCUS seq1 16bp
DEFINITION seq1, 16 bases, 2688 checksum.
ORIGIN
1 agctagctag
//
LOCUS seq2 20bp

Sequence format conversions
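Converting between formats starts with reading one of them. As a minimal sketch (not how ReadSeq or Seqret work internally), the FASTA layout can be parsed in a few lines of Python. The function name `parse_fasta` is made up for this illustration, and the example text is the two-record FASTA snippet from the slide.

```python
def parse_fasta(text):
    """Return a list of (identifier, sequence) tuples from FASTA text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            # Sequence lines may be wrapped or contain spaces; join them.
            chunks.append(line.replace(" ", ""))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = """>seq1
agctagct actgg
>seq2
aactaact attcg
"""
records = parse_fasta(example)
for identifier, sequence in records:
    print(identifier, len(sequence), sequence)
```

For real work a tested converter such as the tools listed here is the safer choice; the sketch only shows how little structure the FASTA format has.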

There are several computer programs able to convert formats:

ReadSeq
Available as a standalone package or on the web:
• http://bioportal.bic.nus.edu.sg/readseq/readseq.html
• http://www-bimas.cit.nih.gov/molbio/readseq/
• http://bioweb.pasteur.fr/seqanal/interfaces/readseq-simple.html

Seqret
A program in the EMBOSS suite

Molecular Sequence Databases

What kinds of analyses are there?

1. Finding regions of interest in nucleic acid sequence
2. Gene finding
3. Frequency analysis
4. Database searching – (Day 5)
5. Multiple alignment – (Day 6)
6. Measuring homology by pairwise alignment – (Day 6)

A gene codes for a protein

Gene/DNA CCTGAGCCAACTATTGATGAA

transcription

mRNA CCUGAGCCAACUAUUGAUGAA

translation

Protein PEPTIDE

Structure of Prokaryote and Eukaryote

Finding regions of interest in nucleic acid sequence
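The gene-to-protein example (CCTGAGCCAACTATTGATGAA to PEPTIDE) can be reproduced with a short Python sketch. It assumes the standard genetic code and, as on the slide, treats the DNA shown as the coding strand, so transcription is simply T to U. The names `transcribe` and `translate` are chosen for this illustration only.

```python
# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA in frame 1 until the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


gene = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(gene)
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```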

TTGCAAAACACCTATGAGGGTCAAAAAAGTTTTATTATATACACTTCCGGTTGTCGGTAT TCTTTGATTATATTTAATTTCGTTAGGAAAAGACCGGAAAAAGAAGAGGAACTCAAACCT CCTTCTGCATTAGAAGATGAACTTAAAAAACGTGAAGAAGAAAGCCGAAAACGCATGGAA GAAATGCAAAAGGAAATTCTCGAAAAAAAGTTAAGAGAAGGTAAAAAAGCCTTGGAAGAA CTTGAAAAACGTGAAAAAGAAGTGGTAGATGAGTTTGCAAAACACCTCAAAAAACCTGAA GAAAGACTTCCTAAAATTATTCTTACATTGGATTCCGGTTTTCCAACAGTTGATCCTATT

• Classification of lowly sequence repeats.
• Identification of gene components
  – exon/intron boundaries
  – promoters
  – transcription factor binding sites

Gene finding (1/3)

• The parts of a gene are scattered over the DNA.
• One approach: identify exactly where four specific signal types can be found.
  – The start codon (ATG).
  – The beginning of each intron (GT-).
  – The end of each intron (-AG).
  – The stop codon.

Gene finding (2/3)

• Second approach: content scoring methods
  – Analyse larger regions of sequence using codon frequency.
• Codon frequency is different in coding and noncoding regions:
  – the coding portion of exons
  – the noncoding portion of exons
  – introns
  – intergenic regions

Gene finding (3/3)

• The two major types: eukaryotic and prokaryotic
• Eukaryotic gene finding is much harder
  – Presence of introns.
  – Coding density is low.
• Underlying technologies: hidden Markov models, decision trees, neural networks...

Gene finding
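The signal-based approach can be sketched for the simplest case: an intron-free, prokaryote-style gene on the forward strand, where only the start codon and the stop codon matter. The function `find_orfs` and the toy sequence are illustrative assumptions; real gene finders (hidden Markov models and the like) do far more, and intron signals (GT-, -AG) are deliberately ignored here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (frame, start, end) for each ATG...stop stretch.

    Only the three forward reading frames are scanned. `frame` is 1-3,
    `start` is the 0-based index of the A in ATG, and `end` is the
    index just past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons from ATG up to (not including) the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, i + 3))
                start = None
    return orfs


toy = "CCATGAAATTTTAGGG"
print(find_orfs(toy))  # one ORF in frame 3: ATG AAA TTT TAG
```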

TTGCAAAACACCTATGAGGGTCAAAAAAGTTTTATTATATACACTTCCGGTTGTCGGTAT TCTTTGATTATATTTAATTTCGTTAGGAAAAGACCGGAAAAAGAAGAGGAACTCAAACCT CCTTCTGCATTAGAAGATGAACTTAAAAAACGTGAAGAAGAAAGCCGAAAACGCATGGAA GAAATGCAAAAGGAAATTCTCGAAAAAAAGTTAAGAGAAGGTAAAAAAGCCTTGGAAGAA CTTGAAAAACGTGAAAAAGAAGTGGTAGATGAGTTTGCAAAACACCTCAAAAAACCTGAA GAAAGACTTCCTAAAATTATTCTTACATTGGATTCCGGTAGTCCAACAGTTGATCCTATT Frequency analysis

• Determine the frequency of occurrence of sequence elements.
• Applications
  – Using oligomer frequency to distinguish coding and noncoding regions.
  – Frequency of amino acids for predicting 3D structure of a protein, its functionality and its location in the cell.
  – Finding ribosome binding sites.

using codon frequency
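Counting oligomers is the basic operation behind these applications. A minimal sketch, with the hypothetical helper names `kmer_frequencies` and `codon_frequencies`: the first counts overlapping words of length k, the second counts non-overlapping codons in a chosen reading frame so that the three frames can be compared.

```python
from collections import Counter


def kmer_frequencies(seq, k, step=1, offset=0):
    """Relative frequency of each word of length k in seq."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k] for i in range(offset, len(seq) - k + 1, step)
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def codon_frequencies(seq, frame=0):
    """Codon usage in one reading frame (frame = 0, 1 or 2)."""
    return kmer_frequencies(seq, 3, step=3, offset=frame)


dna = "ATGAAAATGGCC"
for frame in range(3):
    print(frame + 1, codon_frequencies(dna, frame))
```

Comparing such per-frame codon tables against the known codon usage of an organism is one way a content-scoring method can separate coding from noncoding sequence.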

[Figure: codon-frequency plot for reading frames 1, 2 and 3; labels: coding, non-coding, coding sequence, correct start]

Profile / PSSM

• DNA / protein segments of the same length L;
• Often represented as a positional frequency matrix;

LTMTRGDIGNYLGLTVETISRLLGRFQKSGML
LTMTRGDIGNYLGLTIETISRLLGRFQKSGMI
LTMTRGDIGNYLGLTVETISRLLGRFQKSEIL
LTMTRGDIGNYLGLTVETISRLLGRLQKMGIL
LAMSRNEIGNYLGLAVETVSRVFSRFQQNELI
LAMSRNEIGNYLGLAVETVSRVFTRFQQNGLI
LPMSRNEIGNYLGLAVETVSRVFTRFQQNGLL
VRMSREEIGNYLGLTLETVSRLFSRFGREGLI
LRMSREEIGSYLGLKLETVSRTLSKFHQEGLI
LPMCRRDIGDYLGLTLETVSRALSQLHTQGIL
LPMSRRDIADYLGLTVETVSRAVSQLHTDGVL
LPMSRQDIADYLGLTIETVSRTFTKLERHGAI
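A positional frequency matrix for these twelve segments can be computed directly. The sketch below, with the assumed name `position_frequency_matrix`, returns one dictionary per alignment column giving the fraction of segments that carry each residue at that position. Pseudocounts and log-odds conversion, which a real PSSM would normally add, are left out.

```python
from collections import Counter

SEGMENTS = [
    "LTMTRGDIGNYLGLTVETISRLLGRFQKSGML",
    "LTMTRGDIGNYLGLTIETISRLLGRFQKSGMI",
    "LTMTRGDIGNYLGLTVETISRLLGRFQKSEIL",
    "LTMTRGDIGNYLGLTVETISRLLGRLQKMGIL",
    "LAMSRNEIGNYLGLAVETVSRVFSRFQQNELI",
    "LAMSRNEIGNYLGLAVETVSRVFTRFQQNGLI",
    "LPMSRNEIGNYLGLAVETVSRVFTRFQQNGLL",
    "VRMSREEIGNYLGLTLETVSRLFSRFGREGLI",
    "LRMSREEIGSYLGLKLETVSRTLSKFHQEGLI",
    "LPMCRRDIGDYLGLTLETVSRALSQLHTQGIL",
    "LPMSRRDIADYLGLTVETVSRAVSQLHTDGVL",
    "LPMSRQDIADYLGLTIETVSRTFTKLERHGAI",
]


def position_frequency_matrix(segments):
    """One {residue: frequency} dict per column of equal-length segments."""
    length = len(segments[0])
    if any(len(s) != length for s in segments):
        raise ValueError("all segments must have the same length L")
    n = len(segments)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*segments)
    ]


pfm = position_frequency_matrix(SEGMENTS)
for position, column in enumerate(pfm[:3], start=1):
    print(position, column)
```

Fully conserved columns, such as the M at position 3, show up as a single residue with frequency 1.0.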

Protein Sequence Analysis

• Physico-chemical properties.
• Cellular localization.
• Signal peptides.
• Transmembrane domains.
• Post-translational modifications.
• Motifs & domains.
• Secondary structure.
• Other resources.

ExPASy (Expert Protein Analysis System)

• Swiss Institute of Bioinformatics (SIB).
• Dedicated to the analysis of protein sequences and structures.
• Many of the programs for protein sequence analysis can be accessed via ExPASy.

http://www.expasy.org/tools/

1) Physico-chemical properties:

• ProtParam tool
  o molecular weight
  o theoretical pI (the pH at which the protein carries no net electrical charge)
  o amino acid composition
  o atomic composition
  o extinction coefficient
  o estimated half-life
  o instability index
  o aliphatic index
  o grand average of hydropathicity (GRAVY)

2) Cellular localization:

• Proteins destined for particular subcellular localizations have distinct amino acid properties, particularly in their N-terminal regions.
• These properties are used to predict whether a protein is localized in the cytoplasm, nucleus or mitochondria, is retained in the ER, or is destined for the lysosome (vacuole) or the peroxisome.
• PSORT
• The end of the output gives the percentage likelihood of each subcellular localization.

3) Signal peptides:

• Proteins destined for secretion, operation with the endoplasmic reticulum, lysosomes and many transmembrane proteins are synthesized with leading (N-terminal) signal peptides of 13–36 residues.
• The SignalP WWW server can be used to predict the presence and location of signal peptide cleavage sites in your proteins.
• It is useful to know whether your protein has a signal peptide, as it indicates that the protein may be secreted from the cell.
• Proteins in their active form will have had their signal peptides removed.

4) Transmembrane domains:

• The TMpred program predicts membrane-spanning regions and their orientation.

• The algorithm is based on the statistical analysis of TMbase, a database of naturally occurring transmembrane proteins.

• The presence of transmembrane domains is an indication that the protein is membrane-bound, for example located on the cell surface.

5) Post-translational modifications:

• After translation has occurred, proteins may undergo a number of post-translational modifications.
• These can include cleavage of the pro-region to release the active protein, removal of the signal peptide, and numerous covalent modifications such as acetylations, glycosylations, hydroxylations, methylations and phosphorylations.
• Post-translational modifications may alter the molecular weight of your protein and thus its position on a gel.
• Many programs are available for predicting the presence of post-translational modifications. We will take a look at one for the prediction of O-glycosylation sites in mammalian proteins.
• These programs work by looking for consensus sites. Just because a site is found does not mean that a modification definitely occurs.

6) Motifs and Domains:

• Motifs and domains give you information on the function of your protein.

• Search the protein against one of the motif or profile databases.

• ProfileScan allows you to search both the Prosite and Pfam databases simultaneously.

7) Secondary Structure Prediction:

• WHY: – If protein structure, even secondary structure, can be accurately predicted from the now abundantly available gene and protein sequences, such sequences become immensely more valuable for the understanding of drug design, the genetic basis of disease, the role of protein structure in its enzymatic, structural, and signal transduction functions, and basic physiology from molecular to cellular, to fully systemic levels.

• JPRED works by combining a number of modern, high quality prediction methods to form a consensus.

Secondary Structure Prediction

• Essentially protein secondary structure consists of 3 major conformations;

. α Helix.

. β pleated sheet.

. coil conformation.

Sequence Comparisons

• The comparison of DNA sequences is the most used method in bioinformatics:
  – Annotation of new nucleotide and protein sequences
  – Construction of protein structures
  – Design and analysis of bioinformatic and biological experiments.
• Nature acts conservatively: it does not develop a new kind of biology for every life form but continuously changes and adapts a proven general concept.
• One may transfer functional information from one protein to another if both possess a certain degree of similarity.

DB searches for similar sequences

• Since Charles Darwin, the idea of a common origin of species has become widely accepted. However, the level of similarity at the molecular level between distant species remained unclear until the 1970s and 1980s.
• At that time it was established that many DNA and particularly protein molecules retain significant (>60-70%) or high (>85%) similarity hundreds of millions of years after separation from the common ancestor.
• This discovery, together with the practical need to search growing databases, led to the development of effective methods of similarity search.
• Two programs that greatly facilitated similarity searching were developed: FASTA (Pearson and Lipman 1988) and BLAST (Altschul et al. 1990).

Basics of similarity searches

• The basic step in any similarity search is an alignment of two or more sequences. Principles of alignment will be considered during the next lecture.
• The search provides a list of DB sequences with which a query sequence can be aligned. A scoring procedure is then applied, which measures the degree of similarity, from 100% identity to a loose similarity.
• A common reason for performing a DB search is to find a related gene. A matched gene (or any other sequence) may provide a clue as to function.
• An alternative task is achieved when a sequence with a known function or role is used as a query to search a species' genome.
• The search must be fast and sensitive enough.

Inferring function by homology

• The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology searching, the equivalent genes in one species to those known to be important in other model species.

• Logic: if the linear alignment of a pair of sequences is similar, then we can infer that the 3-dimensional structure is similar; if the 3-D structure is similar then there is a good chance that the function is similar.

Sequence comparison through pairwise alignments

• Goal of pairwise comparison is to find conserved regions (if any) between two sequences

• Extrapolate information about our sequence using the known characteristics of the other sequence

THIO_EMENI  GFVVVDCFATWCGPCKAIAPTVEKFAQTY
            G ++VD +A WCGPCK IAP +++ A  Y
???         GAILVDFWAEWCGPCKMIAPILDEIADEY

[Figure: annotation is extrapolated from the SwissProt entry THIO_EMENI to the unknown sequence (???)]

Why Align Sequences?

• DNA sequences (4 letters in alphabet)
  – GTAAACTGGTACT…
• Amino acid (protein) sequences (20 letters)
  – SSHLDKLMNEFF…
• Align them so we can search databases
  – To help predict structure/function of new genes
  – In particular, look for homologues (evolutionary relatives)
• 3D-pssm (Imperial College - Structure Prediction)
  – http://www.sbg.bio.ic.ac.uk/servers/3dpssm
  – Give it a gene sequence
  – It predicts the protein structure

Do alignments make sense ?

Evolution of sequences
• Sequences evolve through mutation and selection
  – Selective pressure is different for each residue position in a protein (i.e. conservation of active site, structure, charge, etc.)
• Modular nature of proteins
  – Nature keeps re-using domains

• Alignments try to tell the evolutionary story of the proteins

Relationships

[Figure: relationships between Same Sequence, Same Origin, Same Function and Same 3D Fold]

• Two similar regions of the Drosophila melanogaster Slit and Notch proteins

970 980 990 1000 1010 1020
SLIT_DROME FSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFC
..:.: :. :.: ...:.: .. : :.. : ::.. . :.: ::..:. :. :. :
NOTC_DROME YKCECPRGFYDAHCLSDVDECASN-PCVNEGRCEDGINEFICHCPPGYTGKRCELDIDEC
740 750 760 770 780 790

• Comparing the tissue-type and urokinase-type plasminogen activators, displayed using a diagonal plot or Dotplot.

[Figure: dotplot with axes Tissue-Type plasminogen Activator and Urokinase-Type plasminogen Activator]
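The idea behind a dotplot can be sketched in a few lines: slide a window along both sequences and mark every pair of positions where the windows agree well enough. This toy version scores windows by simple identity; tools such as Dotlet use a scoring matrix instead, so treat the function `dotplot` and its parameters as illustrative assumptions.

```python
def dotplot(seq_a, seq_b, window=3, min_matches=3):
    """Return one text row per window start in seq_a.

    A '*' at column j of row i means the windows starting at seq_a[i]
    and seq_b[j] share at least `min_matches` identical positions.
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            row += "*" if matches >= min_matches else "."
        rows.append(row)
    return rows


for line in dotplot("GARFIELDTHECAT", "GARFIELDTHEFATCAT"):
    print(line)
```

Similar regions appear as diagonal runs of stars; a shifted diagonal indicates an insertion or deletion between the two sequences.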

URL: www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

• As simple as projecting the diagonals onto the axis.

[Figure: dotplot of Tissue-Type plasminogen Activator against Urokinase-Type plasminogen Activator, with the diagonals projected onto the axes as regions A, A', B, C, D (tissue-type) and A, B, C, D (urokinase-type)]

Some definitions

Identity
Proportion of pairs of identical residues between two aligned sequences. Generally expressed as a percentage. This value strongly depends on how the two sequences are aligned.

Similarity
Proportion of pairs of similar residues between two aligned sequences. Whether two residues are similar is determined by a substitution matrix. This value also depends strongly on how the two sequences are aligned, as well as on the substitution matrix used.

Homology
Two sequences are homologous if and only if they have a common ancestor. There is no such thing as a level of homology! (It's either yes or no.)
• Homologous sequences do not necessarily serve the same function...
• ... Nor are they always highly similar: structure may be conserved while sequence is not.

Concept of a sequence alignment

• Pairwise Alignment:
  – Explicit mapping between the residues of 2 sequences

Seq A  GARFIELDTHELASTFA-TCAT
       |||||||||||    || ||||
Seq B  GARFIELDTHEVERYFASTCAT

[Figure labels: deletion (the gap in Seq A), errors / mismatches (LAST vs VERY), insertion (the extra S in Seq B)]

  – Tolerant to errors (mismatches, insertions / deletions or indels)
  – Evaluation of the alignment in a biological context (significance)

Number of alignments

• There are many ways to align two sequences
• Consider the sequence fragments below: a simple alignment shows some conserved portions

CGATGCAGACGTCA |||||||| CGATGCAAGACGTCA but also:

CGATGCAGACGTCA |||||||| CGATGCAAGACGTCA
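To get a feel for how quickly the number of possible alignments grows with sequence length, the count can be computed exactly. The sketch below assumes one common counting convention, in which an alignment of two sequences of length n corresponds to a path through an n x n grid and the total is the binomial coefficient C(2n, n); other conventions give different (larger) numbers, and the function name `count_alignments` is made up for this illustration.

```python
import math


def count_alignments(n):
    """Number of grid paths for two sequences of length n: C(2n, n)."""
    return math.comb(2 * n, n)


for n in (2, 10, 100, 1000):
    total = count_alignments(n)
    # Report the order of magnitude rather than the full integer.
    print(n, "about 10^%d" % (len(str(total)) - 1))
```

Under this convention two sequences of 1000 residues already give a number with 601 digits, which is why exhaustive enumeration is never an option.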

• Number of possible alignments for 2 sequences of length 1000 residues:
  – more than 10^600 gapped alignments (Avogadro's number is about 10^24; the estimated number of atoms in the universe is about 10^80)

What is a good alignment ?

• We need a way to evaluate the biological meaning of a given alignment

• Intuitively we "know" that the following alignment:

CGAGGCACAACGTCA
||| |||  ||||||
CGATGCAAGACGTCA

is better than:

ATTGGACAGCAATCAGG
|     ||  |     |
ACGATGCAAGACGTCAG

• We can express this notion more rigorously, by using a scoring system

Simple alignment scores

• A simple way (but not the best) to score an alignment is to count 1 for each match and 0 for each mismatch.

CGAGGCACAACGTCA
||| |||  ||||||
CGATGCAAGACGTCA
Score: 12

ATTGGACAGCAATCAGG
|     ||  |     |
ACGATGCAAGACGTCAG
Score: 5

Scoring schemes - amino acids
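The match-counting scheme used for the two nucleotide alignments (1 per match, 0 per mismatch) is easy to check in code. The function name `simple_score` is an assumption for this sketch, and it handles ungapped alignments of equal length only.

```python
def simple_score(seq_a, seq_b, match=1, mismatch=0):
    """Score an ungapped alignment of two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


print(simple_score("CGAGGCACAACGTCA", "CGATGCAAGACGTCA"))      # 12
print(simple_score("ATTGGACAGCAATCAGG", "ACGATGCAAGACGTCAG"))  # 5
```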

• The scoring system used for nucleic acids doesn't take into account
  – The likelihood of one amino acid changing to another
  – Some amino acid substitutions are disastrous
    • So they don't survive evolution
  – Some substitutions barely change anything
    • Because the two amino acids are chemically quite similar
• Scoring schemes address this problem
  – Give scores to the chances of each substitution
• 2 possibilities:
  – Use empirical evidence
    • Of actual substitutions in known homologues (families)
  – Use theory from chemistry (hydrophobicity, etc.)

BLOSUM62 Scheme

• Blocks Amino Acid Substitution Matrices • Empirical method – Based on roughly 2000 amino acid patterns (blocks) – Found in more than 500 families of related proteins

• Calculate the Log-odds scores for each pair (R1, R2)

– Let O = observed frequency R1 <=> R2

– Let E = expected frequency R1 <=> R2 [happening by chance]

– I.e., Score = round(2 * log2(O/E))

• To calculate the score for an alignment of two sequences
  – Add up the pairwise scores for residues

BLOSUM62 Substitution Matrix
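The log-odds formula Score = round(2 * log2(O/E)) can be tried out on a few frequencies. The numbers below are invented purely to show the behaviour of the formula and are not taken from the Blocks database; `log_odds_score` is a name chosen for this sketch.

```python
import math


def log_odds_score(observed, expected):
    """Score = round(2 * log2(O / E)), i.e. log-odds in half-bit units."""
    return round(2 * math.log2(observed / expected))


# Hypothetical (O, E) pairs, for illustration only.
print(log_odds_score(0.04, 0.01))   # seen 4x more often than chance: 4
print(log_odds_score(0.01, 0.01))   # seen exactly as often as chance: 0
print(log_odds_score(0.005, 0.01))  # seen half as often as chance: -2
```

Positive scores therefore mark substitutions observed more often than expected by chance, and negative scores mark substitutions that are selected against.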

• Zero: by chance
  – + more than chance
  – - less than chance
• Arranged by
  – Sidegroups
  – So, high scoring in the end boxes
• Example
  – M, I, L, V
  – Interchangeable

Example Calculation
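The example calculation on this slide can be checked with a small script. Only the BLOSUM62 entries needed for this one alignment are included (a subset typed in by hand, so treat it as an illustration rather than a full matrix), and the helper names are assumptions.

```python
# BLOSUM62 entries for the residue pairs that occur in the example.
BLOSUM62_SUBSET = {
    ("S", "H"): -1, ("S", "S"): 4, ("H", "H"): 8, ("L", "L"): 4,
    ("D", "K"): -1, ("K", "L"): -2, ("M", "M"): 5, ("R", "G"): -2,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


query = "SSHLDKLMR"
dbase = "HSHLKLLMG"
per_position = [pair_score(q, d) for q, d in zip(query, dbase)]
print(per_position)       # [-1, 4, 8, 4, -1, -2, 4, 5, -2]
print(sum(per_position))  # 19
```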

• Query = S S H L D K L M R
• Dbase = H S H L K L L M G
• Score = -1 4 8 4 -1 -2 4 5 -2

• Total score = -1+4+8+4+-1+-2+4+5+-2 = 19
• Write Blosum(Query,Dbase) = 19
  – Not standard to do this

Basic Local Alignment Search Tool (BLAST) was developed as a new way to perform sequence similarity searches. "Local" means it searches and aligns sequence segments, rather than aligning the entire sequence. It is able to detect relationships among sequences which share only isolated regions of similarity. Currently, it is the most popular and most accepted sequence analysis tool.

Why BLAST?

• Identify unknown sequences – The best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is a well-characterized sequence, then you may have access to a wealth of biological information.
• Help gene/protein function and structure prediction – genes with similar sequences tend to share similar functions or structure.
• Identify protein families – group related (paralog or ortholog) genes and their proteins into a family.
• Prepare sequences for multiple alignments
• And more …

BLAST Algorithm

• Idea: statistically significant alignments (hits)
  – Will have regions of at least 3 identical letters
  – Or at least high scoring with respect to the BLOSUM matrix

CCNDHRKMTCSPNDNNRK
TTNDHRMTACSPDNNNKH

more likely than

CCNDHRKMTCSPNDNNRK
YTNHHMMTTYSLDNNNKK

• Based on small local alignments
• Makes use of lookup tables

A Question

Question: Given the protein sequence

SLAALLNKCKTPQGQRLVNQW

and the word length L = 3, how does the BLAST algorithm find the highest scoring alignment between this sequence and another sequence?

Answer: Explaining the BLAST Algorithm

1. Query sequence must be split into words of defined length. A list of words of length 3 (L) in the query protein sequence is made starting with positions 1,2, and 3; then 2,3, and 4; etc. Our query sequence:

SLAALLNKCKTPQGQRLVNQW

SLA, LAA, AAL, ALL, LLN, LNK, NKC, KCK, CKT, KTP, TPQ, PQG, QGQ, GQR, QRL, RLV, LVN, VNQ, NQW

Con…BLAST Algorithm
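Step 1, splitting the query into overlapping words, is a one-line sliding window. The function name `query_words` is an assumption for this sketch; a query of 21 residues with word length 3 yields 21 - 3 + 1 = 19 words.

```python
def query_words(sequence, length=3):
    """All overlapping words of the given length, in query order."""
    return [
        sequence[i:i + length]
        for i in range(len(sequence) - length + 1)
    ]


words = query_words("SLAALLNKCKTPQGQRLVNQW", 3)
print(len(words))
print(", ".join(words))
```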

2. Define a threshold alignment score T (neighbourhood score threshold).
3. Find all word-pairs of length L with score ≥ T
   – e.g. find all w such that S(w, PQG) ≥ T
   – In other words, each word of the query sequence is scored against every possible combination of three amino acids.
   – This is done using a scoring matrix (e.g., BLOSUM 62).
   – Note: There are a total of 20 x 20 x 20 = 8,000 possible three-letter words, and hence 8,000 possible match scores for each query word

Con…BLAST Algorithm

Neighbourhood words to PQG

Neighbourhood words (score ≥ T):
PQG 18
PEG 15
PRG 14
PKG 14
PDG 13
PHG 13
PMG 13
PSG 13
---- Neighbourhood Score Threshold (T=13) ----
PQA 12
PQN 12
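The neighbourhood of PQG can be recomputed by scoring all 8,000 three-letter words. The sketch below includes only the three BLOSUM62 rows needed for the query word PQG (P, Q and G), typed in by hand in the order ARNDCQEGHILKMFPSTWYV, so it works for this word only; the function names are assumptions. A full enumeration may also turn up words that the excerpt above does not list.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for the query residues P, Q and G only.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3,
          -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3,
          -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4,
          -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def word_score(query_word, candidate):
    """Sum of substitution scores, position by position."""
    return sum(
        BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
        for q, c in zip(query_word, candidate)
    )


def neighbourhood(query_word, threshold):
    """All words scoring at least `threshold` against query_word."""
    hits = {}
    for letters in product(AMINO_ACIDS, repeat=len(query_word)):
        candidate = "".join(letters)
        score = word_score(query_word, candidate)
        if score >= threshold:
            hits[candidate] = score
    return hits


neighbours = neighbourhood("PQG", 13)
for word, score in sorted(neighbours.items(), key=lambda item: -item[1]):
    print(word, score)
```

Raising the threshold shrinks this list and speeds up the search, at the cost of sensitivity.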

Note: This procedure is repeated for each three-letter word in the query sequence

Con….BLAST Algorithm

4. Now, search the database for all 'hits': sequences with exact matches to each w.
5. Extend the alignment of each 'hit' in both directions while the score increases, producing High-scoring Segment Pairs (HSPs), which are locally optimal ungapped alignments.
6. Return sequences with HSPs that have statistically significantly higher scores than a threshold Smax
   – Smax is obtained empirically from random sequences

Con….BLAST Algorithm

• So….

SLAALLNKCKTPQGQRLVNQW
+LA++L+   TP G R++ +W
TLASVLDCTVTPMGSRMLKRW

High-scoring Segment Pairs (HSPs)

Con….BLAST Algorithm

7. Varying the threshold alignment score T
   – Search time decreases as T is increased, because fewer word pairs are found.
   – Sensitivity of the search decreases as T is increased, because word pairs are overlooked (homologous or similar sequences may be discarded).
   – Note: The score of the alignment Smax AND the associated statistical significance are required to assess whether homology is suggested.

Con….BLAST Algorithm

Finally:
• For each statistically significant HSP
  – The alignment is reported
• If a sequence D has two HSPs with Query Q
  – Two different alignments are reported
• Later versions of BLAST
  – Try to unify the two alignments

Blast@NCBI

Blast advanced parameters

Blast results

Interpret BLAST results - Distribution

[Figure: graphical overview of BLAST results; labels: Query sequence; BLAST hits, click to access the pairwise alignment]

This image shows the distribution of BLAST hits on the query sequence. Each line represents a hit. The span of a line represents the region where similarity is detected. Different colors represent different ranges of scores.

Interpret BLAST results - Description

The description (also called definition) lines are listed below under the heading "Sequences producing significant alignments". The term "significant" simply refers to all those hits whose E value was less than the threshold. It does not imply biological significance.

[Figure: annotated description lines]
• ID (GI #, refseq #, DB-specific ID #): click to access the record in GenBank
• Gene/sequence definition
• Max score: higher is better. Click to access the pairwise alignment
• Expect value: lower is better. It indicates how likely it is that this is a random hit
• Links

Interpret BLAST results – pairwise alignments

Query line: the segment from the query sequence.
Sbjct line: the segment from the hit (subject) sequence.
Middle line: the consensus bases

Summary - If your sequence is NUCLEOTIDE

Length | DB | Purpose | Program
20 bp or longer | Nucleic | Identify the query sequence | MegaBlast, blastn
20 bp or longer | Nucleic | Find sequences similar to query sequence | blastn
20 bp or longer | Nucleic | Find similar proteins to translated query in a translated database | tblastx
20 bp or longer | Protein | Find similar proteins to translated query in a protein database | blastx
7-20 bp | Nucleic | Find primer binding sites or map short contiguous motifs | Search for short, nearly exact matches

Summary - If your sequence is PROTEIN

Length | DB | Purpose | Program
15 residues or longer | Protein | Identify the query sequence or find protein sequences similar to query | blastp
15 residues or longer | Protein | Find members of a protein family or build a custom position-specific score matrix | PSI-blast
15 residues or longer | Protein | Find proteins similar to the query around a given pattern | PHI-blast
15 residues or longer | Nucleic | Find similar proteins in a translated nucleotide database | tblastn
5-15 residues | Protein | Search for peptide motifs | Search for short, nearly exact matches
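The core of the two summary tables is a lookup from query type and database type to program. A minimal sketch of that lookup is given below; the dictionary keys and the function name `choose_blast_program` are invented for this illustration and have nothing to do with the NCBI interface itself. The specialised programs (MegaBlast, PSI-blast, PHI-blast) are left out because they depend on the purpose of the search rather than on the sequence types.

```python
# (query type, database type) -> basic BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("nucleotide", "translated nucleotide"): "tblastx",
    ("protein", "protein"): "blastp",
    ("protein", "translated nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type):
    """Return the basic BLAST program for a query/database combination."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            "no basic BLAST program for %s query against %s database"
            % (query_type, database_type)
        )


print(choose_blast_program("nucleotide", "protein"))             # blastx
print(choose_blast_program("protein", "translated nucleotide"))  # tblastn
```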