
Introduction to Bioinformatics
Sequence analysis

Etienne de Villiers
BecA-ILRI Hub
Nairobi, Kenya

Outline
1. Molecular sequences
2. Nucleic acid sequence analysis
3. Protein sequence analysis
4. Homology Searching

Sequence analysis: overview
[Flowchart. Recoverable box labels: sequencing project management; sequence database browsing; manual sequence entry; sequence entry; nucleotide sequence analysis; nucleotide sequence file; search databases for similar sequences; search for protein coding regions; translate coding into protein; non-coding; restriction mapping; PCR planning; search for known motifs; RNA structure prediction; sequence comparison; protein sequence analysis; protein sequence file; predict secondary structure; predict tertiary structure; multiple sequence analysis; create a multiple sequence alignment; edit the alignment; format the alignment for publication; molecular phylogeny; protein family analysis; design further experiments.]

Sequence entry
Sequences for analysis can be obtained from two main sources:
• Generated by yourself, or
• Obtained from databases

Sequence formats
• Why different formats?
• Organise sequence information
• Database integration
• It is important to ensure that sequence files do not contain special characters. ASCII files are suitable for most sequence programs.
• However, independent databases and some widely used programs have developed slightly different formats for sequences.
• Correct use of the different formats is critical, as is the ability to recognize a format and convert a sequence, file or entry from one format to another.

Main file formats used in Bioinformatics
ASN.1, EMBL, SwissProt, FASTA, GenBank, Phylip, PIR, Nexus, GCG

Sequence formats
• There are many different (> 20) sequence formats, including GenBank, EMBL, SwissProt, FASTA and several others.

1. FASTA/Pearson format
>seq1
agctagct actgg
>seq2
aactaact attcg

2. GenBank format
LOCUS seq1 16bp
DEFINITION seq1, 16 bases, 2688 checksum.
ORIGIN
1 agctagctag
//
LOCUS seq2 20bp

Sequence format conversions
There are several computer programs able to convert formats:

ReadSeq
Available as a standalone package or on the web:
• http://bioportal.bic.nus.edu.sg/readseq/readseq.html
• http://www-bimas.cit.nih.gov/molbio/readseq/
• http://bioweb.pasteur.fr/seqanal/interfaces/readseq-simple.html

Seqret
A program in the EMBOSS suite

Molecular Sequence Databases

What kinds of analyses are there?
1. Finding regions of interest in nucleic acid sequence
2. Gene finding
3. Frequency analysis
4. Database searching – (Day 5)
5. Multiple alignment – (Day 6)
6. Measuring homology by pairwise alignment – (Day 6)

A gene codes for a protein
Gene/DNA   CCTGAGCCAACTATTGATGAA
   transcription
mRNA       CCUGAGCCAACUAUUGAUGAA
   translation
Protein    PEPTIDE

Structure of Prokaryote and Eukaryote genes

Finding regions of interest in nucleic acid sequence
TTGCAAAACACCTATGAGGGTCAAAAAAGTTTTATTATATACACTTCCGGTTGTCGGTAT
TCTTTGATTATATTTAATTTCGTTAGGAAAAGACCGGAAAAAGAAGAGGAACTCAAACCT
CCTTCTGCATTAGAAGATGAACTTAAAAAACGTGAAGAAGAAAGCCGAAAACGCATGGAA
GAAATGCAAAAGGAAATTCTCGAAAAAAAGTTAAGAGAAGGTAAAAAAGCCTTGGAAGAA
CTTGAAAAACGTGAAAAAGAAGTGGTAGATGAGTTTGCAAAACACCTCAAAAAACCTGAA
GAAAGACTTCCTAAAATTATTCTTACATTGGATTCCGGTTTTCCAACAGTTGATCCTATT
• Classification of lowly sequence repeats.
• Identification of gene components
  – exon/intron boundaries
  – promoters
  – transcription factor binding sites

Gene finding (1/3)
• The parts of a gene are scattered over the DNA.
• One approach: identify exactly where four specific signal types can be found.
  – The start codon (ATG).
  – The beginning of each intron (GT-).
  – The end of each intron (-AG).
  – The stop codon.

Gene finding (2/3)
• Second approach: content scoring methods
  – analyze larger regions of sequence using codon frequency
• Codon frequency differs between coding and noncoding regions:
  – the coding portion of exons
  – the noncoding portion of exons
  – introns
  – intergenic regions
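
The content-scoring idea in Gene finding (2/3) starts from counting codons in each reading frame. The sketch below is only an illustration of that counting step, not a gene finder: the function names `codon_counts` and `codon_frequencies` are made up for this example, only the three forward frames are considered, and the test sequence is the short DNA string from the "A gene codes for a protein" slide.

```python
from collections import Counter


def codon_counts(seq, frame=0):
    """Count the codons of one forward reading frame (frame = 0, 1 or 2).

    Hypothetical helper for illustration; incomplete trailing codons
    are ignored.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return Counter(codons)


def codon_frequencies(seq, frame=0):
    """Turn the codon counts of one frame into relative frequencies."""
    counts = codon_counts(seq, frame)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


# DNA string taken from the "A gene codes for a protein" slide.
dna = "CCTGAGCCAACTATTGATGAA"

for frame in range(3):
    freqs = codon_frequencies(dna, frame)
    # Print the codons of each frame with their relative frequency.
    print("frame", frame + 1, sorted(freqs.items()))
```

A real content-scoring method would compare such per-frame frequencies against codon usage tables estimated from known coding and noncoding regions of the organism; those tables are deliberately left out here because the slides do not supply them.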
Gene finding (3/3)
• The two major types: eukaryotic and prokaryotic.
• Eukaryotic gene finding is much harder:
  – Presence of introns.
  – Coding density is low.
• Underlying technologies: hidden Markov models, decision trees, neural networks...

Gene finding
TTGCAAAACACCTATGAGGGTCAAAAAAGTTTTATTATATACACTTCCGGTTGTCGGTAT
TCTTTGATTATATTTAATTTCGTTAGGAAAAGACCGGAAAAAGAAGAGGAACTCAAACCT
CCTTCTGCATTAGAAGATGAACTTAAAAAACGTGAAGAAGAAAGCCGAAAACGCATGGAA
GAAATGCAAAAGGAAATTCTCGAAAAAAAGTTAAGAGAAGGTAAAAAAGCCTTGGAAGAA
CTTGAAAAACGTGAAAAAGAAGTGGTAGATGAGTTTGCAAAACACCTCAAAAAACCTGAA
GAAAGACTTCCTAAAATTATTCTTACATTGGATTCCGGTAGTCCAACAGTTGATCCTATT

Frequency analysis
• Determine the frequency of occurrence of sequence elements.
• Applications
  – using oligomer frequency to distinguish coding and noncoding regions.
  – frequency of amino acids for predicting the 3D structure of a protein, its functionality and its location in the cell.
  – finding ribosome binding sites.

Gene prediction using codon frequency
[Figure: codon-frequency plots for reading frames 1, 2 and 3, labelled with coding and non-coding regions, the coding sequence and the correct start.]

Profile / PSSM
• DNA / protein segments of the same length L;
• Often represented as a positional frequency matrix;

LTMTRGDIGNYLGLTVETISRLLGRFQKSGML
LTMTRGDIGNYLGLTIETISRLLGRFQKSGMI
LTMTRGDIGNYLGLTVETISRLLGRFQKSEIL
LTMTRGDIGNYLGLTVETISRLLGRLQKMGIL
LAMSRNEIGNYLGLAVETVSRVFSRFQQNELI
LAMSRNEIGNYLGLAVETVSRVFTRFQQNGLI
LPMSRNEIGNYLGLAVETVSRVFTRFQQNGLL
VRMSREEIGNYLGLTLETVSRLFSRFGREGLI
LRMSREEIGSYLGLKLETVSRTLSKFHQEGLI
LPMCRRDIGDYLGLTLETVSRALSQLHTQGIL
LPMSRRDIADYLGLTVETVSRAVSQLHTDGVL
LPMSRQDIADYLGLTIETVSRTFTKLERHGAI

Protein Sequence Analysis
Physico-chemical properties.
Cellular localization.
Signal peptides.
Transmembrane domains.
Post-translational modifications.
Motifs & domains.
Secondary structure.
Other resources.

ExPASy (Expert Protein Analysis System)
• Swiss Institute of Bioinformatics (SIB).
• Dedicated to the analysis of protein sequences and structures.
• Many of the programs for protein sequence analysis can be accessed via ExPASy.
http://www.expasy.org/tools/

1) Physico-chemical properties:
• ProtParam tool
  o molecular weight
  o theoretical pI (the pH at which the protein carries no net electrical charge)
  o amino acid composition
  o atomic composition
  o extinction coefficient
  o estimated half-life
  o instability index
  o aliphatic index
  o grand average of hydropathicity (GRAVY)

2) Cellular localization:
• Proteins destined for particular subcellular localizations have distinct amino acid properties, particularly in their N-terminal regions.
• These properties are used to predict whether a protein is localized in the cytoplasm, nucleus or mitochondria, is retained in the ER, or is destined for the lysosome (vacuole) or the peroxisome.
• PSORT
• The end of the output gives the percentage likelihood of each subcellular localization.

3) Signal peptides:
• Proteins destined for secretion, operation within the endoplasmic reticulum or lysosomes, and many transmembrane proteins, are synthesized with leading (N-terminal) signal peptides of 13 – 36 residues.
• The SignalP WWW server can be used to predict the presence and location of signal peptide cleavage sites in your proteins.
• It is useful to know whether your protein has a signal peptide, as this indicates that it may be secreted from the cell.
• Proteins in their active form will have had their signal peptides removed.

4) Transmembrane domains:
• The TMpred program predicts membrane-spanning regions and their orientation.
• The algorithm is based on the statistical analysis of TMbase, a database of naturally occurring transmembrane proteins.
• The presence of transmembrane domains is an indication that the protein is membrane-bound, for example located on the cell surface.

5) Post-translational modifications:
• After translation has occurred, proteins may undergo a number of post-translational modifications.
• These can include cleavage of the pro-region to release the active protein, removal of the signal peptide, and numerous covalent modifications such as acetylations, glycosylations, hydroxylations, methylations and phosphorylations.
• Post-translational modifications may alter the molecular weight of your protein and thus its position on a gel.
• Many programs are available for predicting the presence of post-translational modifications; we will take a look at one for the prediction of O-glycosylation sites in mammalian proteins.
• These programs work by looking for consensus sites; finding a site does not mean that a modification definitely occurs there.

6) Motifs and Domains:
• Motifs and domains give you information on the function of your protein.
• Search the protein against one of the motif or profile databases.
• ProfileScan allows you to search both the Prosite and Pfam databases simultaneously.

7) Secondary Structure Prediction:
• WHY:
  – If protein structure, even secondary structure, can be accurately predicted from the now abundantly available gene and protein sequences, those sequences become immensely more valuable for drug design and for understanding the genetic basis of disease, the role of protein structure in enzymatic, structural and signal transduction functions, and basic physiology from the molecular to the cellular to the fully systemic level.
• JPRED works by combining a number of modern, high-quality prediction methods to form a consensus.

Secondary Structure Prediction
• Essentially, protein secondary structure consists of 3 major conformations:
  – α helix
  – β pleated sheet
  – coil conformation

Sequence Comparisons
• The comparison of DNA sequences is the most used method in bioinformatics:
  – annotation of new nucleotide and protein sequences
  – construction of protein structures
  – design and analysis of bioinformatic and biological experiments.
• Nature acts conservatively, it does not develop a new kind of