Sequence Analysis
Introduction to Bioinformatics BIMMS December 2015
Gabriel Teku Department of Experimental Medical Science Faculty of Medicine Lund University
Sequence analysis
Part 1 • Sequence analysis: general introduction • Sequence features • Motifs and Domains Part 2 • Galaxy • EMBOSS • Bioinformatics software for sequence analysis Sequence analysis
Part 1 • Sequence analysis: general introduction • Sequence features • Motifs and Domains Sequence analysis: definition
… refers to the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution...
[http://en.wikipedia.org/wiki/Sequence_analysis] Quick sequence analysis example
1. Obtain the protein sequence encoded by Human elastase gene from Uniprot, P08246
2. Obtain the CDS sequence for the protein.
http://www.ebi.ac.uk/Tools/st
1. Translate the CDS sequence obtained above
http://www.ebi.ac.uk/Tools/st Quick sequence analysis example
4. Compare the translated CDS to the protein sequence obtained from 1 above.
http://www.ebi.ac.uk/Tools/msa/clustalo/ Quick sequence analysis example
4. Compare the translated CDS to the protein sequence obtained from 1 above.
http://www.ebi.ac.uk/Tools/msa/clustalo/ Types of sequence analysis
Searching databases
Sequence alignments
Feature analyses Feature analysis
Part 1 • General introduction • Feature analyses • Motifs and Domains What is a feature
Sequence features are groups of nucleotides or amino acids that confer certain characteristics upon a gene or protein, and may be important for its overall function.
http://www.ebi.ac.uk/Tools/st Protein features Gene features Exercise on features
1. Explore the features along the protein P08246
within UniProt
2. View the protein’s structure from pdb by following
the 3D structure link for 1h1b. Quick exercise on motifs and domains
1. Identify the functional motif(s) of the protein P08246 Use PROSITE link from Uniprot → Family & Domains 2. What is the motif as represented by the database entry Sequence analysis
Part 1 • General introduction • Features • Motifs and Domains Motifs
• Short, conserved sequence patterns
• Associated to specific function(s)
Binding site
Active site
• ~ 10 - 30 amino acids
• Prosite Motifs Motifs: prosite
From CDS to protein sequence
Statistically significant motifs
Functional motifs
Protein family by virtue of similar functional sites Quick exercise on scanning for motifs
1. Use the protein sequence of the gene ELANE to scan
prosite for motifs and domains.
2. Compare the results with that of the previous
exercise. Motifs: prosite
Methodology
• Pattern development
Pattern from literature
Profiles Pattern development
• Based on signature patterns
• Sensitivity
• Specificity Pattern development
• Literature curated patterns
published
curated
tested against Swiss-Prot for specificity Pattern development
• New patterns
start with review article
alignment of proteins from article
focus on biologically important regions
create core pattern Pattern development
• New patterns (contd)
Search Swiss-Prot using core sites
Retain/discard core pattern
Refine core pattern and repeat search Patterns
• Prosite syntax for patterns:
• one-letter codes for amino acids, e.g. G=Gly
• elements separated by a hyphen, “-”
• “X” used where any amino acid is accepted, Patterns
Prosite syntax for patterns contd: • Ambiguities indicated by [ ], e.g. [AG] means Ala or Gly,
• Amino acids that are not accepted at a given position are listed between curly braces, “{ }”, e.g. {AG} means any amino acid except Ala and Gly, Patterns
Prosite syntax for patterns contd: • repetitions are placed between braces,“( )”, e.g. [AG](2,4) means Ala or Gly between 2 and 4 times, • a pattern is anchored to the N-terminal or C-terminal by “<“ and “>”, respectively. G H E G V G K V V K L G A G A
G H E K K G Y F E D R G P S A
G H E G Y G G R S R G G G Y S
G H E F E G P K G C G A L Y I
G H E L R G T T F M P A L E C
G H E G V G K V V K L G A G A
K K Y F E D R A P S S
F Y G R S R G G Y I
L E P K G C P L E C
R T T F M G-H-E-X(2)-G –X(5)-[GA]-X(3) Quick exercise
Interpret the motif you obtained from the previous exercise. Motifs: prosite
Methodology
• Pattern development
Pattern from literature
New patterns
• Profiles Profiles
Popular approaches
• position weight matrix
• HMM Position weight matrix
1 2 3 4 5 6
1 A T G T C G
2 A A G A C T
3 T A C T C A
1 2 3 4 5 6 Overall 4 C G G A G Gfreq. Pos. 5 A A C C T G
A 0.6 0.6 - 0.4 - 0.2 0.30
T 0.2 0.2 - 0.4 0.2 0.2 0.20
G - 0.2 0.6 - 0.2 0.6 0.27
C 0.2 - 0.4 0.2 0.6 - 0.23 1 2 3 4 5 6 Overall freq. Pos.
A 0.6 0.6 - 0.4 - 0.2 0.30
T 0.2 0.2 - 0.4 0.2 0.2 0.20
G - 0.2 0.6 - 0.2 0.6 0.27
C 0.2 - 0.4 0.2 0.6 - 0.23 1 2 3 4 5 6 Overall freq. Pos.
A 2.0 2.0 - 1.33 - 0.67 0.30
T 1.0 1.0 - 2.0 1.0 1.0 0.20
G - 0.74 2.22 - 0.74 2.22 0.27
C 0.87 - 1.74 0.87 2.61 - 0.23 1 2 3 4 5 6 Overall freq. Pos.
A 2.0 2.0 - 1.33 - 0.67 0.30
T 1.0 1.0 - 2.0 1.0 1.0 0.20
G - 0.74 2.22 - 0.74 2.22 0.27
C 0.87 - 1.74 0.87 2.61 - 0.23 1 2 3 4 5 6 Pos.
A 1.0 1.0 - 0.41 - -0.58
T 0.0 0.0 - 1.0 0.0 0.0
G - -0.43 1.15 - -0.43 1.15
C -0.2 - 0.8 -0.2 1.38 - 1 2 3 4 5 6 Pos.
A 1.0 1.0 - 0.41 - -0.58
T 0.0 0.0 - 1.0 0.0 0.0
G - -0.43 1.15 - -0.43 1.15
A A C T C G Sum of logs = 6.33 C -0.2 - 0.8 -0.2 1.38 -
1 2 3 4 5 6 Pos.
A 1.0 1.0 - 0.41 - -0.58
T 0.0 0.0 - 1.0 0.0 0.0
G - -0.43 1.15 - -0.43 1.15
C -0.2 - 0.8 -0.2 1.38 - Profile
• Multiple sequence alignments with gaps
• Gap penalties
• Profile = PSSM that includes gap penalties
• Fine tuning gap parameters to achieve good profiles Building a profile: PSI-BLAST Query sequence
BLAST
MSA
Profile A C B E 1 2 3 ...
BLAST Additional homologs Iterate process Incorporated profile
A C B E New profile 1 2 3 ... MEME Suite Example Quick exercise
1. BLAST the protein P08246 against the Uniprot proteins. 2. Select the first 5 hits and download the sequences in fasta format 3. Launch the MEME program at http://meme-suite.org/ 4. Using the downloaded sequence file above, search for possible motifs using the MEME program. 5. Compare the results to that from Prosite. 6. Leave the results open for later. Profiles from Hidden Markov Models
• More efficient
• From speech recognition
• Based on Markov Models
• Statistical approach Some motif resources
• PROSITE
• PRINTS
• SMART
• InterPro http://www.ebi.ac.uk/interpro/about.html Domains
• Longer than motifs • conserved sequence patterns • Independent structural and functional unit • Average length, 100 aa • May (not) include motifs along boundries Domains
• HMM applied in domain identification due to its robustness. • Some domain databases include • Pfam-A • Pfam-B • Prodom • MEME suite Quick exercise
1. Identify the domain(s) of the P08246 protein.
2. Explain how you accomplished the task. PART 2
• Galaxy
• EMBOSS
• Bioinformatics software for sequence analysis
• Open source tools
• Commercial softwares Galaxy
• https://usegalaxy.org/
• One-stop shop
• from single sequence to NGS
• Open source
Sequence Analysis Introduction Introduction to galaxy
http://galaxy.bmc.lu.se/ Introduction to galaxy Introduction to galaxy
Which coding exon has the highest number of single nucleotide polymorphisms (SNPs) on chromosome 22? Introduction to galaxy Galaxy tutorial
Register
Login
Familiarize Galaxy tutorial
https://github.com/nekrut/galaxy/wiki/Galaxy101-1 Galaxy tutorial: demo and exercise
1. Complete the galaxy 101 tutorial.
2. Share the final workflow.
3. Briefly describe the workflow in your own words.
4. Re-use the workflow, but this time; choose All SNPs
dataset as feature. Galaxy Demo & Exercise EMBOSS
The European Molecular Biology Open Software Suite
Large user community
Available on the web, for many OS, servers and stand-alone
If you know how to use one, then you know how to use all
Mature and stable
Sequence Analysis Introduction EMBOSS What is it good for? Sequence alignment Database search with sequence patterns Motif identification and domain analysis Nucleotide sequence pattern analysis
Sequence Analysis Introduction EMBOSS FROM SOURCEFORGE
http://emboss.sourceforge.net/ EMBOSS programs within galaxy Many other portals http://www.ebi.ac.uk/Tools/emboss/ http://emboss.bioinformatics.nl/ http://imed.med.ucm.es/EMBOSS/ http://www.bioinformatics2.wsu.edu/emboss/ http://pro.genomics.purdue.edu/emboss/ Quick CpG islands background for next exercise
Region of high density CG dinucleotides along the DNA 200 – 500 nucleotides, enriched with CG Enriched CpG nucleotides The p in CpG islands represent the phosphodiester bond between the C and G nucleotides Mostly occur within the promoter of eukaryotic genes Lock gene in an inactive state Helps identify the transcription start site of a gene Exercise
From galaxy, emboss toolshed:
List all tools that analyze CpG islands
On the search field, type in “cpg”
Access the documentation for two of these tools,
Preferably cpgplot and newcpgreport Exercise • Write down the expected result
• Run the tools on the human gene ELANE
• Interpret the results. Software for sequence analysis
• Open source tools
• Commercial tools Software for sequence analysis
• Websites with links to open source tools and services
http://www.ebi.ac.uk/services
http://www.ncbi.nlm.nih.gov/guide/sequence-analysis/
http://bioinformatics.ca/links_directory/
http://bioinformatics.ca/links_directory/category/dna/structure-and-sequence-
feature-detection Software for sequence analysis
Open source GNU general public licenses (GNU GPL) • Continuous evolution of code • Community supported Examples • EMBOSS • mothur Software for sequence analysis Software for sequence analysis Software for sequence analysis
Commercial tools Proprietary Expensive licenses Examples • Geneious (Biomatters Ltd., Auckland, New Zealand) • CLC Genomics Workbench (CLC bio, Aarhus, Denmark) • Sequencher (Gene Codes Corporation, MI, USA) Commercial software and feature offerings
Software Company Cost (USD)a NGS analysesc Evolutionary analysesd Database searchinge Workflows
Avadis NGS Strand Scientific Intelligence $4500 ✓ ✗ ✗ ✓
CLC Genomics Workbench ClC bio, Qiagen $5500 ✓ ✓ ✓ ✓
CodonCode Aligner CodonCode $720 ✓ ✓ ✗ ✗
Genamics Expression Genamics $295 ✗ ✓ ✓ ✗
Geneious Biomatters $795 ✓ ✓ ✓ ✓
Full Lasergene Suite DNASTAR $5950 ✓ ✓ ✓ ✓
MacVector & Assembler MacVector $300 ✓ ✓ ✓ ✗
NextGENe Softgenetics $4049 ✓ ✗ ✗ ✗
Sequencher Gene Codes $2500 ✓ ✓ ✓ ✗
VectorNTI Advance Life Technologies Smith$600 DR, Brief✗ Bioinform.✓ 2014 Sep 1 ✓ ✓