Protein-DNA Interactions II
Bio 5488

Overview
1. Review of information content and weight matrices
2. Online PWM resources
3. Motif discovery: Greedy algorithm
4. Motif discovery: Gibbs sampling

Information Content

EcoRI    Random   Rap1
GAATTC   GCCTAC   TGTATGGGTG
GAATTC   ACATTC   TGTTCGGATT
GAATTC   TCATTC   TGCATGGGTG
GAATTC   CGACTC   TGTACAGGTG
GAATTC   GAATTC   TGTATGGATG
GAATTC   ATATCG   TGTTCGGGTT
GAATTC   GAAATG   TGTATGGGTG

Information Content

Matrix of Frequencies
A:  0.1  0.7  0.2  0.3  0.4  0.1
C:  0.1  0.1  0.1  0.3  0.2  0.1
G:  0.1  0.1  0.2  0.1  0.2  0.1
T:  0.7  0.1  0.5  0.3  0.2  0.7

aka Relative Entropy, Kullback-Leibler Distance

Obtaining a Weight Matrix
Hertz and Stormo, Bioinformatics (1999)

Online tools: PWMs
JASPAR (jaspar.genereg.net)
• open-access
• curated, non-redundant PWMs for multi-cellular eukaryotes
• hundreds of PWMs
• JASPAR_FAM - metamodels for structural families

Online tools: PWMs
TRANSFAC (www.gene-regulation.com/pub/databases.html/#transfac)
• Free access w/ reduced functionality to professional version
• Public version dates to 2005

AC  M00213
XX
ID  F$RAP1_C
XX
DT  05.09.1995 (created); dbo.
DT  30.11.1995 (updated); ewi.
CO  Copyright (C), Biobase GmbH.
XX
NA  RAP1
XX
DE  yeast repressor/activator protein 1
XX
BF  T00715 RAP1; Species: yeast, Saccharomyces cerevisiae.
XX
PO       A      C      G      T
01    6.89   2.40   0.79   4.74   W
02    8.80   0.00   4.58   1.43   R
03    7.20   6.83   0.79   0.00   M
04   14.82   0.00   0.00   0.00   A
05    0.00  14.82   0.00   0.00   C
06    0.00  14.82   0.00   0.00   C
07    0.00  14.82   0.00   0.00   C
08   14.82   0.00   0.00   0.00   A
09    2.13   0.67   2.87   9.15   T
10   11.92   0.76   1.49   0.64   A
11    0.76  14.06   0.00   0.00   C
12   11.45   3.36   0.00   0.00   A
13    0.79   5.64   0.00   7.10   Y
14    1.64   7.18   1.61   4.39   Y
XX
BA  total weight of sequences: 14.82
XX
CC  consind generated matrix (random_expectation: 0.03)
XX
//

Online tools: PWMs
RegulonDB (regulondb.ccg.unam.mx)
• Bacterial regulons and operons in E. coli
• 68 PWMs (as of 2013)

; Background model
;   Bernoulli model (order=0)
;   Strand undef
;   Background pseudo-frequency 0.01
;   Residue probabilities
;     a 0.29066
;     c 0.20779
;     g 0.20481
;     t 0.29673
a |  7  5  2  6  8 18  2  9  1  2  1  3  0  2  5 11  4
c |  2  0  3  0  2  1  6  2  3  4  4  3  0  2  1  0 15
g |  5  2  7  0  0  0  5  2  0  2  0  6 15  0  0  3  0
t |  6 13  8 14 10  1  7  7 16 12 15  8  5 16 14  6  1
//
a | 0.3 0.3 0.1 0.3 0.4 0.9 0.1 0.4 0.1 0.1 0.1 0.2 0.0 0.1 0.3 0.5 0.2 0.8
c | 0.1 0.0 0.2 0.0 0.1 0.1 0.3 0.1 0.2 0.2 0.2 0.2 0.0 0.1 0.1 0.0 0.7 0.1
g | 0.2 0.1 0.3 0.0 0.0 0.0 0.2 0.1 0.0 0.1 0.0 0.3 0.7 0.0 0.0 0.2 0.0 0.1
t | 0.3 0.6 0.4 0.7 0.5 0.1 0.3 0.3 0.8 0.6 0.7 0.4 0.3 0.8 0.7 0.3 0.1 0.1
//
a |  0.2 -0.1 -1.0  0.0  0.3  1.1 -1.0  0.4 -1.6 -1.0 -1.6 -0.6 -3.0 -1.0 -0.1  0.6 -0.4  1.0
c | -0.7 -3.0 -0.3 -3.0 -0.7 -1.3  0.4 -0.7 -0.3 -0.0 -0.0 -0.3 -3.0 -0.7 -1.3 -3.0  1.2 -1.3
g |  0.2 -0.7  0.5 -3.0 -3.0 -3.0  0.2 -0.7 -3.0 -0.7 -3.0  0.4  1.3 -3.0 -3.0 -0.3 -3.0 -1.3
t |  0.0  0.8  0.3  0.8  0.5 -1.6  0.2  0.2  1.0  0.7  0.9  0.3 -0.2  1.0  0.8  0.0 -1.6 -1.0
//
a |  0.1 -0.0 -0.1  0.0  0.1  1.0 -0.1  0.2 -0.1 -0.1 -0.1 -0.1 -0.0 -0.1 -0.0  0.3 -0.1  0.8
c | -0.1 -0.0 -0.0 -0.0 -0.1 -0.1  0.1 -0.1 -0.0 -0.0 -0.0 -0.0 -0.0 -0.1 -0.1 -0.0  0.9 -0.1
g |  0.0 -0.1  0.2 -0.0 -0.0 -0.0  0.0 -0.1 -0.0 -0.1 -0.0  0.1  0.9 -0.0 -0.0 -0.0 -0.0 -0.1
t |  0.0  0.5  0.1  0.6  0.2 -0.1  0.1  0.1  0.7  0.4  0.7  0.1 -0.0  0.7  0.6  0.0 -0.1 -0.1
//
; Sites 20

Online tools: Logos
WebLogo (weblogo.berkeley.edu)
• Creates logos from multiple sequence alignments
• (not online but useful: seqLogo package for R)

Motif Finding Problem
• A fundamental problem in molecular biology
  – Specific protein-DNA binding
  – Transcription factor binding site recognition
• Statistical definition:
  – Given some sequences, find over-represented substrings (motif discovery)
• Biological example:
  – Given some co-regulated promoters, find a transcription factor binding model
• Many algorithms/programs developed
  – consensus, Gibbs sampling, EM, projection, phylogenetic footprinting, etc.
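
The information content (relative entropy) of the three alignments on the Information Content slide can be computed directly from the aligned sites. The following is a minimal Python sketch, not taken from the lecture: it assumes a uniform background of 0.25 per base, applies no pseudocounts or small-sample correction, and the names `column_frequencies` and `information_content` are illustrative only.

```python
import math

BASES = "ACGT"

# Aligned sites copied from the Information Content slide.
ECORI = ["GAATTC"] * 7
RANDOM = ["GCCTAC", "ACATTC", "TCATTC", "CGACTC", "GAATTC", "ATATCG", "GAAATG"]
RAP1 = ["TGTATGGGTG", "TGTTCGGATT", "TGCATGGGTG", "TGTACAGGTG",
        "TGTATGGATG", "TGTTCGGGTT", "TGTATGGGTG"]


def column_frequencies(sites):
    """Return one {base: frequency} dict per position of equal-length sites."""
    n = len(sites)
    width = len(sites[0])
    freqs = []
    for i in range(width):
        column = [site[i] for site in sites]
        freqs.append({b: column.count(b) / n for b in BASES})
    return freqs


def information_content(freqs, background=None):
    """Sum f * log2(f / p) over all bases and positions, in bits.

    Terms with f == 0 contribute nothing (0 * log 0 is taken as 0).
    """
    if background is None:
        # Assumption: uniform background, not a genome-specific one.
        background = {b: 0.25 for b in BASES}
    total = 0.0
    for column in freqs:
        for b in BASES:
            f = column[b]
            if f > 0:
                total += f * math.log2(f / background[b])
    return total


for name, sites in [("EcoRI", ECORI), ("Random", RANDOM), ("Rap1", RAP1)]:
    ic = information_content(column_frequencies(sites))
    print(f"{name}: {ic:.2f} bits over {len(sites[0])} positions")
```

Under the uniform-background assumption, the seven identical EcoRI sites score 2 bits at each of 6 positions, 12 bits in total; the less conserved Random and Rap1 alignments average below 2 bits per position. A non-uniform background, such as the residue probabilities in the RegulonDB output, can be passed as the `background` argument.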
Motif Finding Algorithms
• Motivation
  – Motif discovery is a problem whose straightforward solution is intractable
• Algorithms utilize different strategies
  – Greedy algorithm
  – Gibbs sampler

The Problem
Consider a set of sequences that are believed to harbor common subsequences (one per sequence) that are similar but not identical. Find the locations of the subsequences and a compact representation of the alignment of subsequences (e.g., a PWM). The locations should be selected so as to maximize the information content of the alignment.

An (intractable) solution (Exhaustive algorithm)
[Figure: k sequences]
Construct every possible combination of alignments and keep the one with the highest information content. Given a motif of width w and k sequences of length l, there are L = (l-w+1) possible locations in each sequence, and L^k alignments to check.

Real-world case
The Data Set: Sequences containing sites for cAMP receptor protein (CRP)

locus      sequence
colel      taatgtttgtgctggtTTTTGTGGCATCGGGCGAGAATagcgcgtggtgtgaaagactgtTTTTTTGATCGTTTTCACAAAAatggaagtccacagtcttgacag
ecoarabop  gacaaaaacgcgtaacAAAAGTGTCTATAATCACGGCAgaaaagtccacattgaTTATTTGCACGGCGTCACACTTtgctatgccatagcatttttatccataag
ecobglrl   acaaatcccaataacttaattattgggatttgttatatataactttataaattcctaaaattacacaaagttaatAACTGTGAGCATGGTCATATTTttatcaat
ecocrp     cacaaagcgaaagctatgctaaaacagtcaggatgctacagtaatacattgatgtactgcatGTATGCAAAGGACGTCACATTAccgtgcagtacagttgatagc
ecocya     acggtgctacacttgtatgtagcgcatctttctttacggtcaatcagcaAGGTGTTAAATTGATCACGTTTtagaccattttttcgtcgtgaaactaaaaaaacc
ecodeop    agtgaaTTATTTGAACCAGATCGCATTAcagtgatgcaaacttgtaagtagatttccttAATTGTGATGTGTATCGAAGTGtgttgcggagtagatgttagaata
ecogale    gcgcataaaaaacggctaaattcttgtgtaaacgattccacTAATTTATTCCATGTCACACTTttcgcatctttgttatgctatggttatttcataccataagcc
ecoilvbpr  gctccggcggggttttttgttatctgcaattcagtacaAAACGTGATCAICCCCTCAATTttccctttgctgaaaaattttccattgtctcccctgtaaagctgt
ecolac     aacgcaatTAATGTGAGTTAGCTCACTCATtaggcaccccaggctttacactttatgcttccggctcgtatgttgtgtggAATTGTGAGCGGATAACAATTTcac
ecomale    acattaccgccaaTTCTGTAACAGAGATCACACAAagcgacggtggggcgtaggggcaaggaggatggaaagaggttgccgtataaagaaactagagtccgttta
ecomalk    ggaggaggcgggaggatgagaacacggcTTCTGTGAACTAAACCGAGGTCatgtaaggaatttcgtgatgttgcttgcaaaaatcgtggcgattttatgtgcgca
ecomalt    gatcagcgtcgttttaggtgagttgttaataaagatttggAATTGTGACACAGTGCAAATTCagacacataaaaaaacgtcatcgcttgcattagaaaggtttct
ecoompa    gctgacaaaaaagattaaacataccttatacaagacttttttttcatATGCCTGACGGAGTTCACACTTgtaagttttcaactacgttgtagactttacatcgcc
ecotnaa    ttttttaaacattaaaattcttacgtaatttataatctttaaaaaaagcatttaatattgctccccgaacGATTGTGATTCGATTCACATTTaaacaatttcaga
ecouxul    cccatgagagtgaaatTGTTGTGATGTGGTTAACCCAAttagaattcgggattgacatgtcttaccaaaaggtagaacttatacgccatcteatccgatgcaagc
pbr-p4     ctggcttaactatgcggcatcagagcagattgtactgagagtgcaccatatgCGGTGTGAAATACCGCACAGATgcgtaaggagaaaataccgcatcaggcgctc
trn9cat    CTGTGACGGAAGATCACTTCgcagaataaataaatcctggtgtccctgttgataccgggaagccctgggccaacttttggcgaAAATGAGACGTTGATCGGCACG
tdc        gatttttatactttaacttgttgatatttaaaggtatttaattgtaataacgatactctggaaagtattgaaagttaATTTGTGAGTGGTCGCACATATcctgtt

For this case, there are 18 sequences of length 105 bp and we are looking for a motif of width 20 bp. There are 86 different 20 bp subsequences per sequence and 86^18, or ~7 x 10^34, alignments to check.
Stormo and Hartzell, Proc. Natl. Acad. Sci. (1989)

Greedy Algorithm: The Idea
The exhaustive approach is not feasible. A compromise: solve a series of smaller problems in a step-wise exhaustive fashion. The trick is to start with a smaller set of sequences (k = 2) that we can solve exactly, and incorporate the rest of the sequences one by one.

Greedy Algorithm
The algorithm:
1. From the first sequence, generate scoring matrices for all possible subsequence locations. Initially the counts will be 1 or 0.
2. The next sequence is scanned with all of these matrices and the best scoring location identified.
3. The matrices are updated by folding in information from the newly identified sites in the next sequence.
4. Repeat steps 2 and 3 with the updated matrices and scanning the next sequence until they have all been processed.
The details are in the matrix update: you can discard low-information-content matrices, keep only the highest X by score, etc. Here, all matrices are kept, and updated with the highest scoring site in the sequence under consideration. In case of a tie, keep both.

Greedy Algorithm Example
• Toy problem: 3 sequences of length 7 bp, looking for a 6 bp motif
• Extract all 6 bp subsequences from the 1st sequence. Create matrices for each.
• For each matrix, scan the second sequence, align to the best match, make new matrices.
• Scan the 3rd sequence and align to the best match. Update matrices.
• No more sequences, take the best by I.C.
Hertz, Hartzell III, Stormo, Bioinformatics (1990)

Greedy Algorithm
How to think about this?
The exhaustive treatment was too expensive, but this algorithm is a step-wise approximation. The first step performs a full scan of each motif alignment on the first sequence against all alignments on the second. As we step-wise incorporate additional sequences, we carry along a finite amount of information (matrices), so the requirements don’t explode.

Greedy Algorithm

locus      sequence
colel      taatgtttgtgctggtTTTTGTGGCATCGGGCGAGAATagcgcgtggtgtgaaagactgtTTTTTTGATCGTTTTCACAAAAatggaagtccacagtcttgacag
ecoarabop  gacaaaaacgcgtaacAAAAGTGTCTATAATCACGGCAgaaaagtccacattgaTTATTTGCACGGCGTCACACTTtgctatgccatagcatttttatccataag
ecobglrl   acaaatcccaataacttaattattgggatttgttatatataactttataaattcctaaaattacacaaagttaatAACTGTGAGCATGGTCATATTTttatcaat
ecocrp     cacaaagcgaaagctatgctaaaacagtcaggatgctacagtaatacattgatgtactgcatGTATGCAAAGGACGTCACATTAccgtgcagtacagttgatagc
ecocya     acggtgctacacttgtatgtagcgcatctttctttacggtcaatcagcaAGGTGTTAAATTGATCACGTTTtagaccattttttcgtcgtgaaactaaaaaaacc
ecodeop    agtgaaTTATTTGAACCAGATCGCATTAcagtgatgcaaacttgtaagtagatttccttAATTGTGATGTGTATCGAAGTGtgttgcggagtagatgttagaata
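
The step-wise greedy procedure described above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the implementation from the cited papers: the slides say only that the "best scoring" location is chosen, so here the score is assumed to be the information content of the alignment after adding the candidate site, with a uniform background and no pseudocounts. All alignments are kept and ties are kept as separate alignments, as on the slide. The function names and the toy sequences are hypothetical; the toy sequences are not the ones from the lecture figure.

```python
import math

BASES = "ACGT"


def info_content(sites):
    """Information content (bits) of aligned sites, uniform background."""
    n = len(sites)
    total = 0.0
    for i in range(len(sites[0])):
        for b in BASES:
            f = sum(site[i] == b for site in sites) / n
            if f > 0:
                total += f * math.log2(f / 0.25)
    return total


def greedy_motif_search(seqs, w):
    """Return (information content, sites) of the best alignment found."""
    seqs = [s.upper() for s in seqs]
    first = seqs[0]
    # Step 1: one single-site alignment per window of the first sequence.
    alignments = [[first[i:i + w]] for i in range(len(first) - w + 1)]
    # Steps 2-4: fold in the remaining sequences one at a time.
    for seq in seqs[1:]:
        windows = [seq[i:i + w] for i in range(len(seq) - w + 1)]
        extended = []
        for sites in alignments:
            # Assumed score: information content after adding the window.
            scored = [(info_content(sites + [win]), win) for win in windows]
            best = max(score for score, _ in scored)
            # In case of a tie, keep every best-scoring extension.
            kept = {win for score, win in scored if math.isclose(score, best)}
            extended.extend(sites + [win] for win in sorted(kept))
        alignments = extended
    # No more sequences: take the best alignment by information content.
    return max((info_content(a), a) for a in alignments)


# Hypothetical toy problem: 3 sequences of length 7 bp, 6 bp motif.
toy = ["ATGTGAC", "TGTGACG", "CTGTGAC"]
ic, sites = greedy_motif_search(toy, 6)
print(ic, sites)  # 12.0 ['TGTGAC', 'TGTGAC', 'TGTGAC']
```

Each alignment is extended independently, so without ties the number of alignments carried forward stays at the number of windows in the first sequence. Keeping ties can increase that number, which is one reason to prune low-scoring matrices on larger inputs such as the CRP data set.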