WebLogo Documentation, Release 3.7.9.dev2+g7eab5d1.d20210504

Gavin E. Crooks
May 04, 2021

Contents:

1 Distribution and Modification
  1.1 WebLogo API
  1.2 Alphabets and Sequences
  1.3 Sequence IO
    1.3.1 Sequence file reading and writing
    1.3.2 Supported File Formats
  1.4 Logo Data, Options, and Format
  1.5 Logo Formatting
Python Module Index
Index

WebLogo is software designed to make the generation of sequence logos easy and painless. A sequence logo is a graphical representation of an amino acid or nucleic acid multiple sequence alignment. Each logo consists of stacks of symbols, one stack for each position in the sequence. The overall height of the stack indicates the sequence conservation at that position, while the height of symbols within the stack indicates the relative frequency of each amino or nucleic acid at that position. In general, a sequence logo provides a richer and more precise description of, for example, a binding site, than would a consensus sequence.

WebLogo features a web interface (http://weblogo.threeplusone.com) and a command line interface that provides more options and control (http://weblogo.threeplusone.com/manual.html#CLI). These pages document the API.

The main WebLogo webserver is located at http://weblogo.threeplusone.com

Please consult the manual for installation instructions and more information (also located in the weblogolib/htdocs subdirectory):
http://weblogo.threeplusone.com/manual.html

For help on the command line interface run

    weblogo --help

To build a simple logo run

    weblogo < cap.fa > logo0.eps

To run as a standalone webserver at localhost:8080

    weblogo --serve

CHAPTER 1
Distribution and Modification

This package is distributed under the new BSD Open Source License. Please see the LICENSE.txt file for details on copyright and licensing. The WebLogo source code can be downloaded from https://github.com/WebLogo/weblogo

WebLogo requires Python 3.6 or 3.7. Generating logos in PDF or bitmap graphics formats requires that the ghostscript program ‘gs’ be installed. Scalable Vector Graphics (SVG) format also requires the program ‘pdf2svg’.

1.1 WebLogo API

To create a logo in python code:

    >>> from weblogo import *
    >>> fin = open('cap.fa')
    >>> seqs = read_seq_data(fin)
    >>> logodata = LogoData.from_seqs(seqs)
    >>> logooptions = LogoOptions()
    >>> logooptions.title = "A Logo Title"
    >>> logoformat = LogoFormat(logodata, logooptions)
    >>> eps = eps_formatter(logodata, logoformat)

1.2 Alphabets and Sequences

Alphabetic sequences and associated tools and data.

Seq is a subclass of a python string with additional annotation and an alphabet. The characters in the string must be contained in the alphabet. Various standard alphabets are provided.

Classes
    Alphabet -- A subset of non-null ascii characters
    Seq -- An alphabetic string
    SeqList -- A collection of Seq's

Alphabets
    o generic_alphabet -- A generic alphabet. Any printable ASCII character.
    o protein_alphabet -- IUPAC/IUB Amino Acid one letter codes.
    o nucleic_alphabet -- IUPAC/IUB Nucleic Acid codes 'ACGTURYSWKMBDHVN-'
    o dna_alphabet -- Same as nucleic_alphabet, with 'U' (Uracil) an alternative for 'T' (Thymidine).
    o rna_alphabet -- Same as nucleic_alphabet, with 'T' (Thymidine) an alternative for 'U' (Uracil).
    o reduced_nucleic_alphabet -- All ambiguous codes in 'nucleic_alphabet' are alternative to 'N' (aNy)
    o reduced_protein_alphabet -- All ambiguous ('BZJ') and non-canonical amino acids codes ('U', Selenocysteine and 'O', Pyrrolysine) in 'protein_alphabet' are alternative to 'X'.
    o unambiguous_dna_alphabet -- 'ACGT'
    o unambiguous_rna_alphabet -- 'ACGU'
    o unambiguous_protein_alphabet -- The twenty canonical amino acid one letter codes, in alphabetic order, 'ACDEFGHIKLMNPQRSTVWY'

Amino Acid Codes:

    Code  Alt.  Meaning
    -----------------------------------------
    A           Alanine
    B           Aspartic acid or Asparagine
    C           Cysteine
    D           Aspartate
    E           Glutamate
    F           Phenylalanine
    G           Glycine
    H           Histidine
    I           Isoleucine
    J           Leucine or Isoleucine
    K           Lysine
    L           Leucine
    M           Methionine
    N           Asparagine
    O           Pyrrolysine
    P           Proline
    Q           Glutamine
    R           Arginine
    S           Serine
    T           Threonine
    U           Selenocysteine
    V           Valine
    W           Tryptophan
    Y           Tyrosine
    Z           Glutamate or Glutamine
    X     ?     any
    *           translation stop
    -     .~    gap

Nucleotide Codes:

    Code  Alt.  Meaning
    ------------------------------------------------------
    A           Adenosine
    C           Cytidine
    G           Guanine
    T           Thymidine
    U           Uracil
    R           G A (puRine)
    Y           T C (pYrimidine)
    K           G T (Ketone)
    M           A C (aMino group)
    S           G C (Strong interaction)
    W           A T (Weak interaction)
    B           G T C (not A) (B comes after A)
    D           G A T (not C) (D comes after C)
    H           A C T (not G) (H comes after G)
    V           G C A (not T, not U) (V comes after U)
    N     X?    A G C T (aNy)
    -     .~    A gap

Refs:
    http://www.chem.qmw.ac.uk/iupac/AminoAcid/A2021.html
    http://www.chem.qmw.ac.uk/iubmb/misc/naseq.html

Authors: GEC 2004,2005

class weblogo.seq.Alphabet
    An ordered subset of printable ascii characters.

    Status: Beta
    Authors: GEC 2005

    alphabetic(string)
        True if all characters of the string are in this alphabet.

    chr(n)
        The n’th character in the alphabet (zero indexed) or 0

    chrs(sequence_of_ints)
        Convert a sequence of ordinals into an alphabetic string.

    letters()
        Letters of the alphabet as a string.
    normalize(string)
        Normalize an alphabetic string by converting all alternative symbols to the canonical equivalent in ‘letters’.

    ord(c)
        The ordinal position of the character c in this alphabet, or 255 if no such character.

    ords(string)
        Convert an alphabetic string into a byte array of ordinals.

    static which(seqs, alphabets=None)
        Returns the most appropriate unambiguous protein, RNA or DNA alphabet for a Seq or SeqList. If a list of alphabets is supplied, then the best alphabet is selected from that list.

        The heuristic is to count the occurrences of letters for each alphabet and downweight longer alphabets by the log of the alphabet length. Ties go to the first alphabet in the list.

class weblogo.seq.Seq
    An alphabetic string. A subclass of “str” consisting solely of letters from the same alphabet.

    alphabet -- A string or Alphabet of allowed characters.
    name -- A short string used to identify the sequence.
    description -- A string describing the sequence

    Authors: GEC 2005

    back_translate()
        Translate a protein sequence back into coding DNA, using the standard genetic code. See weblogo.transform.GeneticCode for details and more options.

    complement()
        Returns complementary nucleic acid sequence.

    join(iterable) -> str
        Return a string which is the concatenation of the strings in the iterable. The separator between elements is S.

    lower()
        Return a lower case copy of the sequence.

    mask(letters='abcdefghijklmnopqrstuvwxyz', mask='X')
        Replace all occurrences of letters with the mask character. The default is to replace all lower case letters with ‘X’.

    ords()
        Convert sequence to an array of integers in the range [0, len(alphabet) )

    remove(delchars)
        Return a new alphabetic sequence with all characters in ‘delchars’ removed.

    reverse()
        Return the reversed sequence. Note that this method returns a new object, in contrast to the in-place reverse() method of list objects.
    reverse_complement()
        Returns reversed complementary nucleic acid sequence (i.e. the other strand of a DNA sequence.)

    tally(alphabet=None)
        Counts the occurrences of alphabetic characters.

        Arguments:
            alphabet -- an optional alternative alphabet
        Returns: A list of character counts in alphabetic order.

    tostring()
        Converts Seq to a raw string.

    translate()
        Translate a nucleotide sequence to a polypeptide using full IUPAC ambiguities in DNA/RNA and amino acid codes, using the standard genetic code. See weblogo.transform.GeneticCode for details and more options.

    upper()
        Return an upper case copy of the sequence.

    word_count(k, alphabet=None)
        Return a count of all subwords in the sequence.

        >>> from weblogo.seq import *
        >>> Seq("abcabc").word_count(3)
        [('abc', 2), ('bca', 1), ('cab', 1)]

    words(k, alphabet=None)
        Return an iteration over all subwords of length k in the sequence. If an optional alphabet is provided, only words from that alphabet are returned.

        >>> list(Seq("abcabc").words(3))
        ['abc', 'bca', 'cab', 'abc']

weblogo.seq.rna(string)
    Create an alphabetic sequence representing a stretch of RNA.

weblogo.seq.dna(string)
    Create an alphabetic sequence representing a stretch of DNA.

weblogo.seq.protein(string)
    Create an alphabetic sequence representing a stretch of polypeptide.

class weblogo.seq.SeqList(alist=[], alphabet=None, name=None, description=None)
    A list of sequences.

    isaligned()
        Are all sequences of the same length and alphabet?

    ords(alphabet=None)
        Convert sequence list into a 2D array of ordinals.

    profile(alphabet=None)
        Counts the occurrences of characters in each column.

        Returns: Motif(counts, alphabet)

    tally(alphabet=None)
        Counts the occurrences of alphabetic characters.

        Parameters:
            alphabet -- an optional alternative alphabet
        Returns: A list of character counts in alphabetic order.
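
The behaviour documented above for Seq.words, Seq.word_count and SeqList.profile can be illustrated with a short plain-Python sketch. This is not WebLogo's implementation and does not import weblogo: the function names words, word_count and column_profile below are hypothetical stand-ins that operate on ordinary strings, the optional alphabet filtering of the real methods is omitted, and the sorted ordering in word_count is an assumption made to match the example output shown above.

```python
from collections import Counter


def words(seq, k):
    """Yield every overlapping length-k subword of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def word_count(seq, k):
    """Return (word, count) pairs for all length-k subwords.

    The pairs are sorted by word; this ordering is an assumption chosen
    to agree with the documented example output.
    """
    return sorted(Counter(words(seq, k)).items())


def column_profile(seqs, alphabet):
    """Count how often each letter of `alphabet` occurs in each column.

    `seqs` is a list of equal-length strings (an alignment). The result
    holds one list of counts per column, ordered like `alphabet`.
    Letters outside `alphabet` are simply not counted.
    """
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must all have the same length")
    profile = []
    for column in zip(*seqs):  # iterate over alignment columns
        counts = Counter(column)
        profile.append([counts[letter] for letter in alphabet])
    return profile


if __name__ == "__main__":
    print(list(words("abcabc", 3)))
    print(word_count("abcabc", 3))
    print(column_profile(["ACGT", "ACGA", "TCGT"], "ACGT"))
```

In this sketch the per-column counts play the role of the counts that profile() is documented to return inside a Motif; the relative frequencies that set symbol heights within a logo stack would be these counts divided by the number of sequences.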