WebLogo Documentation Release 3.7.9.dev2+g7eab5d1.d20210504

Gavin E. Crooks

May 04, 2021

Contents:

1 Distribution and Modification
  1.1 WebLogo API
  1.2 Alphabets and Sequences
  1.3 Sequence IO
    1.3.1 Sequence file reading and writing
    1.3.2 Supported File Formats
  1.4 Logo Data, Options, and Format
  1.5 Logo Formatting



WebLogo is software designed to make the generation of sequence logos easy and painless. A sequence logo is a graphical representation of an amino acid or nucleic acid multiple sequence alignment. Each logo consists of stacks of symbols, one stack for each position in the sequence. The overall height of the stack indicates the sequence conservation at that position, while the height of symbols within the stack indicates the relative frequency of each amino or nucleic acid at that position. In general, a sequence logo provides a richer and more precise description of, for example, a binding site, than would a consensus sequence.

WebLogo features a web interface (http://weblogo.threeplusone.com) and a command line interface that provides more options and control (http://weblogo.threeplusone.com/manual.html#CLI). These pages document the API.

The main WebLogo webserver is located at http://weblogo.threeplusone.com

Please consult the manual for installation instructions and more information (also located in the weblogolib/htdocs subdirectory): http://weblogo.threeplusone.com/manual.html

For help on the command line interface, run

    weblogo --help

To build a simple logo, run

    weblogo < cap.fa > logo0.eps

To run as a standalone webserver at localhost:8080, run

    weblogo --serve

1 Distribution and Modification

This package is distributed under the new BSD Open Source License. Please see the LICENSE.txt file for details on copyright and licensing.

The WebLogo source code can be downloaded from https://github.com/WebLogo/weblogo

WebLogo requires Python 3.6 or 3.7. Generating logos in PDF or bitmap graphics formats requires that the ghostscript program ‘gs’ be installed. Scalable Vector Graphics (SVG) format also requires the program ‘pdf2svg’.

1.1 WebLogo API

To create a logo in Python code:

>>> from weblogo import *
>>> fin = open('cap.fa')
>>> seqs = read_seq_data(fin)
>>> logodata = LogoData.from_seqs(seqs)
>>> logooptions = LogoOptions()
>>> logooptions.logo_title = "A Logo Title"
>>> logoformat = LogoFormat(logodata, logooptions)
>>> eps = eps_formatter(logodata, logoformat)

1.2 Alphabets and Sequences

Alphabetic sequences and associated tools and data.

Seq is a subclass of a Python string with additional annotation and an alphabet. The characters in the string must be contained in the alphabet. Various standard alphabets are provided.

Classes

Alphabet -- A subset of non-null ascii characters
Seq -- An alphabetic string
SeqList -- A collection of Seq's


Alphabets

o generic_alphabet -- A generic alphabet. Any printable ASCII character.
o protein_alphabet -- IUPAC/IUB Amino Acid one letter codes.
o nucleic_alphabet -- IUPAC/IUB Nucleic Acid codes 'ACGTURYSWKMBDHVN-'
o dna_alphabet -- Same as nucleic_alphabet, with 'U' (Uracil) an alternative for 'T' (Thymidine).
o rna_alphabet -- Same as nucleic_alphabet, with 'T' (Thymidine) an alternative for 'U' (Uracil).
o reduced_nucleic_alphabet -- All ambiguous codes in 'nucleic_alphabet' are alternative to 'N' (aNy)
o reduced_protein_alphabet -- All ambiguous ('BZJ') and non-canonical amino acid codes ('U', Selenocysteine and 'O', Pyrrolysine) in 'protein_alphabet' are alternative to 'X'.
o unambiguous_dna_alphabet -- 'ACGT'
o unambiguous_rna_alphabet -- 'ACGU'
o unambiguous_protein_alphabet -- The twenty canonical amino acid one letter codes, in alphabetic order, 'ACDEFGHIKLMNPQRSTVWY'
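To illustrate what "an alternative for" means in the list above, the following minimal sketch maps alternative symbols onto canonical letters with a translation table. The helper name `make_normalizer` and the letter/alternative pairs passed to it are assumptions made for this example; it is not WebLogo's own implementation of Alphabet.normalize.

```python
def make_normalizer(letters, alternatives):
    """Return a function mapping alternative symbols to canonical letters.

    letters      -- canonical symbols, e.g. 'ACGT'
    alternatives -- (alternative, canonical) pairs, e.g. [('U', 'T')]
    Both arguments are hypothetical; WebLogo's own alphabets may differ.
    """
    for alt, canonical in alternatives:
        if canonical not in letters:
            raise ValueError(f"{canonical!r} is not a canonical letter")
    table = str.maketrans({alt: canonical for alt, canonical in alternatives})

    def normalize(string):
        normalized = string.translate(table)
        unknown = set(normalized) - set(letters)
        if unknown:
            raise ValueError(f"not in alphabet: {sorted(unknown)}")
        return normalized

    return normalize


# DNA-like alphabet in which 'U' is accepted as an alternative for 'T'.
normalize_dna = make_normalizer("ACGT", [("U", "T")])
print(normalize_dna("ACGU"))  # ACGT
```

Building the translation table once and reusing it keeps each call to a single pass over the string.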

Amino Acid Codes:

Code  Alt.  Meaning
-----------------------------------
A           Alanine
B           Aspartic acid or Asparagine
C           Cysteine
D           Aspartate
E           Glutamate
F           Phenylalanine
G           Glycine
H           Histidine
I           Isoleucine
J           Leucine or Isoleucine
K           Lysine
L           Leucine
M           Methionine
N           Asparagine
O           Pyrrolysine
P           Proline
Q           Glutamine
R           Arginine
S           Serine
T           Threonine
U           Selenocysteine
V           Valine
W           Tryptophan
Y           Tyrosine
Z           Glutamate or Glutamine
X     ?     any
*           translation stop
-     .~    gap

Nucleotide Codes:

Code  Alt.  Meaning
-----------------------------------
A           Adenosine
C           Cytidine
G           Guanine
T           Thymidine
U           Uracil
R           G A (puRine)
Y           T C (pYrimidine)
K           G T (Ketone)
M           A C (aMino group)
S           G C (Strong interaction)
W           A T (Weak interaction)
B           G T C (not A) (B comes after A)
D           G A T (not C) (D comes after C)
H           A C T (not G) (H comes after G)
V           G C A (not T, not U) (V comes after U)
N     X?    A G C T (aNy)
-     .~    A gap
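The ambiguity codes in the nucleotide table can be read as sets of allowed bases. The sketch below transcribes the table into a dictionary and uses it to test whether a DNA sequence fits a pattern written with ambiguity codes. The names `AMBIGUITY`, `expand` and `matches` are hypothetical and not part of WebLogo; the sketch handles DNA letters only and raises KeyError for symbols it does not know.

```python
# Ambiguity codes transcribed from the nucleotide table above (DNA only).
AMBIGUITY = {
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}
UNAMBIGUOUS = {"A", "C", "G", "T"}


def expand(code):
    """Return the set of unambiguous bases a single code can stand for."""
    code = code.upper()
    if code in UNAMBIGUOUS:
        return {code}
    return set(AMBIGUITY[code])


def matches(pattern, sequence):
    """True if each base of sequence is allowed by the code at that position."""
    if len(pattern) != len(sequence):
        return False
    return all(base.upper() in expand(code)
               for code, base in zip(pattern, sequence))


print(sorted(expand("R")))            # ['A', 'G']
print(matches("TATAWAW", "TATAAAT"))  # True
print(matches("TATAWAW", "TATAGAT"))  # False
```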

Refs:
    http://www.chem.qmw.ac.uk/iupac/AminoAcid/A2021.html
    http://www.chem.qmw.ac.uk/iubmb/misc/naseq.html

Authors: GEC 2004,2005

class weblogo.seq.Alphabet
    An ordered subset of printable ascii characters.

    Status: Beta
    Authors: GEC 2005

    alphabetic(string)
        True if all characters of the string are in this alphabet.

    chr(n)
        The n’th character in the alphabet (zero indexed) or 0

    chrs(sequence_of_ints)
        Convert a sequence of ordinals into an alphabetic string.

    letters()
        Letters of the alphabet as a string.

    normalize(string)
        Normalize an alphabetic string by converting all alternative symbols to the canonical equivalent in ‘letters’.

    ord(c)
        The ordinal position of the character c in this alphabet, or 255 if no such character.

    ords(string)
        Convert an alphabetic string into a byte array of ordinals.

    static which(seqs, alphabets=None)
        Returns the most appropriate unambiguous protein, RNA or DNA alphabet for a Seq or SeqList. If a list of alphabets is supplied, then the best alphabet is selected from that list.

        The heuristic is to count the occurrences of letters for each alphabet and downweight longer alphabets by the log of the alphabet length. Ties go to the first alphabet in the list.

class weblogo.seq.Seq
    An alphabetic string. A subclass of “str” consisting solely of letters from the same alphabet.

    alphabet -- A string or Alphabet of allowed characters.
    name -- A short string used to identify the sequence.
    description -- A string describing the sequence

    Authors: GEC 2005

    back_translate()
        Translate a protein sequence back into coding DNA, using the standard genetic code. See weblogo.transform.GeneticCode for details and more options.

    complement()
        Returns complementary nucleic acid sequence.

    join(iterable) → str
        Return a string which is the concatenation of the strings in the iterable. The separator between elements is S.

    lower()
        Return a lower case copy of the sequence.

    mask(letters='abcdefghijklmnopqrstuvwxyz', mask='X')
        Replace all occurrences of letters with the mask character. The default is to replace all lower case letters with ‘X’.

    ords()
        Convert sequence to an array of integers in the range [0, len(alphabet) )

    remove(delchars)
        Return a new alphabetic sequence with all characters in ‘delchars’ removed.

    reverse()
        Return the reversed sequence. Note that this method returns a new object, in contrast to the in-place reverse() method of list objects.

    reverse_complement()
        Returns reversed complementary nucleic acid sequence (i.e. the other strand of a DNA sequence.)

    tally(alphabet=None)
        Counts the occurrences of alphabetic characters.
        Arguments: alphabet – an optional alternative alphabet
        Returns: A list of character counts in alphabetic order.

    tostring()
        Converts Seq to a raw string.

    translate()
        Translate a nucleotide sequence to a polypeptide using full IUPAC ambiguities in DNA/RNA and amino acid codes, using the standard genetic code. See weblogo.transform.GeneticCode for details and more options.

    upper()
        Return an upper case copy of the sequence.

    word_count(k, alphabet=None)
        Return a count of all subwords in the sequence.

        >>> from weblogo.seq import *
        >>> Seq("abcabc").word_count(3)
        [('abc', 2), ('bca', 1), ('cab', 1)]


    words(k, alphabet=None)
        Return an iteration over all subwords of length k in the sequence. If an optional alphabet is provided, only words from that alphabet are returned.

        >>> list(Seq("abcabc").words(3))
        ['abc', 'bca', 'cab', 'abc']
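The behaviour of words and word_count shown in the examples above can be reproduced on plain strings. This is a minimal sketch that does not use WebLogo; the two functions only mirror the documented outputs, and sorting the counts by word is an assumption that happens to match the example.

```python
from collections import Counter


def words(sequence, k):
    """Yield every overlapping subword of length k, left to right."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


def word_count(sequence, k):
    """Return (word, count) pairs sorted by word."""
    return sorted(Counter(words(sequence, k)).items())


print(list(words("abcabc", 3)))  # ['abc', 'bca', 'cab', 'abc']
print(word_count("abcabc", 3))   # [('abc', 2), ('bca', 1), ('cab', 1)]
```

A sequence of length n yields n - k + 1 overlapping words, so a sequence shorter than k yields none.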

weblogo.seq.rna(string)
    Create an alphabetic sequence representing a stretch of RNA.

weblogo.seq.dna(string)
    Create an alphabetic sequence representing a stretch of DNA.

weblogo.seq.protein(string)
    Create an alphabetic sequence representing a stretch of polypeptide.

class weblogo.seq.SeqList(alist=[], alphabet=None, name=None, description=None)
    A list of sequences.

    isaligned()
        Are all sequences of the same length and alphabet?

    ords(alphabet=None)
        Convert sequence list into a 2D array of ordinals.

    profile(alphabet=None)
        Counts the occurrences of characters in each column.
        Returns: Motif(counts, alphabet)

    tally(alphabet=None)
        Counts the occurrences of alphabetic characters.
        Parameters: alphabet -- an optional alternative alphabet
        Returns: A list of character counts in alphabetic order.
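The column counts that SeqList.profile is described as producing can be pictured with a small pure-Python sketch. The function below is hypothetical and works on plain strings; it returns nested lists instead of a Motif object, and it silently skips characters outside the alphabet, which is an assumption of this example.

```python
def profile(seqs, alphabet):
    """Count occurrences of each alphabet letter in every column.

    Returns one row per column; each row lists counts in alphabet order.
    Characters outside the alphabet (e.g. gaps) are not counted.
    """
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must all be the same length")
    index = {letter: i for i, letter in enumerate(alphabet)}
    length = len(seqs[0]) if seqs else 0
    counts = [[0] * len(alphabet) for _ in range(length)]
    for seq in seqs:
        for column, char in enumerate(seq):
            if char in index:
                counts[column][index[char]] += 1
    return counts


aligned = ["ACGT", "ACGA", "AC-T"]
for row in profile(aligned, "ACGT"):
    print(row)
# [3, 0, 0, 0]
# [0, 3, 0, 0]
# [0, 0, 2, 0]
# [1, 0, 0, 2]
```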

1.3 Sequence IO

• Sequence file reading and writing
• Supported File Formats

1.3.1 Sequence file reading and writing

Biological sequence data is stored and transmitted using a wide variety of different file formats. This package provides convenient methods to read and write several of these file formats.

WebLogo is often capable of guessing the correct file type, either from the file extension or the structure of the file:

>>> import weblogo.seq_io
>>> afile = open("test_weblogo/data/cap.fa")
>>> seqs = weblogo.seq_io.read(afile)

Alternatively, each sequence file type has a separate module named FILETYPE_io (e.g. fasta_io, clustal_io):


>>> import weblogo.seq_io.fasta_io
>>> afile = open("test_weblogo/data/cap.fa")
>>> seqs = weblogo.seq_io.fasta_io.read(afile)

Sequence data can also be written back to files:

>>> fout = open("out.fa", "w")
>>> weblogo.seq_io.fasta_io.write(fout, seqs)

1.3.2 Supported File Formats

Module              Name              Extension  read  write  features
------------------------------------------------------------------------
array_io            array, flatfile              yes   yes    none
clustal_io          clustalw          aln        yes   yes
fasta_io            fasta, Pearson    fa         yes   yes    none
genbank_io          genbank           gb         yes
intelligenetics_io  intelligenetics   ig         yes   yes
msf_io              msf               msf        yes
nbrf_io             nbrf, pir         pir        yes
nexus_io            nexus             nexus      yes
phylip_io           phylip            phy        yes
plain_io            plain, raw                   yes   yes    none
table_io            table             tbl        yes   yes    none

Each IO module defines one or more of the following functions and variables:

read(afile, alphabet=None)
    Read a file of sequence data and return a SeqList, a collection of Seq’s (alphabetic strings) and features.

read_seq(afile, alphabet=None)
    Read a single sequence from a file.

iter_seq(afile, alphabet=None)
    Iterate over the sequences in a file.

index(afile, alphabet=None)
    Instead of loading all of the sequences into memory, scan the file and return an index map that will load sequences on demand. Typically not implemented for formats with interleaved sequences.

write(afile, seqlist)
    Write a collection of sequences to the specified file.

write_seq(afile, seq)
    Write one sequence to the file. Only implemented for non-interleaved, headerless formats, such as fasta and plain.

example
    A string containing a short example of the file format.

names
    A list of synonyms for the file format, e.g. for fasta_io, ('fasta', 'pearson', 'fa'). The first entry is the preferred format name.

extensions
    A list of file name extensions used for this file format, e.g. fasta_io.extensions is ('fa', 'fasta', 'fast', 'seq', 'fsa', 'fst', 'nt', 'aa', 'fna', 'mpfa'). The preferred or standard extension is first in the list.

Attributes:

• formats: Available seq_io format parsers
• format_names: A map between format names and format parsers.
• format_extensions: A map between filename extensions and parsers.


weblogo.seq_io.read(fin: TextIO, alphabet: weblogo.seq.Alphabet = None) → weblogo.seq.SeqList
    Read a sequence file and attempt to guess its format. First the filename extension (if available) is used to infer the format. If that fails, then we attempt to parse the file using several common formats.

    Note: fin cannot be an unseekable stream such as sys.stdin.

    Returns: SeqList
    Raises:
        ValueError: If the file cannot be parsed.
        ValueError: Sequences do not conform to the alphabet.

weblogo.seq_io.format_names()
    Return a map between format names and format modules

weblogo.seq_io.format_extensions()
    Return a map between filename extensions and sequence file types
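The guessing strategy described for weblogo.seq_io.read (extension first, then trial parsing) can be sketched in pure Python. Everything below is hypothetical: the two toy parsers, the extension table (including 'txt' for plain text) and the fallback order are assumptions for illustration and do not reproduce WebLogo's parsers. The sketch also shows why the stream must be seekable: it is rewound before each attempt.

```python
import io
import os


def parse_fasta(stream):
    """Hypothetical parser: return sequences from FASTA-like text."""
    seqs, current = [], None
    for line in stream:
        line = line.strip()
        if line.startswith(">"):
            current = []
            seqs.append(current)
        elif line:
            if current is None:
                raise ValueError("not FASTA: data before first header")
            current.append(line)
    if not seqs:
        raise ValueError("not FASTA: no header found")
    return ["".join(parts) for parts in seqs]


def parse_plain(stream):
    """Hypothetical parser: one raw sequence per non-empty line."""
    seqs = [line.strip() for line in stream if line.strip()]
    if not seqs:
        raise ValueError("empty input")
    return seqs


PARSERS_BY_EXTENSION = {"fa": parse_fasta, "fasta": parse_fasta,
                        "txt": parse_plain}
FALLBACK_ORDER = [parse_fasta, parse_plain]


def read(stream):
    """Guess the format: try the extension first, then each parser in turn."""
    name = getattr(stream, "name", "")
    extension = ""
    if isinstance(name, str):
        extension = os.path.splitext(name)[1].lstrip(".").lower()
    candidates = []
    if extension in PARSERS_BY_EXTENSION:
        candidates.append(PARSERS_BY_EXTENSION[extension])
    candidates += [p for p in FALLBACK_ORDER if p not in candidates]
    for parser in candidates:
        stream.seek(0)  # rewinding is why the stream must be seekable
        try:
            return parser(stream)
        except ValueError:
            continue
    raise ValueError("cannot parse sequence file")


print(read(io.StringIO(">seq1\nACGT\nACGT\n>seq2\nTTTT\n")))
# ['ACGTACGT', 'TTTT']
print(read(io.StringIO("ACGT\nTTTT\n")))
# ['ACGT', 'TTTT']
```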

1.4 Logo Data, Options, and Format

class weblogo.logo.LogoData(length=None, alphabet=None, counts=None, entropy=None, entropy_interval=None, weight=None)
    The data needed to generate a sequence logo.

    Parameters
    • alphabet – The set of symbols to count. See also --sequence-type, --ignore-lower-case
    • length – All sequences must be the same length, else WebLogo will return an error
    • counts – An array of character counts
    • entropy – The relative entropy of each column
    • entropy_interval – entropy confidence interval

    classmethod from_counts(alphabet: weblogo.seq.Alphabet, counts: numpy.ndarray, prior: numpy.ndarray = None) → weblogo.logo.LogoData
        Build a LogoData object from counts.

    classmethod from_seqs(seqs: weblogo.seq.SeqList, prior: numpy.ndarray = None) → weblogo.logo.LogoData
        Build a LogoData object from a SeqList, a list of sequences.

class weblogo.logo.LogoFormat(logodata, logooptions=None)
    Specifies the format of the logo. Requires LogoData and LogoOptions objects.

    >>> logodata = LogoData.from_seqs(seqs)
    >>> logooptions = LogoOptions()
    >>> logooptions.logo_title = "A Logo Title"
    >>> format = LogoFormat(logodata, logooptions)

    Raises ArgumentError – if arguments are invalid.
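The entropy parameter of LogoData is described above as the relative entropy of each column. As a rough illustration of that quantity, the sketch below computes the relative entropy of one column of counts against a prior, in bits. The function name is hypothetical, and the sketch applies no pseudocounts or small-sample correction, so it should not be read as WebLogo's own calculation.

```python
import math


def relative_entropy(counts, prior=None):
    """Relative entropy (in bits) of one column of counts against a prior.

    counts -- occurrences of each symbol in the column
    prior  -- expected symbol probabilities; equiprobable if omitted
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    if prior is None:
        prior = [1.0 / len(counts)] * len(counts)
    bits = 0.0
    for count, expected in zip(counts, prior):
        if count:  # symbols that never occur contribute nothing
            observed = count / total
            bits += observed * math.log2(observed / expected)
    return bits


print(relative_entropy([10, 0, 0, 0]))  # 2.0
print(relative_entropy([5, 5, 0, 0]))   # 1.0
print(relative_entropy([3, 3, 3, 3]))   # 0.0
```

With a four-letter alphabet and an equiprobable prior, a fully conserved column reaches the maximum of 2 bits and a uniform column gives 0 bits.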

class weblogo.logo.LogoOptions(**kwargs)
    A container for all logo formatting options. Not all of these are directly accessible through the CLI or web interfaces.

    To display LogoOptions defaults:

    >>> from weblogo import *
    >>> LogoOptions()


    All physical lengths are measured in points. (72 points per inch, 28.3 points per cm)

    Parameters
    • creator_text – Embedded as comment in figures.
    • logo_title – Creates title for the sequence logo
    • logo_label – An optional figure label, added to the top left (e.g. ‘(a)’).
    • unit_name – See std_units for options. (Default ‘bits’)
    • yaxis_label – Defaults to unit_name
    • xaxis_label – Add a label to the x-axis, or hide x-axis altogether.
    • fineprint – Defaults to WebLogo name and version
    • show_yaxis – Display entropy scale along y-axis (default: True)
    • show_xaxis – Display sequence numbers along x-axis (default: True)
    • show_ends – Display label at the ends of the sequence (default: False)
    • show_fineprint – Toggle display of the WebLogo version information in the lower right corner. Optional, but we appreciate the acknowledgment.
    • show_errorbars – Draw errorbars (default: False)
    • show_boxes – Draw boxes around stack characters (default: True)
    • debug – Draw extra graphics debugging information.
    • rotate_numbers – Draw xaxis numbers with vertical orientation?
    • scale_width – boolean, scale width of characters proportional to ungaps
    • pad_right – Make a single line logo the same width as multiline logos (default: False)
    • stacks_per_line – Maximum number of logo stacks per logo line. (Default: 40)
    • yaxis_tic_interval – Distance between ticmarks on y-axis (default: 1.0)
    • yaxis_minor_tic_ratio – Distance between minor tic ratio
    • yaxis_scale – Sets height of the y-axis in designated units
    • xaxis_tic_interval – Distance between ticmarks on x-axis (default: 1.0)
    • number_interval – Distance between ticmarks (default: 1.0)
    • shrink_fraction – Proportional shrinkage of characters if show_boxes is true.
    • errorbar_fraction – Sets error bars display proportion
    • errorbar_width_fraction – Sets error bars display
    • errorbar_gray – Sets error bars’ gray scale percentage (default .75)
    • resolution – Dots per inch (default: 96). Used for bitmapped output formats
    • default_color – Symbol color if not otherwise specified
    • color_scheme – A custom color scheme can be specified using CSS2 (Cascading Style Sheet) syntax. E.g. ‘red’, ‘#F00’, ‘#FF0000’, ‘rgb(255, 0, 0)’, ‘rgb(100%, 0%, 0%)’ or ‘hsl(0, 100%, 50%)’ for the color red.
    • stack_width – Scale the visible stack width by the fraction of symbols in the column (i.e. columns with many gaps or unknowns are narrow.) (Default: yes)


    • stack_aspect_ratio – Ratio of stack height to width (default: 5)
    • logo_margin – Default: 2 pts
    • stroke_width – Default: 0.5 pts
    • tic_length – Default: 5 pts
    • stack_margin – Default: 0.5 pts
    • small_fontsize – Small text font size in points
    • fontsize – Regular text font size in points
    • title_fontsize – Title text font size in points
    • number_fontsize – Font size for axis-numbers, in points.
    • text_font – Select font for labels
    • logo_font – Select font for Logo
    • title_font – Select font for Logo’s title
    • first_index – Index of first position in sequence data
    • logo_start – Lower bound of sequence to display
    • logo_end – Upper bound of sequence to display

weblogo.logo.parse_prior(composition, alphabet, weight=None)
    Parse a description of the expected monomer distribution of a sequence.

    Valid compositions:
    • None or ‘none’: No composition specified
    • ‘auto’ or ‘automatic’: Use the typical average distribution for proteins and an equiprobable distribution for everything else.
    • ‘equiprobable’: All monomers have the same probability.
    • a percentage, e.g. ‘45%’, or a fraction, e.g. ‘0.45’: The fraction of CG bases for nucleotide alphabets
    • a species name, e.g. ‘E. coli’, ‘H. sapiens’: Use the average CG percentage for the species’s genome.
    • An explicit distribution, e.g. {‘A’:10, ‘C’:40, ‘G’:40, ‘T’:10}

weblogo.logo.read_seq_data(fin, input_parser=, alphabet=None, ignore_lower_case=False, max_file_size=0)
    Read sequence data from the input stream and return a seqs object.

    The environment variable WEBLOGO_MAX_FILE_SIZE overrides the max_file_size argument. Used to limit the load on the WebLogo webserver.
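One of the compositions that parse_prior is documented to accept is a CG percentage or fraction. The sketch below shows one plausible way such a value could be turned into base probabilities. The function `gc_prior` is hypothetical, and the even split of the CG fraction between C and G (and of the remainder between A and T) is an assumption of this example, not a statement about WebLogo's code.

```python
def gc_prior(composition, alphabet="ACGT"):
    """Turn a GC content such as '45%' or '0.45' into base probabilities.

    Returns probabilities in the order of `alphabet`, splitting the GC
    fraction evenly between C and G and the remainder between A and T.
    """
    text = str(composition).strip()
    if text.endswith("%"):
        gc = float(text[:-1]) / 100.0
    else:
        gc = float(text)
    if not 0.0 <= gc <= 1.0:
        raise ValueError(f"GC fraction out of range: {composition!r}")
    by_base = {"A": (1 - gc) / 2, "T": (1 - gc) / 2,
               "C": gc / 2, "G": gc / 2}
    return [by_base[base] for base in alphabet]


print(gc_prior("40%"))  # [0.3, 0.2, 0.2, 0.3]
```

A prior of this shape could then be passed wherever a per-symbol probability array is expected, for instance as the prior of a relative entropy calculation.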

1.5 Logo Formatting

Logo formatting. Each formatter is a function f(data, format) that draws a representation of the logo. The main graphical formatter is eps_formatter. A mapping ‘formatters’ contains all available formatters. Each formatter returns binary data. The eps and data formats can be decoded to strings, e.g.

eps_as_string = eps_data.decode()

weblogo.logo_formatter.pdf_formatter(logodata: weblogo.logo.LogoData, logoformat: weblogo.logo.LogoFormat) → bytes
    Generate a logo in PDF format.


weblogo.logo_formatter.jpeg_formatter(logodata: weblogo.logo.LogoData, logoformat: weblogo.logo.LogoFormat) → bytes
    Generate a logo in JPEG format.

weblogo.logo_formatter.svg_formatter(logodata: weblogo.logo.LogoData, logoformat: weblogo.logo.LogoFormat) → bytes
    Generate a logo in Scalable Vector Graphics (SVG) format. Requires the program ‘pdf2svg’ be installed.

weblogo.logo_formatter.png_formatter(logodata: weblogo.logo.LogoData, logoformat: weblogo.logo.LogoFormat) → bytes
    Generate a logo in PNG format.

weblogo.logo_formatter.png_print_formatter(logodata: weblogo.logo.LogoData, logoformat: weblogo.logo.LogoFormat) → bytes
    Generate a logo in PNG format with print quality (600 DPI) resolution.

weblogo.logo_formatter.txt_formatter(logodata: weblogo.logo.LogoData, logoformat: weblogo.logo.LogoFormat) → bytes
    Create a text representation of the logo data.

weblogo.logo_formatter.eps_formatter(logodata: weblogo.logo.LogoData, logoformat: weblogo.logo.LogoFormat) → bytes
    Generate a logo in Encapsulated Postscript (EPS)

weblogo.logo_formatter.formatters
    Map between output format names and the corresponding logo formatter. Keys: 'eps', 'jpeg', 'logodata', 'pdf', 'png', 'png_print', 'svg'.

weblogo.logo_formatter.default_formatter(logodata: weblogo.logo.LogoData, logoformat: weblogo.logo.LogoFormat) → bytes
    The default logo formatter.

class weblogo.logo_formatter.GhostscriptAPI(path: os.PathLike = None)
    Interface to the command line program Ghostscript (‘gs’)

    convert(format: str, postscript: str, width: int, height: int, resolution: int = 300) → bytes
        Convert a string of postscript into a different graphical format. Supported formats are ‘png’, ‘pdf’, and ‘jpeg’.
        Raises ValueError – For an unrecognized format.

    version() → str
        Returns: The ghostscript version string
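The pattern described in this section, a mapping from format names to functions that each return bytes, can be sketched without WebLogo. The two toy formatters, the `FORMATTERS` table and the `render` helper below are all hypothetical; they only illustrate name-based dispatch and decoding a text-like result to a string.

```python
def txt_sketch(counts, title):
    """Hypothetical formatter: tab-separated text, returned as bytes."""
    lines = [f"# {title}"]
    lines += ["\t".join(str(c) for c in row) for row in counts]
    return ("\n".join(lines) + "\n").encode("utf-8")


def csv_sketch(counts, title):
    """Hypothetical formatter: comma-separated text, returned as bytes."""
    lines = [f"# {title}"]
    lines += [",".join(str(c) for c in row) for row in counts]
    return ("\n".join(lines) + "\n").encode("utf-8")


# Map between output format names and formatter functions.
FORMATTERS = {"txt": txt_sketch, "csv": csv_sketch}


def render(kind, counts, title):
    """Look up a formatter by name and return its binary output."""
    try:
        formatter = FORMATTERS[kind]
    except KeyError:
        raise ValueError(f"unrecognized format: {kind!r}") from None
    return formatter(counts, title)


data = render("csv", [[3, 0], [1, 2]], "demo")
print(data.decode())
# # demo
# 3,0
# 1,2
```

Keeping every formatter to the same signature and return type is what lets a caller pick the output format from a string such as a command line option.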
