Visual Software Tools for Bioinformatics
Journal of Visual Languages and Computing 19 (2008) 291–301
www.elsevier.com/locate/jvlc

Timothy Arndt*
Department of Computer and Information Sciences, Cleveland State University, 2121 Euclid Avenue, Cleveland, OH 44115-2124, USA

Received 11 June 2007; accepted 15 June 2007

* Tel.: +1 216 687 4779; fax: +1 216 687 5448. E-mail address: [email protected]

Abstract

Bioinformatics is the application of techniques from computer science, statistics and mathematics to problems in molecular biology. This interdisciplinary approach is rapidly revolutionizing biology. A survey of software tools for bioinformatics is presented, with special emphasis on the visual aspects of these tools. The most important visualization tasks in bioinformatics are the visualization of sequence data and of protein structures. The visualization of interactions between molecules in a metabolic pathway or network is an emerging area. Many important visualization techniques have yet to be applied in this application area. © 2007 Elsevier Ltd. All rights reserved.

Keywords: Bioinformatics; Software tools; Reviews

1045-926X/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.jvlc.2007.06.001

1. Introduction

Bioinformatics has been defined as the application of information technology (computer science, mathematics and statistics) to the management of biological information. In particular, bioinformatics has been widely associated with molecular biology, which is largely concerned with the study of three types of molecules: DNA, RNA and protein. The central dogma of molecular biology describes how the information stored in DNA is transcribed into RNA and then translated into protein. Each of these three molecules is a polymer, that is, a string of simpler units: nucleotides in the case of DNA and RNA, amino acids in the case of protein. Each nucleotide contains one of four bases: adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T). Uracil (U) takes the place of thymine in RNA. There are 20 naturally occurring amino acids. Each amino acid can be specified by either a three-letter or a one-letter code. For example, tryptophan is specified by either Trp or W. The one-letter code is more compact and therefore more convenient for computer processing.

The DNA molecule has the famous double helix structure, in which each base from one strand of the double helix pairs with a base from the other strand. An A base pairs only with a T base, while a C base pairs only with a G base. Because of this, given the sequence of one of the strands, we can infer the sequence of the complementary strand. Thus, a DNA molecule can be specified by giving the sequence of one of the strands, for example, AAACGTC.

The story is a bit more complex for RNA and proteins, which are single stranded. We can still usefully characterize the molecule by giving the sequence data (a string of bases for RNA, a string of amino acids for protein); however, this does not completely characterize the molecules, since they can fold into irregular shapes which are functionally important. For these molecules, the sequence data is referred to as the primary structure, while the secondary structure is the three-dimensional form of local segments of the polymer. Typical local structures for proteins are alpha helices and beta sheets, while the stem-loop is a typical RNA secondary structure. The tertiary structure of a protein is its three-dimensional structure given by the atomic coordinates, while quaternary structure is the arrangement of multiple folded proteins in a protein complex. Fig. 1 shows the secondary structure of the myoglobin protein, which contains several alpha helices and random coils, but no beta sheets.
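The base-pairing rule described above (A with T, C with G) is what allows one strand to stand in for the whole DNA molecule. A minimal sketch of that inference follows. The function names are illustrative and do not come from any tool in this survey, and the reversal step relies on the standard convention, not stated in the text, that the two strands run in opposite directions.

```python
# Watson-Crick pairing: A pairs only with T, C pairs only with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand.

    Raises KeyError for any symbol other than A, C, G or T.
    """
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand written in its own 5'->3' direction."""
    return complement(strand)[::-1]


def transcribe(strand: str) -> str:
    """Rewrite a DNA sequence in the RNA alphabet (uracil replaces thymine)."""
    return strand.upper().replace("T", "U")


if __name__ == "__main__":
    print(complement("AAACGTC"))          # TTTGCAG
    print(reverse_complement("AAACGTC"))  # GACGTTT
    print(transcribe("AAACGTC"))          # AAACGUC
```

Storing a single strand loses nothing, since the other strand can always be recomputed this way.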
The visual representations of the two secondary structures in Fig. 1 are typical.

Fig. 1. The secondary structure of the myoglobin protein.

One of the most important tasks of a bioinformatics tool is to perform sequence alignment. Given two different but related sequences (of possibly different lengths), the tool attempts to find the best match between them. Fig. 2 shows the alignment between two zinc finger proteins produced by the freely available ClustalW program. Notice that amino acids that have similar chemical properties are shown in the same color. The third row, below the two sequences being aligned, gives information about the quality of the match in each column: a "*" symbol means that the two amino acids are identical, a ":" represents a conserved substitution (substitution of a similar amino acid), while a "." represents a semi-conserved substitution.

Fig. 2. A sequence alignment between two zinc finger proteins produced by ClustalW.

When trying to deduce the evolutionary history of several organisms or genes, it is necessary to align the sequence data from each of the organisms or genes using a process called multiple sequence alignment. Fig. 3 shows a multiple sequence alignment produced by ClustalW for several instances of a particular protein from several different organisms. Note once again how color is used to help in the interpretation of the result. Usually, a multiple sequence alignment results in a consensus sequence, which shows the base or amino acid that occurs most often in each column of the alignment.

Fig. 3. A multiple sequence alignment of several proteins produced by ClustalW.

An alternative way to view sequence alignments is the sequence logo format developed by Tom Schneider at the National Cancer Institute.
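The consensus sequence and the "*" identity mark described above can both be computed column by column from a finished alignment. The following is a minimal sketch, not ClustalW's implementation. The toy alignment, the function names, the choice to ignore gap characters when counting, and the tie-breaking rule (the residue seen first in the column wins) are all assumptions made for illustration. The ":" and "." marks are omitted because they depend on amino acid similarity groups that the text does not list.

```python
from collections import Counter


def _check(alignment: list[str]) -> None:
    """Ensure the alignment is non-empty and all rows have equal length."""
    if not alignment:
        raise ValueError("alignment is empty")
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")


def consensus(alignment: list[str], gap: str = "-") -> str:
    """Most frequent residue in each column; gaps are not counted."""
    _check(alignment)
    result = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != gap)
        # A column made only of gaps stays a gap in the consensus.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


def identity_marks(alignment: list[str], gap: str = "-") -> str:
    """'*' where every row has the same residue, a space elsewhere."""
    _check(alignment)
    return "".join(
        "*" if len(set(column)) == 1 and gap not in column else " "
        for column in zip(*alignment)
    )


if __name__ == "__main__":
    toy = ["ACGT-A", "ACGTTA", "TCGATA", "ACCTTA"]  # hypothetical data
    print(consensus(toy))       # ACGTTA
    print(identity_marks(toy))  # stars under columns 2 and 6 only
```

The consensus keeps only the winning residue in each column and discards how strongly it won, which is the information a sequence logo restores.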
A sequence logo shows more information about the alignment than a consensus sequence, such as whether more than one base or amino acid occurs in each column, and the relative frequency of occurrence in the column. The sequence logo shows the frequencies of bases in each column as the relative heights of the letters representing the bases, along with the degree of sequence conservation as the total height of a stack of letters, measured in bits of information [1]. An example is shown in Fig. 4. The sequence logo shown was generated using the DELILA programs.

A related visualization technique developed by the same author is the sequence walker [2]. A sequence walker displays information about a single sequence of a multiple sequence alignment. The height of each letter in the graphic representation indicates how well the base matches the consensus value at that position. Bases that have a positive match value are shown right side up, while bases that have negative values are shown upside down and below the "horizon". Bases that do not appear in the set of aligned sequences are shown negatively and in a black box. The zero coordinate (the position by which a set of binding sites is aligned, where a binding site is the place on a molecule to which a protein binds) is inside a rectangle that has a light green background if the sequence has been evaluated as a binding site, and a pink background otherwise. An example of a sequence walker is shown in Fig. 5.

Multiple sequence alignments are sometimes used to infer evolutionary relationships, which can then be used to generate a phylogenetic tree showing the evolutionary history of a number of organisms. Many bioinformatics tools allow for the generation and manipulation of such trees. An example Tree of Life (TOL) generated using Interactive Tree of Life (iTOL) [3], an online phylogenetic tree viewer, is shown in Fig. 6.
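Returning to the sequence logo described above: the stack height in bits and the individual letter heights can be computed for one alignment column as follows. This is a minimal sketch and not the DELILA implementation. It takes the stack height to be the maximum possible entropy of the alphabet (2 bits for DNA) minus the observed Shannon entropy of the column, and scales each letter by its relative frequency. It applies no small-sample correction, and the function name and the example columns are hypothetical.

```python
import math
from collections import Counter


def logo_column(column: str, alphabet: str = "ACGT") -> dict[str, float]:
    """Return the height in bits of each letter in one logo stack.

    Symbols outside `alphabet` (for example gaps) are ignored.
    """
    counts = Counter(symbol for symbol in column if symbol in alphabet)
    total = sum(counts.values())
    if total == 0:
        return {}
    freqs = {s: counts[s] / total for s in alphabet if counts[s]}
    # Shannon entropy of the observed column, in bits.
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    # Conservation: total stack height (uncorrected for sample size).
    information = math.log2(len(alphabet)) - entropy
    # Each letter takes a share of the stack equal to its frequency.
    return {s: f * information for s, f in freqs.items()}


if __name__ == "__main__":
    print(logo_column("AAAA"))  # fully conserved: A is 2 bits tall
    print(logo_column("AACC"))  # two equally likely bases: 0.5 bits each
    print(logo_column("ACGT"))  # no conservation: every height is 0
```

A fully conserved column gives the tallest stack, while a column in which all four bases are equally frequent has no height at all, so conserved positions stand out in the logo.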
The following section surveys several tools, both commercial and free, which can be used for doing bioinformatics.

Fig. 4. A sequence logo.

Fig. 5. A sequence walker for a human donor splice site.

Fig. 6. A phylogenetic tree created using iTOL.

2. Bioinformatics tools

A very large number of bioinformatics tools exist, so any survey will necessarily be incomplete. In this section, I will introduce a number of representative tools.

Visualizing protein structures is an important task for molecular and structural biologists. Visualization of the 3-D shape and structure of a protein can help the biologist identify catalytic and interactive sites and characterize the protein in other ways. Strap [4] (available for Windows, Linux, Mac and Unix) is an example of a comprehensive software suite for viewing proteins. It has a large set of tools and functionalities. The file manager allows the user to create proteins manually as well as to import sample files from internet-based databases. The 3-D tab displays 3-D structures of proteins stored in the PDB file format. The align tab in the menu gives the user several options for aligning two or more protein sequences with different settings. The Predict tool is an innovative feature that predicts the 3-D structure of a part of the protein sequence. The analyze menu has tools to determine the phylogenetic tree and to view the dot plot and other data pertaining to the protein at hand. Plugins can be downloaded to extend the functionality of the software. A screen shot of Strap is shown in Fig. 7.

Fig. 7. Strap.

Geneious [5] (available for Windows, Mac and Linux) is a well-designed sequence visualizer. The software is very intuitive. Geneious can be used for both DNA and protein sequences.
It has tools to allow the user to view the sequences in different ways, giving detailed statistics and other information.