ARTICLE IN PRESS

Journal of Visual Languages and Computing 19 (2008) 291–301 www.elsevier.com/locate/jvlc

Visual software tools for bioinformatics

Timothy Arndt*

Department of Computer and Information Sciences, Cleveland State University, 2121 Euclid Avenue, Cleveland, OH 44115-2124, USA

Received 11 June 2007; accepted 15 June 2007

Abstract

Bioinformatics is the application of techniques from computer science, statistics and mathematics to problems in molecular biology. This interdisciplinary approach is rapidly revolutionizing biology. A survey of software tools for bioinformatics is presented, with special emphasis on the visual aspects of these tools. The most important visualization tasks in bioinformatics are the visualization of sequence data and the visualization of protein structures. The visualization of interactions between molecules in a metabolic pathway or network is an emerging area. Many important visualization techniques have yet to be applied in this application area. © 2007 Elsevier Ltd. All rights reserved.

Keywords: Bioinformatics; Software tools; Reviews

1. Introduction

Bioinformatics has been defined as the application of information technology (computer science, mathematics and statistics) to the management of biological information. In particular, bioinformatics has been widely associated with molecular biology, which is largely concerned with the study of three types of molecules: DNA, RNA and protein. The central dogma of molecular biology describes how the information stored in DNA is transcribed into RNA and then translated into protein. Each of these three molecules is a polymer, a string of simpler units: nucleotides in the case of DNA and RNA, amino acids in the case of protein. Each nucleotide contains one of four bases: adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T). Uracil (U) takes the place of thymine in RNA. There are 20 naturally occurring amino acids. Each can be specified by either a three-letter or a one-letter code. For example, tryptophan is specified by either Trp or W. The one-letter code is more convenient for computer processing.

*Tel.: +1 216 687 4779; fax: +1 216 687 5448. E-mail address: [email protected]

1045-926X/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.jvlc.2007.06.001

The DNA molecule has the famous double helix structure in which each base from one strand of the double helix pairs with a base from the other strand. An A base pairs only with a T base, while a C base pairs only with a G base. Because of this, given the sequence of one of the strands, we can infer the sequence of the complementary strand. Thus, a DNA molecule can be specified by giving the sequence of one of the strands, for example, AAACGTC etc.

The story is a bit more complex for RNA and proteins, which are single stranded. We can still usefully characterize the molecule by giving the sequence data (a string of bases for RNA, a string of amino acids for protein); however, this does not completely characterize the molecules, since they can fold into irregular shapes which are functionally important. For these molecules, the sequence data is referred to as the primary structure, while the secondary structure is the three-dimensional form of local segments of the polymer. Typical local structures for proteins are alpha helices and beta sheets, while the stem-loop is a typical RNA secondary structure. The tertiary structure of a protein is its three-dimensional structure given by the atomic coordinates, while quaternary structure is the arrangement of multiple folded proteins in a protein complex. Fig. 1 below shows the secondary structure of the myoglobin protein, which contains several alpha helices and random coils, but no beta sheets. The visual representations of the two secondary structures are typical.

Fig. 1. The secondary structure of the myoglobin protein.

One of the most important tasks of a bioinformatics tool is to perform sequence alignment. Given two different but related sequences (of possibly different lengths), the tool attempts to find the best match between them. In Fig. 2, the alignment between two zinc finger proteins produced by the freely available ClustalW program is shown. Notice that amino acids that have similar chemical properties are shown in the same color. The third row, below the two sequences being aligned, gives information about the goodness of the match in each column: a "*" symbol means that the two amino acids are identical, a ":" represents a conserved substitution (substitution of a similar amino acid), while a "." represents a semi-conserved substitution.

Fig. 2. A sequence alignment between two zinc finger proteins produced by ClustalW.

When trying to deduce the evolutionary history of several organisms or genes, it is necessary to align the sequence data from each of the organisms/genes using a process called multiple sequence alignment. Fig. 3 shows a multiple sequence alignment produced by ClustalW for several instances of a particular protein from several different organisms. Note once again how color is used to help in the interpretation of the result. Usually, a multiple sequence alignment results in a consensus sequence, which shows the base or amino acid that occurs most often in each column of the alignment.

Fig. 3. A multiple sequence alignment of several proteins produced by ClustalW.

An alternative way to view sequence alignments is with the sequence logo format that was developed by Tom Schneider at the National Cancer Institute. This method shows more information about the alignment, such as whether more than one base or amino acid occurs in each column, and the relative frequency of occurrence in the column. The sequence logo shows the frequencies of bases in each column as the relative height of the letter representing the base, along with the degree of sequence conservation as the total height of a stack of letters, measured in bits of information [1]. An example is shown in Fig. 4. The sequence logo shown was generated using the DELILA programs.

A related visualization technique developed by the same author is the sequence walker [2]. A sequence walker displays information about a single sequence of a multiple sequence alignment. The height of each letter in the graphic representation indicates how well the base matches the consensus value at that position. Bases that have a positive match value are shown right side up, while bases that have negative values are shown upside down and below the "horizon". Bases that do not appear in the set of aligned sequences are shown negatively and in a black box. The zero coordinate (a position by which a set of binding sites, the places on a molecule that a protein binds to, is aligned) is inside a rectangle that has a light green background if the sequence has been evaluated as a binding site, and a pink background otherwise. An example of a sequence walker is shown in Fig. 5.

Multiple sequence alignments are sometimes used to infer an evolutionary history, which can in turn be used to generate a phylogenetic tree showing the evolutionary relationships among a number of organisms. Many bioinformatics tools allow for the generation and manipulation of such trees. An example Tree of Life (TOL) generated using Interactive Tree of Life (iTOL) [3], an online phylogenetic tree viewer, is shown in Fig. 6. The following section surveys several tools, both commercial and free, which can be used for doing bioinformatics.
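The pairwise alignment task described earlier in this section (finding the best match between two related sequences of possibly different lengths) can be illustrated with a small dynamic-programming sketch. This is the textbook Needleman-Wunsch global alignment, not a description of how ClustalW works internally; the function names and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen only for illustration, and real tools use substitution matrices and more refined gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment (illustrative scoring values).

    Returns (aligned_a, aligned_b, score), with '-' marking gaps.
    """
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # align two residues
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def match_line(x, y):
    """Third row of a pairwise display: '*' where aligned residues are identical."""
    return "".join("*" if p == q and p != "-" else " " for p, q in zip(x, y))


if __name__ == "__main__":
    top, bottom, total = global_align("ACGT", "AGT")
    print(top)
    print(bottom)
    print(match_line(top, bottom))
    print("score:", total)
```

The `match_line` helper only marks identities; distinguishing conserved from semi-conserved substitutions, as in the ClustalW output described above, would additionally need a table of amino acid similarity groups, which is not given here.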

Fig. 4. A sequence logo.
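The letter heights in a logo such as Fig. 4 can be approximated with a few lines of code. The sketch below assumes a gap-free DNA alignment and omits the small-sample correction used in [1]; the function name `logo_heights` is hypothetical. For each column it computes the information content as the maximum possible entropy (2 bits for four bases) minus the observed entropy, and then scales each base by its relative frequency.

```python
import math
from collections import Counter


def logo_heights(aligned, alphabet="ACGT"):
    """Return, for each alignment column, a dict {base: letter height in bits}.

    Assumes equal-length, gap-free sequences over `alphabet`.
    No small-sample correction is applied (a simplification).
    """
    max_bits = math.log2(len(alphabet))  # 2 bits for DNA
    columns = []
    for col in zip(*aligned):
        counts = Counter(col)
        total = sum(counts[b] for b in alphabet)
        freqs = {b: counts[b] / total for b in alphabet}
        entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
        info = max_bits - entropy  # total height of the letter stack
        columns.append({b: f * info for b, f in freqs.items()})
    return columns


if __name__ == "__main__":
    for heights in logo_heights(["AA", "AC", "AG", "AT"]):
        print({b: round(h, 3) for b, h in heights.items()})
```

A fully conserved column yields a single letter 2 bits tall, while a column in which all four bases are equally frequent carries no information and has zero height.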

Fig. 5. A sequence Walker for a human donor splice site.
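The signed letter heights of a walker like the one in Fig. 5 can be sketched as follows. This is a simplified reading of the individual-information weights described in [2]: the weight of base b at position l is taken as log2 of the alphabet size plus log2 of the frequency of b at l among the aligned sites, with the small-sample correction omitted. The function names are hypothetical. Bases never seen at a position receive a weight of minus infinity, which corresponds to the bases drawn in a black box.

```python
import math
from collections import Counter


def weight_matrix(sites, alphabet="ACGT"):
    """Per-position weights in bits from a set of aligned binding sites.

    Simplified: weight = log2(len(alphabet)) + log2(frequency), with no
    small-sample correction. Unseen bases get -inf.
    """
    max_bits = math.log2(len(alphabet))
    matrix = []
    for col in zip(*sites):
        counts = Counter(col)
        n = len(col)
        matrix.append({
            b: max_bits + math.log2(counts[b] / n) if counts[b] else float("-inf")
            for b in alphabet
        })
    return matrix


def walker(sequence, matrix):
    """Signed letter heights for one sequence: positive letters are drawn
    upright, negative ones upside down below the horizon."""
    return [matrix[i][base] for i, base in enumerate(sequence)]


if __name__ == "__main__":
    sites = ["A"] * 7 + ["C"]  # one aligned position, eight example sites
    m = weight_matrix(sites)
    print(walker("A", m), walker("C", m), walker("G", m))
```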

Fig. 6. A phylogenetic tree created using iTOL.
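As a rough illustration of how a multiple sequence alignment can lead to a tree such as the one in Fig. 6, the sketch below clusters aligned sequences with UPGMA, a simple textbook method that is swapped in here for illustration and is not claimed to be what iTOL or the other tools use. Distances are plain fractions of differing columns, and the result is written in Newick, a common plain-text tree format. All names and the toy sequences are assumptions.

```python
from itertools import combinations


def p_distance(s, t):
    """Fraction of aligned columns at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(s, t)) / len(s)


def upgma(aligned):
    """Cluster aligned sequences (dict name -> sequence) with UPGMA.

    Returns the tree topology as a Newick string (branch lengths omitted).
    """
    sizes = {name: 1 for name in aligned}  # cluster label -> number of leaves
    dist = {
        frozenset((a, b)): p_distance(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }
    while len(sizes) > 1:
        a, b = sorted(min(dist, key=dist.get))  # closest pair of clusters
        merged = f"({a},{b})"
        # Distance from the new cluster to every other one: size-weighted mean.
        for c in sizes:
            if c in (a, b):
                continue
            d = (dist[frozenset((a, c))] * sizes[a]
                 + dist[frozenset((b, c))] * sizes[b]) / (sizes[a] + sizes[b])
            dist[frozenset((merged, c))] = d
        # Drop distances that involve the two clusters just merged.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)) + ";"


if __name__ == "__main__":
    toy = {"human": "AAAA", "chimp": "AAAT", "mouse": "CCTT"}  # made-up data
    print(upgma(toy))
```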

2. Bioinformatics tools

There exist a huge number of bioinformatics tools, so any survey will necessarily be incomplete. In this section, I introduce a number of representative tools.

Visualizing protein structures is an important task for molecular and structural biologists. Visualization of the 3-D shape and structure of a protein can help the biologist identify catalytic and interactive sites and in other ways characterize a protein. Strap [4] (available for Windows, Linux, Mac and Unix) is an example of a comprehensive software suite for viewing proteins. It has a large set of tools and functions. The file manager allows the user to create proteins manually as well as to import sample files from internet-based databases. One can use the 3-D tab to view 3-D structures of proteins in the PDB file format. The align tab in the menu gives the user several options for aligning two or more protein sequences with different settings. The Predict tool is an innovative tool that predicts the 3-D structure of a part of the protein sequence. The analyze menu has tools to determine the phylogenetic tree and to view the dot plot and other data pertaining to the protein at hand. One can download plugins to extend the functionality of the software. A screen shot of Strap is shown in Fig. 7.

Fig. 7. Strap.

Geneious [5] (available for Windows, Mac and Linux) is a well-designed sequence visualizer. The software is very intuitive. Geneious can be used for both DNA and protein sequences. It has tools that allow the user to view the sequences in different ways, giving detailed statistics and other information. The software supports viewing 3-D structures, chromatograms and trees. One can align two or more sequences. Geneious cleanly separates DNA sequences, proteins, trees, alignments and other types of documents using a tree-oriented file manager. The file manager is generic, supporting manual file creation as well as importing files in various formats as specified in the file list. A few tutorials are also built into the software to guide the user. The "Collaboration" tool allows users to set up different accounts to work independently and share data simultaneously. The EMBL and NCBI menus allow the user to search the two major databases for sequences and compare them. One can also create an agent to compare the sequences in the local database with those in the EMBL and NCBI databases. Other common tools include scanning for Open Reading Frames (ORFs) and primer design. An example of the use of Geneious is shown in Fig. 8.

CLC Combined Workbench [6] (available for Windows, Mac and Linux) is one of the most comprehensive bioinformatics tools available. It integrates all the features of the CLC Gene and Protein Workbenches. The graphical interface is sophisticated and intuitive. The menu bar provides links to a large list of tools for working with DNA sequences, proteins, cloning, enzyme digestion and many other applications. The software also allows the creation of a comprehensive report on proteins, which includes statistics of the molecular structure and graphs as well as BLAST reports from the NCBI database. The user can also view 3-D structures of proteins. One can also create a report on DNA sequences, which is not as comprehensive as the protein report, but which includes all the statistical data of the sequence. CLC Combined Workbench is shown in Fig. 9.
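Several of the tools in this survey, Strap among them, offer a dot plot view. The idea is simple enough to sketch in a few lines: one sequence is laid along each axis and a mark is placed wherever the two sequences agree, so that similar regions appear as diagonal runs. The function below is a minimal text-mode illustration with a hypothetical name and an assumed `window` parameter; it does not reproduce the rendering of any particular tool.

```python
def dot_plot(seq_x, seq_y, window=1):
    """Return the dot plot as a list of text rows.

    Row i, column j holds '*' when the `window`-long words starting at
    seq_y[i] and seq_x[j] are identical, and '.' otherwise.
    """
    rows = []
    for i in range(len(seq_y) - window + 1):
        word_y = seq_y[i:i + window]
        rows.append("".join(
            "*" if seq_x[j:j + window] == word_y else "."
            for j in range(len(seq_x) - window + 1)
        ))
    return rows


if __name__ == "__main__":
    for row in dot_plot("GATTACA", "GATCACA", window=2):
        print(row)
```

Larger windows suppress isolated chance matches at the cost of hiding short regions of similarity.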
Fig. 8. Geneious.

Fig. 9. CLC Combined Workbench.

Genchek [7] (available for Linux, Mac, Windows and Unix) is a comprehensive software suite. It supports several operations on DNA and proteins. The software is not very intuitive and takes time to adapt to. However, once the user is accustomed to the functionality of the software, he/she can perform detailed analysis of the data and perform operations at a very high degree of precision by fine-tuning several parameters.

The software clearly separates the DNA workspace from the protein workspace. While working on DNA, tools are available to trim the sequence and to perform restriction enzyme analysis, six-frame analysis, ORF analysis, primer design, local and global alignment, and multiple sequence alignment using MSA or ClustalW. One can also perform cloning in various modes on DNA sequences. While working on proteins, tools are available to translate the protein, perform protease digestion, display the sequence as a helical wheel or a beta staircase and display various statistical analyses of the protein. As in DNA mode, one can perform local and global alignment between proteins and even multiple sequence alignment using the MSA or ClustalW option. The general functionality includes "BLASTing" the sequences, searching for patterns and using the local database to store and compare sequences. The "web interface" tab includes various features that help in accessing the Internet for performing operations and searching for features. Genchek is shown in Fig. 10.

Fig. 10. Genchek.

GENtle [8] is a multi-purpose bioinformatics software tool with a variety of functions to operate on DNA and proteins. The tools are well designed and intuitive. They are detail oriented and provide all the information associated with the current operation or sequence. For instance, the DNA sequence viewer automatically displays all the restriction enzymes that can cut the sequence and their respective positions. The calculator tool gives the statistical data of the sequence. An image viewer is also available. The file manager accepts files in many different formats, but it exits when it encounters an unacceptable file or operation, without informing the user of the problem. This causes the user to lose all unsaved data, with no opportunity to save it. One can also use the virtual gel viewer for restriction enzyme digestion analysis. The web interface allows the user to BLAST the sequence against the NCBI database. Other tools allow one to manage the local database. There are many other tools to operate on sequences in detail. Overall, the design of the software allows the user to quickly adapt to it. An example of GENtle in action is shown in Fig. 11.
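The restriction enzyme display mentioned above for GENtle amounts to searching a DNA sequence for each enzyme's recognition sequence. The sketch below is an illustration only, not GENtle's implementation: it uses three well-known enzymes with palindromic recognition sequences and reports the 1-based position at which each recognition site starts (not the exact cut position within the site). The table and function names are assumptions.

```python
import re

# Recognition sequences of three commonly used restriction enzymes.
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(dna, enzymes=ENZYMES):
    """Map enzyme name -> 1-based start positions of its recognition sites.

    Enzymes with no site in `dna` are left out of the result.
    """
    dna = dna.upper()
    hits = {}
    for name, site in enzymes.items():
        # A lookahead is used so that overlapping occurrences are all reported.
        positions = [m.start() + 1 for m in re.finditer(f"(?={site})", dna)]
        if positions:
            hits[name] = positions
    return hits


if __name__ == "__main__":
    print(find_sites("TTGAATTCAAGGATCCTTGAATTC"))
```

Because the three recognition sequences used here are palindromic (each equals its own reverse complement), scanning a single strand is sufficient for them.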

Fig. 11. GENtle.

Fig. 12. MB DNA analysis.

MB DNA analysis [9] looks basic at first glance; however, it is somewhat complex and requires time to learn. The functionality is not very straightforward to use, and a lot of trial and error is needed to fully understand how the software works. Once acquainted with the software, the user can create DNA and protein sequences, analyze them, generate statistical data about the sequence, draw dot plots, determine ORFs, design primers and even display the charge graph for a protein sequence. The database tools help the user to view the files stored in the database and to maintain the database. The "Read Aloud" tool is rather innovative, speaking the selected sequence to the user. This tool may seem trivial but can be handy at times. The original setup file comes with three plugins that help in viewing helices, performing multiple sequence alignments and managing reports. These plugins are rather impressive in their detailed output. The multiple alignment plugin gave detailed data about the alignment. Overall, this is a very impressive piece of software, with room for improvement. A screen shot of the program is given in Fig. 12.
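ORF determination, which several of the surveyed tools provide, can be sketched as a scan of all six reading frames: three on the given strand and three on its reverse complement, which follows from the base-pairing rule (A with T, C with G). The code below is a minimal illustration under simplifying assumptions (an ORF is an ATG followed in frame by one of the three standard stop codons; the names and the `min_codons` threshold are hypothetical) and does not reflect the behavior of any specific tool.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Sequence of the complementary strand, read in its own 5'-to-3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=2):
    """Scan six reading frames for ATG...stop open reading frames.

    Returns (strand, frame, start, end) tuples; start/end are 0-based,
    half-open offsets on the strand that was scanned, stop codon included.
    `min_codons` counts codons from ATG up to, but not including, the stop.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


if __name__ == "__main__":
    print(reverse_complement("AAACGTC"))
    print(find_orfs("CCATGAAATAGCC"))
```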

3. Conclusions

The tools surveyed integrate a variety of methods for visualizing sequence data and protein structures. While not yet widely available, new methods are also emerging for the visualization of metabolic pathways. Pathfinder [10] is a tool for the dynamic visualization of metabolic pathways based on annotation data. The pathways are represented as directed acyclic graphs, and graph layout algorithms are used to draw and visualize them dynamically. MetNetVR [11] is an innovative approach that combines graph layouts in 3D space, computer graphics and virtual reality technologies for interactive visualization of high-dimensional metabolic networks. While a wide variety of bioinformatics tools is available, there is still an opportunity for data visualization experts to introduce innovative visualization techniques (as exemplified by the sequence logo and sequence walker) in the field. For further information on bioinformatics, see [12,13].
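To give a flavor of what laying out a pathway represented as a directed acyclic graph involves, the sketch below assigns each compound to a horizontal layer using longest-path layering, a common first step in layered graph drawing. It is an assumption-laden illustration and not a description of the algorithms actually used by Pathfinder or MetNetVR; the function name and the example edges are hypothetical. It relies on `graphlib`, which is part of the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def layer_assignment(edges):
    """Assign each node of a DAG to a layer (longest-path layering).

    `edges` is a list of (substrate, product) pairs. A node's layer is one
    more than the deepest layer among its predecessors; sources get layer 0.
    """
    predecessors = {}
    for src, dst in edges:
        predecessors.setdefault(src, set())
        predecessors.setdefault(dst, set()).add(src)

    layer = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(predecessors).static_order():
        layer[node] = 1 + max((layer[p] for p in predecessors[node]), default=-1)
    return layer


if __name__ == "__main__":
    # Toy pathway fragment, for illustration only.
    example = [("glucose", "G6P"), ("G6P", "F6P"), ("F6P", "F16BP"), ("G6P", "6PG")]
    for compound, depth in sorted(layer_assignment(example).items(), key=lambda kv: kv[1]):
        print(depth, compound)
```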

Acknowledgment

I would like to thank my student Anand Doshi for his help in the preparation of this article.

References

[1] T.D. Schneider, R.M. Stephens, Sequence logos: a new way to display consensus sequences, Nucleic Acids Research 18 (1990) 6097–6100.
[2] T.D. Schneider, Sequence Walkers: a graphical method to display how binding proteins interact with DNA or RNA sequences, Nucleic Acids Research 25 (1997) 4408–4415.
[3] Interactive Tree of Life. <http://itol.embl.de/>.
[4] Strap. <http://www.charite.de/bioinf/strap/>.
[5] Geneious. <http://www.geneious.com/>.
[6] CLC Combined Workbench. <http://www.clcbio.com/index.php?id=92>.
[7] Genchek. <http://www.ocimumbio.com/web/bioinformatics/prod_details.asp?prod_id=27&prodType=1>.
[8] GENtle. <http://gentle.magnusmanske.de/>.
[9] MB DNA analysis. <http://www.molbiosoft.de/>.
[10] A. Goesmann, M. Haubrock, F. Meyer, J. Kalinowski, R. Giegerich, Pathfinder: reconstruction and dynamic visualization of metabolic pathways, Bioinformatics 18 (1) (2002) 124–129.
[11] Y. Yang, E.S. Wurtele, C. Cruz-Neira, J.A. Dickerson, Hierarchical visualization of metabolic networks using virtual reality, in: Proceedings of the 2006 ACM International Conference on Virtual Reality Continuum and its Applications, Hong Kong, China, 2006, pp. 377–381.
[12] D.E. Krane, M.L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2003.
[13] C. Gibas, P. Jambeck, Developing Bioinformatics Computer Skills, O’Reilly, Sebastopol, 2001.