dieGOS GO & SPECIES ENRICHMENT ANALYSIS

by DIEGO MONCAYO M.

Supervisors: Dr Mikael Bodén and Dr Minh Duc Cao.

Department of Electrical and Computer Engineering, University of Queensland.

Submitted for the degree of Master of Computer Science. NOVEMBER 2013.

32/1 Mitre St. St. Lucia QLD 4072 Tel. (04)3102 6180 November 18, 2013

The Dean School of Engineering University of Queensland St Lucia, Q 4072

Dear Professor Paul Strooper,

In accordance with the requirements of the degree of Master of Computer Science in the School of Information Technology and Electrical Engineering, I present the following thesis entitled “dieGOS - GO & Species Enrichment Analysis”. This work was performed under the supervision of Dr Mikael Bodén and Dr Minh Duc Cao. I declare that the work submitted in this thesis is my own, except as acknowledged in the text and footnotes, and has not been previously submitted for a degree at the University of Queensland or any other institution.

Yours sincerely,

DIEGO MONCAYO M.

To my beloved family, especially to my parents.

Acknowledgments

I wish to express my sincere gratitude to the main supervisor of this thesis, Dr Mikael Bodén, for giving me the opportunity to do this project and for his valuable guidance, advice, and help. I would also like to thank him for his support and assistance during the recovery following my accident. In addition, I would like to thank Dr Minh Duc Cao for his recommendations and comments during his supervision of this thesis work, and the whole Research Group, especially Dr Yosephine Gumulya and PhD student Julian Zaugg, for their suggestions about the tool.

Abstract

This thesis addresses the lack of tools for determining the statistical significance of a user-specified sequence motif in relation to the sequences of a species and to the annotations of those sequences. The tool, called dieGOS, is implemented as a web service published on the Internet. It enables the user to input different types of sequence motif, to select the species of interest, to choose the preferred statistical significance test, and to obtain tabular and visual results. The web service helps researchers investigate the significance of a particular motif for a group of species and among the Gene Ontology terms associated with those species. Obtaining such results usually requires developing or applying several different tools over large files. This tool integrates all the functionality in one simple view, while advanced input and output visual components provide powerful options and simplify user interaction. As the user experience depends on the algorithms that process the data, these have been optimized to be effective and efficient, and they return results in an acceptable time frame. The results obtained by applying the tool to the data of a real study differed slightly from those of the study. However, the results are consistent with the numbers involved in the statistical tests.

Contents

Acknowledgments
Abstract
List of Figures
List of Tables

1 Introduction
    1.1 Challenges

2 Background
    2.1 Sequence Motif
        2.1.1 PROSITE Pattern notation
        2.1.2 Matrices
    2.2 Protein Localisation Signal
    2.3 UniProtKB
    2.4 Taxonomy
        2.4.1 NCBI Taxonomy database
    2.5 Gene Ontology
        2.5.1 Ontology
        2.5.2 Gene Ontology Project
        2.5.3 Structure of GO
    2.6 Statistical Test
        2.6.1 Fisher’s exact test
        2.6.2 Mann-Whitney-Wilcoxon test

3 Previous Work
    3.1 Repository
    3.2 Gene Ontology enrichment tools

4 Materials
    4.1 Data
        4.1.1 Protein Sequences
        4.1.2 Gene Ontology
        4.1.3 Taxonomy
        4.1.4 Statistics of Database Files
    4.2 Technology
        4.2.1 Computer
        4.2.2 Programming Language
        4.2.3 Integrated Development Environment
        4.2.4 Libraries
        4.2.5 Server

5 Methods
    5.1 Motif to Protein algorithm
        5.1.1 Loading and accessing protein sequence data
        5.1.2 Sequence Matching and Scoring
    5.2 GO Associations
        5.2.1 Mapping sequences and annotations
        5.2.2 Extracting GO terms from DAG
    5.3 Taxonomy Tree
        5.3.1 Generating Taxonomy
    5.4 Statistical Test
        5.4.1 Species Enrichment
        5.4.2 Gene Ontology Enrichment
    5.5 Web Service
        5.5.1 dieGOS
        5.5.2 Input
        5.5.3 Species Tree
        5.5.4 Statistical Test
        5.5.5 Results
        5.5.6 Visualization

6 Results and Discussion
    6.1 Results Comparison
    6.2 Similar tools Comparison

7 Conclusions
    7.1 Possible future work
    7.2 Comments

Appendices

A Similar Tools
    A.1 AmiGO
    A.2 BiNGO
    A.3 Bioconductor
    A.4 DAVID
    A.5 ermineJ
    A.6 FuSSiMeG
    A.7 g:Profiler
    A.8 GOrilla
    A.9 GREAT
    A.10 GOBU
    A.11 GOFFA
    A.12 GeneMANIA
    A.13 GeneMerge
    A.14 GeneTools
    A.15 TermFinder
    A.16 GOTermMapper
    A.17 GoBean
    A.18 GraphWeb
    A.19 GoMiner
    A.20 NOA
    A.21 Onto-Express
    A.22 OE2GO
    A.23 PiNGO
    A.24 ProteInOn
    A.25 STEM
    A.26 StRAnGER
    A.27 ToppGene
    A.28 agriGO

B Package Documentation
    B.1 Sequence
        B.1.1 Sequence.java
        B.1.2 PWMScore.java
        B.1.3 RegExp.java
    B.2 stats
    B.3 go
    B.4 net
    B.5 taxonomy
    B.6 data
    B.7 bean
    B.8 controller

List of Figures

2.1 Flow diagram of the UniprotKB annotation process.
2.2 Screenshot from the software OBO-Edit, showing a small set of terms from the ontology.

5.1 Timeline for each stage.
5.2 10 runs readFastaFile.
5.3 Used heap after read fasta file.
5.4 10 runs RegExp.search.
5.5 dieGOS Logo
5.6 dieGOS: Input component
5.7 dieGOS: Input Advanced Options for PWM component
5.8 dieGOS: Species Tree
5.9 dieGOS: Statistical Significance Test
5.10 dieGOS: Results
5.11 dieGOS: Tree Enriched Species
5.12 dieGOS: Bundle Species and GO terms
5.13 visualVM: Startup process

6.1 Species enrichment of NLS motif using FET
6.2 Species enrichment of NLS motif using MWW
6.3 Enrichment results graph

B.1 Classes diagram of sequence package.
B.2 Classes diagram of stats package.
B.3 Classes diagram of go package.
B.4 Classes diagram of net package.
B.5 Classes diagram of taxonomy package.
B.6 Classes diagram of data package.
B.7 Classes diagram of bean package.
B.8 Classes diagram of controller package.

List of Tables

2.1 Position Frequency Matrix for Cytochrome P450
2.2 Position Weight Matrix for Cytochrome P450
2.3 Confusion Matrix for hypergeometric distribution

4.1 Database files and Statistics
4.2 Libraries and description

5.1 Comparison normal and concurrent PWM Scoring
5.2 Statistics GO Domain DAG.
5.3 GO files after processing
5.4 10 runs FET for species and GO terms

6.1 Results: Comparison contingency matrices
6.2 Comparison between GO enrichment tools.

Chapter 1

Introduction

This project aims to develop a tool that determines the statistical significance of a user-specified sequence motif associated with a species and identifies the most representative Gene Ontology (GO) terms of the matching sequences for that species. The tool should also be accessible to scientists worldwide. Apart from the tool’s applicability, the project aims to design and implement algorithms that are effective and efficient in terms of both time and space complexity.

The project evaluates the tool’s capability to find whether certain known protein localisation signals (PLS) are species-specific and which molecular functions are associated with the species of interest. Similar enrichment tools have functional gaps that this project addresses; the project therefore includes an analysis and comparison of the functionality of these GO enrichment tools and this tool. Since the project involves accessing and searching large databases and applying complex statistical tests, it considers the computational complexity of the implemented algorithms.

After obtaining a specific motif using motif discovery tools, scientists can use it as input to this application, which characterizes the specificity of the motif to different species. For example, we can test whether a particular PLS is plant-specific or not. Thus, the outcome of this project contributes to the wider scientific community.

It is important for the tool to support different representations of the sequence motif. A motif is mainly represented as a regular expression or a Position Weight Matrix (PWM), and many formats are available for these representations. The tool accepts the formats currently available (PROSITE notation, regular expression, Clustal file format, MEME file format, Distrib file format).

The user also needs to select the species of interest from the taxonomy tree. Because of the number of species, a mechanism is needed to browse the tree easily and to search for a particular species in it. The tool provides an expandable taxonomy tree for browsing and a search by species name or taxon number. Data from NCBI and UniProt are used to build this tree.

In order to determine whether protein properties are shared between species, the tool looks for matches between the user-specified sequence motif and a database of well-annotated protein sequences. UniProt, and specifically SwissProt, is the protein database chosen for motif matching because of its high-quality, manually added annotations. This ensures that the result of the analysis is based on reliable information. On the other hand, many proteins are only computationally annotated (unreviewed), and these may belong to species that are not present in SwissProt.

A statistical significance test such as Fisher’s exact test (FET) is necessary to determine whether the proportion of matching and non-matching sequences for a species differs significantly from that of all other species. The Mann-Whitney-Wilcoxon (MWW) test can be used instead if the motif is represented as a PWM and a threshold to split the sequences is not known. To determine whether a Molecular Function, or a term from another GO domain, is statistically associated with a species, FET is applied to the GO terms (annotations) attached to the matching sequences.

Tabular and visual results are necessary to analyse the generated information effectively. The tabular format includes functionality such as searching, pagination, ordering of results, exporting, and links to information about the enriched species and GO terms. The visualizations support the analysis of enrichment between species and between the GO terms associated with those species.

An easy access mechanism is necessary to make the tool openly available, and the Internet is the natural option. The tool is therefore implemented as a web service that lets the user specify the input motif and the species to be analysed, and obtain useful results. Users are expected to perform many different tests, so the data must be processed and the results returned in an acceptable time frame.

1.1 Challenges

The following list shows the main challenges faced during the development of the tool. The proposed solutions are explained in the Methods chapter.

• Protein access The number of proteins in the SwissProt database is around 540,000 to date. Quick access to the data is necessary to extract key protein information such as species and sequence.

• Protein scan As the motif can be represented as a regular expression (regexp) or a PWM, looking for matches against a large protein database has to be performed efficiently in both cases.

• GO annotations access Due to the relations between GO terms and the enormous number of GO annotations, a data structure capable of quickly extracting the direct and transitive annotations of the proteins is required.

• Statistical test The statistical significance test consumes a lot of computational time, especially in relation to the hypergeometric distribution, whose factorials produce very large numbers. Computationally feasible approximations are needed for these calculations.

Chapter 2

Background

2.1 Sequence Motif

A motif is a recurring nucleotide or amino acid pattern that is presumed to have a biologically significant function. It is usually composed of a short sequence of contiguous residues, although it sometimes shows a more distributed pattern. Functionally related sequences often share similar distribution patterns for specific functional residues [40]. Sequence motifs need to be described and represented in order to work with them.

2.1.1 PROSITE Pattern notation

The PROSITE notation, like several other notations, uses a variant of regular expression (regexp) syntax for describing a motif. Regexp pattern matching is a well-studied topic in Computer Science, and many algorithms and tools have been developed to match a regexp in a text string. Although this approach to representing motifs is widely used, a regexp is too rigid to represent highly divergent protein motifs. The limitation of this notation is its inability to represent probabilities for the presence of a certain residue or nucleotide in the sequence [34]. For example, the consensus pattern for the Cytochrome P450 cysteine heme-iron ligand signature is [FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD].
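As an illustration of how such a pattern can be matched with a standard regexp engine, the following sketch translates PROSITE syntax into a Java regular expression. It is a minimal example and not the tool's actual implementation: the class name PrositeToRegex and the method toRegex are hypothetical, and only the basic elements of the notation (x, [..], {..}, repetition counts and the < and > anchors) are handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of dieGOS).
final class PrositeToRegex {

    /** Translates a basic PROSITE pattern into Java regexp syntax. */
    static String toRegex(String prosite) {
        StringBuilder regex = new StringBuilder();
        for (char ch : prosite.trim().toCharArray()) {
            switch (ch) {
                case '-': break;                       // element separator, nothing to emit
                case '.': break;                       // optional terminating period
                case 'x': regex.append('.'); break;    // any residue
                case '{': regex.append("[^"); break;   // start of excluded residues
                case '}': regex.append(']'); break;    // end of excluded residues
                case '(': regex.append('{'); break;    // start of repetition count
                case ')': regex.append('}'); break;    // end of repetition count
                case '<': regex.append('^'); break;    // anchored at the N-terminus
                case '>': regex.append('$'); break;    // anchored at the C-terminus
                default:  regex.append(ch);            // residues, '[', ']', digits, ','
            }
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        String motif = "[FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD]";
        Pattern pattern = Pattern.compile(toRegex(motif));
        System.out.println(toRegex(motif));
        // FGAGRRICPG is a made-up sequence chosen to satisfy every position.
        System.out.println(pattern.matcher("FGAGRRICPG").find());
    }
}
```

Because the translation is character by character, it stays linear in the pattern length; a production version would also need to validate the input and handle anchors that appear inside brackets.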

2.1.2 Matrices

Another possible description for a motif is a Position Frequency Matrix (PFM). Rather than only keeping track of the most common base at each position, this matrix registers how often each base occurs in known sites [11]. Table 2.1 shows an example of PFM calculated from the Cytochrome P450 cysteine heme-iron ligand signature in Clustal format (Alignment).


Sym   Pos1   Pos2   Pos3   Pos4   Pos5   Pos6   Pos7   Pos8   Pos9   Pos10
A     0.001  0.001  0.050  0.001  0.050  0.001  0.071  0.001  0.072  0.057
C     0.001  0.001  0.050  0.001  0.005  0.001  0.004  0.981  0.001  0.001
D     0.001  0.001  0.050  0.032  0.006  0.001  0.007  0.001  0.001  0.013
E     0.001  0.001  0.050  0.001  0.009  0.001  0.011  0.001  0.001  0.001
F     0.964  0.001  0.050  0.001  0.001  0.001  0.032  0.001  0.019  0.001
G     0.001  0.680  0.050  0.949  0.005  0.001  0.043  0.001  0.001  0.913
H     0.001  0.007  0.050  0.001  0.005  0.132  0.018  0.001  0.001  0.001
I     0.001  0.001  0.050  0.001  0.028  0.001  0.165  0.001  0.291  0.001
K     0.001  0.001  0.050  0.001  0.162  0.018  0.030  0.001  0.001  0.001
L     0.001  0.001  0.050  0.001  0.024  0.001  0.041  0.001  0.228  0.001
M     0.001  0.001  0.050  0.001  0.015  0.001  0.069  0.001  0.015  0.001
N     0.001  0.029  0.050  0.001  0.007  0.001  0.147  0.001  0.001  0.001
P     0.001  0.001  0.050  0.001  0.273  0.013  0.001  0.001  0.247  0.001
Q     0.001  0.001  0.050  0.001  0.029  0.001  0.055  0.001  0.001  0.001
R     0.001  0.001  0.050  0.001  0.261  0.809  0.077  0.001  0.001  0.001
S     0.001  0.267  0.050  0.001  0.030  0.001  0.102  0.001  0.001  0.001
T     0.001  0.001  0.050  0.001  0.017  0.013  0.030  0.001  0.001  0.001
V     0.001  0.001  0.050  0.001  0.067  0.001  0.080  0.001  0.114  0.001
W     0.018  0.001  0.050  0.001  0.001  0.001  0.002  0.001  0.001  0.001
Y     0.001  0.001  0.050  0.001  0.003  0.001  0.014  0.001  0.001  0.001

Table 2.1: Position Frequency Matrix for Cytochrome P450

A PWM contains log-odds weights for computing a match score. To construct a PWM, a set of sequences must be aligned and the most conserved columns extracted. We define the probability of residue type $a$ occurring in column $u$ of the PWM as $q_{u,a}$, and the probability of residue type $a$ occurring at any position in any sequence as $p_a$. The probability $p_a$ can be extracted directly from the PFM, but it is not restricted to the aligned sequences: a background model of sequences can be used instead [40]. The log-odds form of a PWM element can be written as

\[ m_{u,a} = \log \frac{q_{u,a}}{p_a} \tag{2.1} \]

The probability $q_{u,a}$ can be determined using

\[ q_{u,a} = \frac{n_{u,a} + b\,p_a}{N + b} \tag{2.2} \]

where $n_{u,a}$ is the number of times that amino acid $a$ is observed in column $u$ of the $N$ sequences, $p_a$ is the prior probability of $a$, and $b$ is the so-called pseudo-count, whose value is set to $N^{1/2}$. Table 2.2 presents the log odds computed from the PFM presented above and a uniformly distributed background ($1/20 = 0.05$).
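The two formulas can be combined in a few lines of code. The sketch below is only an illustration under stated assumptions: the class PwmLogOdds and its method logOdds are hypothetical names, the input is a matrix of raw column counts, and the natural logarithm is assumed because it is consistent with the tabulated weights (for instance, ln(0.964 / 0.05) is approximately 2.96).

```java
// Hypothetical helper, for illustration only (not part of dieGOS).
final class PwmLogOdds {

    /**
     * Computes log-odds weights from column counts.
     *
     * @param counts     counts[u][a] is n_{u,a}, the occurrences of symbol a in column u
     * @param background background[a] is the prior probability p_a
     * @param n          N, the number of aligned sequences
     */
    static double[][] logOdds(int[][] counts, double[] background, int n) {
        double b = Math.sqrt(n);                                   // pseudo-count b = N^(1/2)
        double[][] m = new double[counts.length][background.length];
        for (int u = 0; u < counts.length; u++) {
            for (int a = 0; a < background.length; a++) {
                double q = (counts[u][a] + b * background[a]) / (n + b);   // Eq. 2.2
                m[u][a] = Math.log(q / background[a]);                     // Eq. 2.1, natural log assumed
            }
        }
        return m;
    }

    public static void main(String[] args) {
        // Toy two-symbol alphabet with a uniform background; one column seen in 4 sequences.
        double[][] m = logOdds(new int[][] {{4, 0}}, new double[] {0.5, 0.5}, 4);
        System.out.println(m[0][0] + " " + m[0][1]);
    }
}
```

The pseudo-count keeps every weight finite: a symbol that never occurs in a column still receives a small probability instead of a log-odds of minus infinity.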

Sym   Pos1   Pos2   Pos3   Pos4   Pos5   Pos6   Pos7   Pos8   Pos9   Pos10
A     -3.90  -3.90  +0.00  -3.90  -0.01  -3.90  +0.35  -3.90  +0.36  +0.13
C     -3.90  -3.90  +0.00  -3.90  -2.29  -3.90  -2.51  +2.98  -3.90  -3.90
D     -3.90  -3.90  +0.00  -0.43  -2.11  -3.90  -1.95  -3.90  -3.90  -1.34
E     -3.90  -3.90  +0.00  -3.90  -1.70  -3.90  -1.50  -3.90  -3.90  -3.90
F     +2.96  -3.90  +0.00  -3.90  -3.90  -3.90  -0.43  -3.90  -0.96  -3.90
G     -3.90  +2.61  +0.00  +2.94  -2.29  -3.90  -0.16  -3.90  -3.90  +2.90
H     -3.90  -1.95  +0.00  -3.90  -2.29  +0.97  -1.01  -3.90  -3.90  -3.90
I     -3.90  -3.90  +0.00  -3.90  -0.57  -3.90  +1.19  -3.90  +1.76  -3.90
K     -3.90  -3.90  +0.00  -3.90  +1.18  -1.01  -0.50  -3.90  -3.90  -3.90
L     -3.90  -3.90  +0.00  -3.90  -0.72  -3.90  -0.19  -3.90  +1.52  -3.90
M     -3.90  -3.90  +0.00  -3.90  -1.19  -3.90  +0.32  -3.90  -1.19  -3.90
N     -3.90  -0.53  +0.00  -3.90  -1.95  -3.90  +1.08  -3.90  -3.90  -3.90
P     -3.90  -3.90  +0.00  -3.90  +1.70  -1.34  -3.90  -3.90  +1.60  -3.90
Q     -3.90  -3.90  +0.00  -3.90  -0.53  -3.90  +0.09  -3.90  -3.90  -3.90
R     -3.90  -3.90  +0.00  -3.90  +1.65  +2.78  +0.43  -3.90  -3.90  -3.90
S     -3.90  +1.68  +0.00  -3.90  -0.50  -3.90  +0.72  -3.90  -3.90  -3.90
T     -3.90  -3.90  +0.00  -3.90  -1.07  -1.34  -0.50  -3.90  -3.90  -3.90
V     -3.90  -3.90  +0.00  -3.90  +0.29  -3.90  +0.47  -3.90  +0.83  -3.90
W     -1.01  -3.90  +0.00  -3.90  -3.90  -3.90  -3.21  -3.90  -3.90  -3.90
Y     -3.90  -3.90  +0.00  -3.90  -2.80  -3.90  -1.26  -3.90  -3.90  -3.90

Table 2.2: Position Weight Matrix for Cytochrome P450

2.2 Protein Localisation Signal

Proteins are synthesized on ribosomes in the intracellular fluid. However, they usually need to be transported to their final destination. Typically, a Protein Localisation Signal (PLS) consists of one or more short amino acid sequences that help to direct the protein to its target subcellular location and enable the interaction with the cellular transport machinery [40]. A Nuclear Localisation Signal (NLS) is a specific type of PLS that marks a protein for import into the cell nucleus. As NLSs have been shown to be highly conserved among eukaryotes [21], the proposed tool can be used to examine whether the proteins that match a specific motif share the same properties across species.

2.3 UniProtKB

The Universal Protein Knowledgebase (UniProtKB) is one of the four databases of the Universal Protein Resource (UniProt). It is the central access point for extensively curated protein information and is composed of Swiss-Prot and TrEMBL. Swiss-Prot is manually annotated and reviewed, whereas TrEMBL is automatically annotated and not reviewed. The annotation process used in UniProtKB is shown in Fig. 2.1 as a flow diagram with the manually curated entries of UniProtKB/Swiss-Prot and the automated entries of UniProtKB/TrEMBL [19]. UniProtKB includes complete proteome and reference proteome sets. The data are updated and distributed every four weeks and can be searched or downloaded at http://www.uniprot.org (see [1]).

Figure 2.1: Flow diagram of the UniprotKB annotation process.

2.4 Taxonomy

Taxonomy (biology) is the science of defining groups of biological organisms based on shared characteristics and giving names to those groups [39]. Organisms are grouped together into taxa (singular: taxon) and these groups may be aggregated as well to form a super group and thus create a taxonomic hierarchy.

2.4.1 NCBI Taxonomy database

The NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy) includes organism names and taxonomic lineages for each of the sequences represented in the INSDC’s nucleotide and protein sequence databases (EMBL/GenBank/DDBJ). The taxonomy database is manually curated to maintain a phylogenetic taxonomy for the organisms represented in the sequence databases. UniProt uses this taxonomic lineage data of the source organism in its database. This project uses the same data to build a taxonomy tree in which the user can browse and find the species of interest [17].

2.5 Gene Ontology

2.5.1 Ontology

At present, biological databases hold a huge amount of information. The objective of this project is to acquire knowledge using this information. The problem is that this information is not always described with uniform terms. In Computer Science, one of the methods that aims to satisfy this need for a common understanding of concepts is the creation of ontologies. An ontology can be defined as a way of representing a common understanding of a domain based on a consensus about the meaning of, and the relationships between, terms [33].

2.5.2 Gene Ontology Project

The Gene Ontology project (http://www.geneontology.org/) is a collaborative effort to maintain and develop ontologies to support biologically meaningful annotation of genes and their products [2]. It provides a “common language” through ontology terms and their structure. The use of a consistent vocabulary allows genes from different species to be compared based on their GO annotations (terms). The Gene Ontology Annotation database (GOA) is the source used in this project.

2.5.3 Structure of GO

The structure of the GO is important for the project because it defines the relationships between ontology terms. The terms resemble a hierarchy: higher-level terms are more general and are assigned to more genes, and more specific descendant terms are related to their parents by either “is a” or “part of” relationships. Unlike in a strict hierarchy, however, a term may have more than one parent term, so the relationships form a directed acyclic graph (DAG) [2]. The three GO domains (cellular component, biological process, and molecular function) are each represented by a root ontology term. The domain terms are unrelated and do not have a common parent node. This design allows inferring the domain and the relationships that are evaluated for statistical significance (refer to Fig. 2.2). This evaluation is called ‘GO enrichment analysis’.
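Because a term can have several parents, collecting all the more general terms of an annotation means traversing the DAG rather than following a single path to the root. The sketch below illustrates this idea with a plain adjacency map; the class GoDagSketch, its methods and the term identifiers are hypothetical and do not reflect the data structures used by the tool. It is written for a current Java version.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical class, for illustration only (not part of dieGOS).
final class GoDagSketch {

    // child term -> direct parents ("is a" and "part of" edges are treated alike here)
    private final Map<String, Set<String>> parents = new HashMap<>();

    void addRelation(String child, String parent) {
        parents.computeIfAbsent(child, k -> new HashSet<>()).add(parent);
    }

    /** Returns every term reachable from the given term by following parent edges. */
    Set<String> ancestors(String term) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(term);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            for (String parent : parents.getOrDefault(current, Collections.<String>emptySet())) {
                if (seen.add(parent)) {      // visit each ancestor once, even when several paths lead to it
                    pending.push(parent);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        GoDagSketch dag = new GoDagSketch();
        // Placeholder identifiers, not real GO terms.
        dag.addRelation("T:leaf", "T:midA");
        dag.addRelation("T:leaf", "T:midB");
        dag.addRelation("T:midA", "T:root");
        dag.addRelation("T:midB", "T:root");
        System.out.println(dag.ancestors("T:leaf"));
    }
}
```

A protein annotated with the leaf term is then implicitly annotated with every ancestor as well, which is why higher-level terms end up assigned to more genes.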

Figure 2.2: Screenshot from the software OBO-Edit, showing a small set of terms from the ontology.

2.6 Statistical Test

2.6.1 Fisher’s exact test

Fisher’s exact test [18] is a statistical significance test applied in the analysis of contingency tables to examine the significance of the association between observations. In other words, the test determines whether the observations reflect a pattern or have arisen just by chance. In our case, we want to determine whether proteins that contain a specific motif share some properties between species.

The test computes the probability of the observed data based on the hypergeometric distribution. The motif-matching proteins with the property to be tested (for example, species-specific) are part of the foreground and all the other proteins that have the property are part of the background. Table 2.3 represents the confusion matrix for species enrichment, where a, b, c, d are the counts of proteins grouped by matching and non-matching the motif, and by being or not being part of the species.

                Match    NoMatch    Row Total
Species         a        b          a + b
Non-Species     c        d          c + d
Column Total    a + c    b + d      a + b + c + d

Table 2.3: Confusion Matrix for hypergeometric distribution

In order to determine the most enriched GO terms the same principle applies. For each GO term in the motif-matching proteins, we count the numbers of proteins in the foreground set and the background set with and without this term. The test returns the p-value of the term; the smaller the p-value, the greater the evidence of statistical significance [12]. The probability of an observed table is given by

\[ p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!} \tag{2.3} \]

where $n = a + b + c + d$, $\binom{n}{k}$ is the binomial coefficient and the symbol ! indicates the factorial operator. The p-value is the sum of these probabilities over the observed table and all tables that are at least as extreme under the same marginal totals.

Numerical difficulties result from the large values taken by the factorials. Computationally feasible alternatives, such as the gamma or log-gamma function, use approximations to obtain the p-value.
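One way to avoid the overflow of the factorials is to work with their logarithms throughout. The following sketch is an illustration only, not the tool's implementation: the class FisherExactSketch and its methods are hypothetical, and instead of a log-gamma approximation it uses a precomputed table of log-factorials, a substitution made here because the Java standard library offers no log-gamma function. It computes the probability of a single table and a one-sided (enrichment) p-value.

```java
// Hypothetical class, for illustration only (not part of dieGOS).
final class FisherExactSketch {

    private final double[] logFact;   // logFact[i] = ln(i!)

    /** @param maxN largest table total (a + b + c + d) that will be tested */
    FisherExactSketch(int maxN) {
        logFact = new double[maxN + 1];
        for (int i = 1; i <= maxN; i++) {
            logFact[i] = logFact[i - 1] + Math.log(i);
        }
    }

    /** Natural log of the hypergeometric probability of the table (a, b; c, d). */
    double logProbability(int a, int b, int c, int d) {
        int n = a + b + c + d;
        return logFact[a + b] + logFact[c + d] + logFact[a + c] + logFact[b + d]
                - logFact[a] - logFact[b] - logFact[c] - logFact[d] - logFact[n];
    }

    /** One-sided p-value: probability of a table with at least a matches in the species. */
    double pValueGreater(int a, int b, int c, int d) {
        double p = 0.0;
        // Moving one count from b to a (and from c to d) keeps all marginal totals fixed.
        while (b >= 0 && c >= 0) {
            p += Math.exp(logProbability(a, b, c, d));
            a++; b--; c--; d++;
        }
        return Math.min(1.0, p);
    }

    public static void main(String[] args) {
        FisherExactSketch fet = new FisherExactSketch(8);
        System.out.println(fet.pValueGreater(3, 1, 1, 3));
    }
}
```

The table is filled once and reused for every test, so each probability costs only nine array lookups regardless of how large the counts are.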

2.6.2 Mann-Whitney-Wilcoxon test

The MWW test [28] (also called the Mann-Whitney U test, Wilcoxon rank-sum test, or Wilcoxon-Mann-Whitney test) is a non-parametric test of the null hypothesis that two populations are the same against an alternative hypothesis, in particular that one population tends to have larger values than the other. This test is applied to examine the significance of a PWM motif in a species when no score cut-off is specified to decide whether an input sequence matches the motif. This approach is comparable to using ranked lists of genes as input. After scoring the protein sequences using the PWM, the resulting list is split based on the property to be tested, e.g. species membership, and the output is two ranked lists.

The p-value associated with the Mann-Whitney U statistic for two independent samples (the ranked lists) can be computed with the following formulas. The sums of the ranks in the two lists are denoted $R_1$ and $R_2$, and the respective sample sizes $n_1$ and $n_2$.

\[ U_1 = R_1 - \frac{n_1 (n_1 + 1)}{2} \tag{2.4} \]

\[ U_2 = n_1 n_2 - U_1 \tag{2.5} \]

\[ U_{min} = n_1 n_2 - U_{max} \tag{2.6} \]

where $U_{max}$ is the larger of $U_1$ and $U_2$. $U$ is approximately normally distributed for large samples and the standardized value $Z$ is calculated by

\[ Z = \frac{U_{min} - m_u}{\sigma_u} \tag{2.7} \]

where $m_u$ and $\sigma_u$ are the mean and the standard deviation of $U$. The significance of $Z$ can be verified using the cumulative distribution function of the standard normal distribution. $m_u$ and $\sigma_u$ are given by

\[ m_u = \frac{n_1 n_2}{2} \tag{2.8} \]

\[ \sigma_u = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} \tag{2.9} \]

Chapter 3

Previous Work

This chapter gives a general review of the source code repository used as a basis for this tool. It also introduces earlier work on Gene Ontology enrichment analysis tools, whose functionality is compared with that of this project’s tool (refer to Section 6.2).

3.1 Repository

The UQ Bioinformatics Research Group repository (https://bioinf-scmb.biosci.uq.edu.au) includes several bioinformatics tools for different purposes, as well as source code implemented in different programming languages, mainly Java and Python. The repository is actively updated and upgraded by the group members and is managed by Dr Mikael Bodén, who is the main developer. This project took advantage of the source code already available as a reference. However, most of the tool’s source code was developed from scratch.

3.2 Gene Ontology enrichment tools

A list of GO tools can be found on the web page of the Gene Ontology project (http://www.geneontology.org/GO.tools.shtml). However, the web page is no longer updated regularly and the registry is now handled by NEUROLEX (http://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools). The list is categorized by functionality and includes tools developed by the GO Consortium and by third parties. dieGOS belongs to the category Term enrichment Tools in the latter list. From this list, the tools that are similar in functionality to dieGOS are reviewed and compared (for more information refer to Appendix A).

Chapter 4

Materials

This chapter describes the data used and the technologies employed for the operation and development of dieGOS.

4.1 Data

This project makes use of a wide variety of data from different sources. The data are downloaded from public repositories as plain text files. Each file contains data that are related to data in other files by identifiers. The tool uses these identifiers for indexing and for integrating data across files. The data can be classified into three main components:

• Protein sequences

• Taxonomy

• Gene Ontology

4.1.1 Protein Sequences

The file containing the protein data is downloaded from the UniProtKB repository as a compressed file named “uniprot_sprot.fasta.gz”. It contains one large FASTA file called “uniprot_sprot.fasta” with ≈ 540000 protein sequences. Each sequence is also associated with additional data, e.g., the sequence identifier (accession number) and the organism name.
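The additional data sit in the header line that precedes each sequence. As a rough sketch of how the accession number and the organism name could be extracted, the example below parses a header of the usual UniProtKB form (database, accession and entry name separated by vertical bars, followed by a description and tags such as OS= for the organism). The class UniProtHeaderSketch is hypothetical, and the header in the example is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of dieGOS).
final class UniProtHeaderSketch {

    // >db|Accession|EntryName Description OS=Organism name [other two-letter tags such as GN=, PE=, SV=]
    private static final Pattern HEADER = Pattern.compile(
            "^>(?:sp|tr)\\|([^|]+)\\|(\\S+)\\s.*?\\bOS=(.+?)(?=\\s[A-Z]{2}=|$)");

    /** Returns {accession, entry name, organism name}. */
    static String[] parse(String headerLine) {
        Matcher m = HEADER.matcher(headerLine);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a UniProtKB FASTA header: " + headerLine);
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Made-up entry used only to show the layout of a header line.
        String[] fields = parse(">sp|Q00000|DEMO_HUMAN Demo protein OS=Homo sapiens GN=DEMO PE=1 SV=1");
        System.out.println(fields[0] + " / " + fields[2]);
    }
}
```

The organism name is read up to the next two-letter tag, so names that contain spaces or parentheses are kept whole.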

4.1.2 Gene Ontology

GO data used in this tool is found in 2 files that can be directly downloaded from the Gene Ontology Project (Section 2.5.2) web page.


Filename                        Date         Entries      Size
uniprot_sprot.fasta             16-Oct-13    541561       259.1 MB
speclist.txt                    16-Oct-13    22778        1.9 MB
nodes.dmp                       02-Nov-13    1088999      73.8 MB
gene_ontology_ext.obo.txt       31-Oct-13    2001317      29.3 MB
gene_association.goa_uniprot    16-Oct-13    195819516    35.67 GB

Table 4.1: Database files and Statistics

The structure of the GO terms is defined as a DAG (Section 2.5.3) in the file “gene_ontology_ext.obo.txt”. The biggest file is “gene_association.goa_uniprot”, which contains the GO annotations and can be downloaded from the section Annotations → unfiltered files.
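To give an idea of how the annotation file links proteins to GO terms, the sketch below reads lines in the tab-separated GO Annotation File (GAF) layout, where the second column holds the protein identifier, the fourth an optional qualifier and the fifth the GO term. The class GafSketch is hypothetical, the sample identifiers are made up, and skipping NOT-qualified annotations is a choice of this example rather than a documented behaviour of the tool.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, for illustration only (not part of dieGOS).
final class GafSketch {

    /** Adds the annotation on one GAF line to the map protein identifier -> GO terms. */
    static void addLine(String line, Map<String, Set<String>> termsByProtein) {
        if (line.isEmpty() || line.startsWith("!")) {
            return;                                   // header and comment lines start with '!'
        }
        String[] col = line.split("\t");
        if (col.length < 5 || col[3].contains("NOT")) {
            return;                                   // malformed line or negated annotation
        }
        termsByProtein.computeIfAbsent(col[1], k -> new TreeSet<>()).add(col[4]);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> terms = new HashMap<>();
        // Made-up annotation lines, shortened to the first columns.
        addLine("!gaf-version: 2.0", terms);
        addLine("UniProtKB\tQ00000\tDEMO\t\tGO:0000001\tPMID:0\tIEA", terms);
        addLine("UniProtKB\tQ00000\tDEMO\tNOT\tGO:0000002\tPMID:0\tIEA", terms);
        System.out.println(terms);
    }
}
```

Processing the file line by line in this way avoids holding the raw text in memory, which matters for a file of this size.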

4.1.3 Taxonomy

In order to construct the taxonomy tree, two files are required. The first file is named “speclist.txt” and can be obtained from UniProt’s download web page. This file contains a controlled vocabulary of species, where each entry is composed of the organism’s name and a six-digit number that indicates the taxonomic node in the NCBI taxonomy to which the organism is assigned. The second is a compressed file called “taxdump.tar.gz” that can be downloaded from the NCBI Taxonomy database (Section 2.4.1). The file “nodes.dmp” in the compressed file contains the relations between the nodes in the taxonomy.
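The relations in “nodes.dmp” are enough to recover the lineage of any species. The sketch below assumes the usual layout of that file, in which fields are separated by a tab, a vertical bar and a tab, the first field is the node identifier and the second is the identifier of its parent, with the root node listed as its own parent. The class TaxonomySketch and the small identifiers in the example are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class, for illustration only (not part of dieGOS).
final class TaxonomySketch {

    private final Map<Integer, Integer> parentOf = new HashMap<>();

    /** Registers one line of nodes.dmp: tax_id | parent tax_id | rank | ... */
    void addNodesDmpLine(String line) {
        String[] fields = line.split("\\t\\|\\t?");
        parentOf.put(Integer.parseInt(fields[0].trim()), Integer.parseInt(fields[1].trim()));
    }

    /** Returns the path from the given node up to the root, starting with the node itself. */
    List<Integer> lineage(int taxId) {
        List<Integer> path = new ArrayList<>();
        Integer current = taxId;
        while (current != null) {
            path.add(current);
            Integer parent = parentOf.get(current);
            // Stop at the root (which is its own parent) or at an unknown node.
            current = (parent == null || parent.equals(current)) ? null : parent;
        }
        return path;
    }

    public static void main(String[] args) {
        TaxonomySketch taxonomy = new TaxonomySketch();
        // Made-up three-node tree; real files use NCBI taxon identifiers.
        taxonomy.addNodesDmpLine("1\t|\t1\t|\tno rank\t|");
        taxonomy.addNodesDmpLine("10\t|\t1\t|\tgenus\t|");
        taxonomy.addNodesDmpLine("20\t|\t10\t|\tspecies\t|");
        System.out.println(taxonomy.lineage(20));
    }
}
```

Storing only the child-to-parent map keeps memory use proportional to the number of nodes; a browsable tree additionally needs the inverse parent-to-children map.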

4.1.4 Statistics of Database Files

Table 4.1 shows some statistics that give a better understanding of the relative sizes of the database files. It includes the file name, modification date, number of entries and size. It also shows the large number of annotations that the tool has to deal with. Even though the annotation file is by far the biggest, the amount of data in the other files is also considerable.

4.2 Technology

This section introduces the materials related to the technology used for the construction of the tool. It describes each component of the development environment and the technology necessary to run the application on a server.

4.2.1 Computer

A MacBook Pro 13-inch laptop was used to develop, to deploy and to test the application. The computer has the following specifications.

• Processor 2.9 GHz Intel Core i7

• Memory 8 GB 1600 MHz DDR3

• Software OS X 10.9 (13A603)

4.2.2 Programming Language

The tool was mainly developed using Java and partly in JavaScript. Java SoyLatte was employed during development. It is a Java 6 Port for Mac OS X Intel machines that is supported by The Dynamic Code Evolution Virtual Machine (DCE VM) in order to avoid redeploying and reloading the application each time a change is made in the server side.

4.2.3 Integrated Development Environment

Eclipse Java EE IDE for Web Developers, version Indigo Service Release 2, was used for the Java programming part. WebStorm 7 was used for the JavaScript development, and Firebug supplemented these IDEs with JavaScript debugging on the client side.

4.2.4 Libraries

Table 4.2 presents a list of the packages used by the tool in order to operate and a short description of their function.

4.2.5 Server

In order to put the application online, it needs to be deployed on a web server with servlet capabilities (a servlet container), e.g., Apache Tomcat. However, the application can also be run on application servers such as JBoss or GlassFish. For testing purposes apache-tomcat-6.0.37 was used because the server where this application will be deployed uses it.

Library                               Description

Server Side
jsf-api-2.1.11.jar                    JavaServer Faces API.
jsf-impl-2.1.11.jar                   Oracle’s implementation of the JSF 2.2 specification.
jstl1.2.jar                           The JavaServer Pages Standard Tag Library is a collection of useful JSP tags.
weld-servlet-1.1.10.Final.jar         Reference implementation of CDI: Contexts and Dependency Injection.
el-api-2.2.jar                        Expression Language API enables the presentation layer to communicate with the application logic.
el-impl-2.2.jar                       Expression Language Implementation for el-api.
javax.servlet-api-3.0.1.jar           Manages the contracts between a servlet class and the runtime environment.

Gene Ontology
obo.jar                               OBO-Edit OBO API
bbop.jar                              OBO-Edit BBOP API
commons-logging-1.1.1.jar             Bridge between different logging implementations.
log4j-1.2.15.jar                      A logging library for Java

PrimeFaces
primefaces-3.5.jar                    Component library for JavaServer Faces.
itext-2.1.7.jar                       Dynamic PDF generation and manipulation.
commons-io-2.4.jar                    Library of utilities to assist with developing IO functionality.
commons-fileupload-1.3.jar            File upload capability to web applications.
poi-3.9.jar                           API for Microsoft Documents (xls, doc, and ppt)
bluesky-1.0.10.jar                    Theme for PrimeFaces.
primefaces-extensions-0.7.1.jar       Additional JSF 2 components for PrimeFaces.
commons-lang3-3.1.jar                 Helper utilities for the java.lang API

Client Side
d3.js                                 JavaScript library that provides powerful visualization components.

Table 4.2: Libraries and description

Chapter 5

Methods

This chapter describes the methodology and procedures employed to implement the tool. It is presented as stages of the project, where each stage includes the decisions made to overcome the challenges (Section 1.1). Fig. 5.1 presents the established timeline for each stage of the project.

Figure 5.1: Timeline for each stage.

5.1 Motif to Protein algorithm

The first part of the project was the design and development of a method for searching and extracting the proteins that match a specific motif. Before the matching can take place, all the protein sequences need to be loaded. As mentioned before, the number of sequences reported by Uniprot-SwissProt is ≈ 540k. This number will continue to increase as new proteins are discovered, so an efficient algorithm was developed for reading the Fasta file and accessing its data. The tool is able to receive as input two different representations of a sequence motif. The approach to develop an efficient matching is also covered in this section. The algorithms are analysed with regard to computational time and space complexity. Big O notation is used to analyse the algorithms; this notation provides an upper bound on the growth rate of the function [37].

5.1.1 Loading and accessing protein sequence data

The method readFastaFile reads the given FASTA-formatted file and returns the list of sequences contained within it. Each protein sequence is composed of the actual sequence plus additional information. This information, contained in the definition line, is parsed by the method parseDef. If M is the number of sequences, N the size of a sequence, O the size of the header and P the number of databases, the space complexity is O(M ∗ (N + O)) and the time complexity is O(M ∗ (2N + P)). JUnit, a framework for writing repeatable tests, is used to test the algorithms. Fig. 5.2 shows the results after this algorithm was run 10 times using the fasta file from UniProt (Section 2.3). Fig. 5.3 was taken from VisualVM (http://visualvm.java.net/) and presents the memory footprint after reading the file.
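The single-pass reading described above can be illustrated with a minimal sketch. It is not the thesis code: the class name FastaSketch, the Record type and the accession helper are hypothetical names chosen for illustration, and the header is assumed to follow the UniProt "db|ACCESSION|NAME description" layout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a single-pass FASTA reader (not the thesis implementation).
class FastaSketch {
    /** One FASTA record: the definition line (without '>') and the residues. */
    static final class Record {
        final String header;
        final String residues;
        Record(String header, String residues) {
            this.header = header;
            this.residues = residues;
        }
    }

    /** Reads all records from a FASTA stream, visiting each line once. */
    static List<Record> read(Reader in) {
        List<Record> records = new ArrayList<Record>();
        String header = null;
        StringBuilder residues = new StringBuilder();
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                if (line.charAt(0) == '>') {
                    // A new definition line closes the previous record.
                    if (header != null) records.add(new Record(header, residues.toString()));
                    header = line.substring(1);
                    residues.setLength(0);
                } else if (header != null) {
                    residues.append(line); // sequences may span several lines
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        if (header != null) records.add(new Record(header, residues.toString()));
        return records;
    }

    /** Assumes a "db|ACCESSION|NAME ..." header; falls back to the whole header. */
    static String accession(String header) {
        String[] parts = header.split("\\|");
        return parts.length >= 2 ? parts[1] : header;
    }
}
```

Each character of the file is visited a constant number of times, which is consistent with the linear growth in M and N stated above.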

Figure 5.2: 10 runs readFastaFile.

It is important to note that an alphabet composed of the 20 amino acid symbols (see [9]) “ACDEFGHIKLMNPQRSTVWY” is employed and that sequences containing other symbols are rejected. 538925 sequences were loaded in total. The size of the original file (see Table 4.1) is half of the memory used. This is mainly because Java strings are encoded in Unicode with 16 bits per char, whereas the file is usually encoded in UTF-8 or another format that uses 8 bits per character.

Figure 5.3: Used heap after read fasta file.

Given the size of the data, an efficient data structure is necessary to map the protein identifiers to the related information (sequence, species). An implementation of Map such as HashMap (hash table) is used for this purpose. This data structure implements an associative array that maps keys to values. The key is the Uniprot accession number and the value is the instance of Sequence. In big O notation, the space complexity is O(n) and the average time complexity of a search is O(1) [38].
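The alphabet check and the accession-keyed map can be sketched together as follows. This is only an illustration under assumptions: SequenceIndexSketch and its methods are hypothetical names, and plain strings stand in for the Sequence instances used by the tool.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: reject sequences outside the 20-letter alphabet and
// index the accepted ones by accession number in a HashMap.
class SequenceIndexSketch {
    static final String ALPHABET = "ACDEFGHIKLMNPQRSTVWY";
    private static final boolean[] VALID = new boolean[128];
    static {
        for (char c : ALPHABET.toCharArray()) VALID[c] = true;
    }

    /** True if every symbol belongs to the 20 amino acid alphabet. */
    static boolean isValid(String residues) {
        for (int i = 0; i < residues.length(); i++) {
            char c = residues.charAt(i);
            if (c >= VALID.length || !VALID[c]) return false;
        }
        return true;
    }

    // Key: accession number; value: residues (a stand-in for a Sequence object).
    private final Map<String, String> byAccession = new HashMap<String, String>();

    /** Stores the sequence if it is valid; returns false if it was rejected. */
    boolean add(String accession, String residues) {
        if (!isValid(residues)) return false;
        byAccession.put(accession, residues);
        return true;
    }

    /** Average O(1) lookup by accession; null if the accession is unknown. */
    String get(String accession) {
        return byAccession.get(accession);
    }

    int size() {
        return byAccession.size();
    }
}
```

A lookup table indexed by character code keeps the validity check at one array access per residue, which matters when roughly half a million sequences are filtered at load time.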

5.1.2 Sequence Matching and Scoring

If the PROSITE notation is used, the pattern is converted to a regular expression and the matching is then performed with the built-in functions of the java.util.regex package. The method prositeToRE in the class RegExp.java converts a PROSITE pattern string to a normal regular expression (see Section B.1.3). For example:

PROSITE notation: [DENG]-{A}-[DENQGSTARK]-x(0,2)-[DENQARK]-[LIVFY]-{CP}-G-{C}-W-[FYWLRH]-{D}-[LIVMTA]

Normal regex: [DENG][^A][DENQGSTARK].{0,2}[DENQARK][LIVFY][^CP]G[^C]W[FYWLRH][^D][LIVMTA]

The method search returns true if the sequence matches the regular expression and false otherwise. Regexes can be compiled into Deterministic Finite Automata that can match a string of length N in O(N) time. Fig. 5.4 shows 10 runs of the consensus pattern for the Cytochrome P450 cysteine heme-iron ligand signature ([FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD]) against all the sequences in the fasta file previously used.

Figure 5.4: 10 runs RegExp.search.

If a PWM is used, the algorithm to determine the match score has to be fast enough to go through all the protein sequences in an acceptable period. In this case the operations are parallelized to take advantage of the processor power. The success of this stage is evaluated in terms of the time and space complexity of the algorithm used. Other aspects can also be evaluated, such as design, structure, documentation, indentation and readability, among others. Table 5.1 shows 10 runs of a normal PWM scoring and a concurrent PWM scoring using the Clustal alignment of the Cytochrome P450 cysteine heme-iron ligand signature. The length of the PWM is 10 positions.

Normal PWMScore    Concurrent PWMScore

Table 5.1: Comparison of normal and concurrent PWM scoring

The class PWMScore includes the method call, as it is implemented as a Callable object that runs concurrently (refer to Appendix B.1.2). If M is the size of the sequence and N the size of the PWM, the space complexity is O(M + N) and the time complexity is O((M − N) ∗ N). The worst case in time occurs when N = M/2, resulting in O(M^2).

The package Sequence provides all the methods to read fasta files and to perform the matching and scoring of a motif against the loaded protein sequences.
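The PROSITE-to-regex conversion described in this section can be sketched in a few lines. This is a minimal illustration, not the prositeToRE method itself: the class PrositeSketch and its methods are hypothetical, and only the common PROSITE elements (x, [..], {..}, repetition counts, the < and > anchors and the trailing period) are handled.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a PROSITE pattern to java.util.regex conversion.
class PrositeSketch {
    /** Translates a PROSITE pattern into an equivalent regular expression. */
    static String toRegex(String prosite) {
        String p = prosite.replaceAll("\\s+", "");
        if (p.endsWith(".")) p = p.substring(0, p.length() - 1);
        StringBuilder re = new StringBuilder();
        for (int i = 0; i < p.length(); i++) {
            char c = p.charAt(i);
            switch (c) {
                case '-': break;                      // element separator: dropped
                case 'x': re.append('.'); break;      // any amino acid
                case '{': re.append("[^"); break;     // start of forbidden residues
                case '}': re.append(']'); break;      // end of forbidden residues
                case '(': re.append('{'); break;      // start of repetition count
                case ')': re.append('}'); break;      // end of repetition count
                case '<': re.append('^'); break;      // N-terminal anchor
                case '>': re.append('$'); break;      // C-terminal anchor
                default: re.append(c);                // residues, [ ], digits, commas
            }
        }
        return re.toString();
    }

    /** True if the pattern occurs anywhere in the sequence. */
    static boolean matches(String prosite, String sequence) {
        return Pattern.compile(toRegex(prosite)).matcher(sequence).find();
    }
}
```

For instance, PrositeSketch.toRegex("x(0,2)") yields ".{0,2}", which is the repetition form shown in the example above.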

5.2 GO Associations

This step is focused on identifying the GO terms associated with the protein sequences and the relations between GO terms. These GO term annotations will reveal relationships between proteins. The problem faced is the amount of GO annotation data. A data structure is essential to efficiently organize the GO terms and annotations, and to quickly access the data. As in this case the relationships between GO terms form a directed acyclic graph (DAG), a specific data structure to store and access this data is required.

GO Domain             Vertices   Edges
Molecular Function    9550       11498
Biological Process    24683      52288
Cellular Component    3146       4798

Table 5.2: Statistics GO Domain DAG.

5.2.1 Mapping sequences and annotations

The annotations file is very large (Section 4.1.2) and cannot be held in memory, so it needs to be preprocessed before it can be used. The method that reads this file is joinGeneAssociation in the class Mapping of the package go. It extracts the annotations for a specific list of fasta sequences, and its output is a file containing the name of each sequence and its annotations. The mapping took 840257 ms (14 min). A total of 4325432 annotations were mapped to 538925 protein names.
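The preprocessing idea, streaming the annotations and keeping only the lines that belong to loaded sequences, can be sketched as below. The sketch is an assumption-laden illustration rather than joinGeneAssociation itself: the class AnnotationJoinSketch is hypothetical, the input is assumed to be in the tab-separated GAF layout (comment lines start with '!', column 2 is the accession, column 5 the GO identifier), and the result is returned as a map instead of being written to a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: filter a large annotation stream by a set of accessions.
class AnnotationJoinSketch {
    /** Returns accession -> GO ids, reading the stream line by line. */
    static Map<String, Set<String>> join(Reader gaf, Set<String> wanted) {
        Map<String, Set<String>> result = new HashMap<String, Set<String>>();
        try {
            BufferedReader reader = new BufferedReader(gaf);
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty() || line.charAt(0) == '!') continue; // comment line
                String[] cols = line.split("\t");
                if (cols.length < 5) continue;                         // malformed line
                String accession = cols[1];
                if (!wanted.contains(accession)) continue;             // not a loaded sequence
                Set<String> terms = result.get(accession);
                if (terms == null) {
                    terms = new TreeSet<String>();
                    result.put(accession, terms);
                }
                terms.add(cols[4]);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return result;
    }
}
```

Only one line of the input is held in memory at a time, so memory use depends on the number of retained annotations rather than on the size of the annotations file.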

5.2.2 Extracting GO terms from DAG

The Ontology file includes cross-products, inter-ontology links, and has_part relationships (see Section 2.5.3). The OBO format is the text file format used by OBO-Edit (http://www.oboedit.org/). OBO-Edit is an open source ontology editor written in Java. The publicly available libraries obo.jar and bbop.jar (Section 4.2.4) are used to parse the file and load its contents into the DAG data structure. It is possible to infer the transitive GO terms (parent terms) from the DAG (network). The method printGONetworksReduced in the class GODefinition uses the information in the network to create two files: one contains the GO term ids with their GO domain and description, and the other contains the GO terms and their transitive GO terms. Table 5.2 shows the statistics related to each DAG of the three GO domains. The sizes of the obo and annotations files are 29.3 MB and 35.67 GB respectively. Table 5.3 shows the resulting files after preprocessing. These files are read once when the tool is started and their data is kept in Maps with the identifiers as keys and the objects as values. As stated before, the space complexity is O(n) and the average time complexity of a search is O(1). The following is the output of the application when it loads the files.

Filename                      Size
seqs_annotations.txt          38.5 MB
gene_ontology_ext.obo.desc    2.8 MB
gene_ontology_ext.obo.term    4.4 MB

Table 5.3: GO files after processing

Constructor GOAnnotation
538925 sequences to 4325432 annotations
22326 annot to 4325432 sequence names
37379 term descriptions
37379 term to 509303 transitive terms
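The inference of transitive (parent) terms from the DAG mentioned in Section 5.2.2 amounts to collecting every ancestor of a term. A minimal sketch follows, assuming the DAG is available as a map from each term to its direct parents; the class TransitiveTermsSketch and this map representation are assumptions for illustration and do not reflect the OBO-Edit API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: collect all ancestors of a term in a DAG.
class TransitiveTermsSketch {
    /** Returns every term reachable from the given term by following parent links. */
    static Set<String> ancestors(String term, Map<String, List<String>> parents) {
        Set<String> seen = new LinkedHashSet<String>();
        Deque<String> stack = new ArrayDeque<String>();
        stack.push(term);
        while (!stack.isEmpty()) {
            String current = stack.pop();
            List<String> direct = parents.get(current);
            if (direct == null) continue;             // root term: no parents
            for (String parent : direct) {
                // Terms reachable through several paths are visited only once.
                if (seen.add(parent)) stack.push(parent);
            }
        }
        return seen;
    }
}
```

Because each term and each edge is visited at most once, the cost for one term is bounded by the number of vertices and edges of its domain (Table 5.2).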

During the progress seminar, these two major components of the research project were presented. The presentation covered the details of the work done so far, including new findings and improvements as well as issues and challenges, and specified the work remaining to be done.

5.3 Taxonomy Tree

One objective of the application is to make the selection of species easier for the user. In order to accomplish this, an interactive taxonomy tree was created, allowing the user to browse and select species of interest.

5.3.1 Generating Taxonomy

The taxonomy tree is created based on the species related to the protein sequences in the fasta file. The other files necessary to build the taxonomy are described in Section 4.1.3. The method generateLimitedTaxonomyTree in the class TaxonomyTree creates a tree for the user interface. First it gets the species from the protein sequences, then it maps each species name to its taxonomic node id, and finally it populates the data structure of the tree based on the relations between the nodes. A Map is also used here to keep this data in memory and to have fast access to it, which makes it the most used data structure in this tool. The following is the output of the application when it is loading the taxonomy related data.

Construct TaxonomyTree
22778 size map taxon to species
1088999 relations in taxonomy
12627 taxons in sequences
12623 Common nodes taxonomy and seqs
22661 Common nodes taxonomy and specList
13155 nodes in Taxonomy Tree
Added level to nodes
Added species count to nodes
Sequences in tree 538441
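The construction of a tree limited to the species present in the sequences (Section 5.3.1) can be sketched as follows. This is an illustration only: TaxonomySketch and its methods are hypothetical names, the input lines are assumed to follow the "tax_id | parent_tax_id | ..." layout of nodes.dmp, and the root is assumed to be its own parent.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: keep only the taxonomy nodes on a path from a species to the root.
class TaxonomySketch {
    /** Parses "tax_id | parent_tax_id | ..." lines into a child-to-parent map. */
    static Map<Integer, Integer> parseNodes(List<String> lines) {
        Map<Integer, Integer> parentOf = new HashMap<Integer, Integer>();
        for (String line : lines) {
            String[] fields = line.split("\\|");
            if (fields.length < 2) continue;
            parentOf.put(Integer.valueOf(fields[0].trim()), Integer.valueOf(fields[1].trim()));
        }
        return parentOf;
    }

    /** Builds a parent-to-children map restricted to ancestors of the given species. */
    static Map<Integer, Set<Integer>> limitedTree(Map<Integer, Integer> parentOf,
                                                  Collection<Integer> species) {
        Map<Integer, Set<Integer>> children = new TreeMap<Integer, Set<Integer>>();
        for (Integer taxon : species) {
            Integer node = taxon;
            while (parentOf.containsKey(node)) {
                Integer parent = parentOf.get(node);
                if (parent.equals(node)) break;       // reached the root
                Set<Integer> kids = children.get(parent);
                if (kids == null) {
                    kids = new TreeSet<Integer>();
                    children.put(parent, kids);
                }
                if (!kids.add(node)) break;           // rest of the path is already recorded
                node = parent;
            }
        }
        return children;
    }
}
```

Stopping as soon as an edge is already recorded means shared upper levels of the taxonomy are walked only once, regardless of how many species sit below them.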

5.4 Statistical Test

Fisher’s exact test (FET) (Section 2.6.1) and the Mann-Whitney-Wilcoxon test (MWW) (Section 2.6.2) are statistical significance tests whose p-value is the probability of obtaining data at least as extreme as those observed if only chance were at work.
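As a rough illustration of the MWW statistic, the sketch below computes the U statistic of one sample against another and its z-score under the normal approximation. It is not the MannWithney class of the tool: the class MannWhitneySketch is hypothetical, ties count as half a win, and the tie correction of the variance is deliberately omitted.

```java
import java.util.Arrays;

// Hypothetical sketch of the Mann-Whitney U statistic (no tie correction).
class MannWhitneySketch {
    /** U of sample x against sample y: pairs with x > y, plus half of the tied pairs. */
    static double u(double[] x, double[] y) {
        double[] sortedY = y.clone();
        Arrays.sort(sortedY);
        double sum = 0.0;
        for (double value : x) {
            int below = lowerBound(sortedY, value);      // elements of y smaller than value
            int notAbove = upperBound(sortedY, value);   // elements of y not larger than value
            sum += below + 0.5 * (notAbove - below);
        }
        return sum;
    }

    /** z-score of U under the normal approximation for sample sizes n1 and n2. */
    static double z(double u, int n1, int n2) {
        double mean = (double) n1 * n2 / 2.0;
        double sd = Math.sqrt((double) n1 * n2 * (n1 + n2 + 1) / 12.0);
        return (u - mean) / sd;
    }

    /** Index of the first element that is >= v. */
    private static int lowerBound(double[] a, double v) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < v) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Index of the first element that is > v. */
    private static int upperBound(double[] a, double v) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] <= v) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

Sorting one sample and using binary search keeps the cost at O((n1 + n2) log n2) instead of comparing every pair.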

5.4.1 Species Enrichment

A two-tailed FET or the MWW U test p-value can be used to determine whether a particular motif is overrepresented in a specific species relative to the rest. Calculating hypergeometric distribution probabilities, as FET requires, for the amount of data used in the tool is computationally expensive. An efficient function to perform these calculations is therefore implemented: an approximation based on the gamma function is used instead of evaluating the hypergeometric distribution directly. Table 5.4 shows the time in ms to perform enrichment using FET for species, species + non-transitive GO terms and species + transitive GO terms. The package stats includes, among others, the classes FisherExactTest and MannWithney, each of which contains the method pValue to calculate the statistical significance.
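One common way to use a gamma-function approximation for FET is to evaluate the hypergeometric probabilities in log space. The sketch below follows that approach under stated assumptions: the class FisherSketch is hypothetical, the log-gamma function uses a Stirling series (the approximation used by the tool is not specified in the text), and the two-tailed p-value sums all tables that are no more probable than the observed one.

```java
// Hypothetical sketch of a two-tailed Fisher's exact test computed with log-gamma.
class FisherSketch {
    /** Natural log of the gamma function for x > 0, using a Stirling series. */
    static double lnGamma(double x) {
        double shift = 0.0;
        while (x < 10.0) {               // ln G(x) = ln G(x + 1) - ln x
            shift -= Math.log(x);
            x += 1.0;
        }
        double inv = 1.0 / x, inv2 = inv * inv;
        double series = inv * (1.0 / 12 - inv2 * (1.0 / 360 - inv2 / 1260));
        return shift + (x - 0.5) * Math.log(x) - x + 0.5 * Math.log(2 * Math.PI) + series;
    }

    /** Natural log of the binomial coefficient C(n, k). */
    static double lnChoose(int n, int k) {
        return lnGamma(n + 1.0) - lnGamma(k + 1.0) - lnGamma(n - k + 1.0);
    }

    /** Log hypergeometric probability of a table with top-left cell a and fixed margins. */
    static double lnProbability(int a, int row1, int row2, int col1) {
        return lnChoose(row1, a) + lnChoose(row2, col1 - a) - lnChoose(row1 + row2, col1);
    }

    /** Two-tailed p-value for the 2x2 table [[a, b], [c, d]]. */
    static double twoTailed(int a, int b, int c, int d) {
        int row1 = a + b, row2 = c + d, col1 = a + c;
        int lo = Math.max(0, col1 - row2), hi = Math.min(row1, col1);
        double observed = lnProbability(a, row1, row2, col1);
        double p = 0.0;
        for (int k = lo; k <= hi; k++) {
            double lnP = lnProbability(k, row1, row2, col1);
            // Small tolerance so that equally probable tables are not lost to rounding.
            if (lnP <= observed + 1e-7) p += Math.exp(lnP);
        }
        return Math.min(1.0, p);
    }
}
```

Working with logarithms avoids overflowing the factorials, which would otherwise happen quickly with counts in the hundreds of thousands.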

5.4.2 Gene Ontology Enrichment

In order to determine whether a property of the motif-matching proteins, for example Molecular Function, is shared among species, GO enrichment analysis is performed using FET based on the annotated proteins for a particular species. The results help to identify terms that are not likely to appear by chance alone.

5.5 Web Service

The last part is the implementation of the tool as a web service, which can be accessed and used by scientists in a user-friendly way. It can be evaluated in terms of performance and usability.

Species   NonTrans   Trans
48        728        883
12        809        877
6         716        868
7         746        841
13        707        871
8         734        866
6         732        815
6         725        881
6         753        1328
12        739        808

Table 5.4: 10 runs FET for species and GO terms

5.5.1 dieGOS

dieGOS is the implementation as a web service of the tool proposed in this thesis project. The name is an acronym of dynamic interactive enrichment of GO and Species. The aim of the tool is usability and relevance while maintaining a pleasant look and feel. Fig. 5.5 displays the logo of the tool, which comprises the letters of the name surrounded by a DNA strand and a sunburst image enclosing the last letter, which represents the species.

Figure 5.5: dieGOS Logo

5.5.2 Input

The input component features different options for the user. Each option includes a link to an explanatory help dialogue.

• Regular Expression: PROSITE pattern notation or normal regular expression.

• Position Weight Matrix: Clustal format [22], Meme format [3] or Distribs format (representation of the distribution of frequencies used internally).

Figure 5.6: dieGOS: Input component

Advanced Options for PWM

These options are necessary for fine tuning the parameters of the enrichment when the user employs a PWM (Section 2.1.2). It is important to define a threshold for the matching proteins with bigger scores and to establish the background probabilities. The options include:

• Slider for establishing the threshold in percentage between matching and non matching protein sequences.

• DataTable for specifying the frequencies of each amino acid. The probabilities will be updated automatically.

• Set the probabilities directly from the species tree.

• Set the probabilities from a fasta file.
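How a percentage threshold and background probabilities could enter a PWM score is sketched below. This is an assumption-based illustration, not the scoring used by dieGOS: the class PwmSketch is hypothetical, the score is taken to be the best log-odds window (motif probability over background probability), and sequences are assumed to contain only the 20 amino acid symbols.

```java
import java.util.Arrays;

// Hypothetical sketch: log-odds PWM scoring with a background and a percentage cut-off.
class PwmSketch {
    static final String ALPHABET = "ACDEFGHIKLMNPQRSTVWY";

    /**
     * Best log-odds score over all windows of the sequence.
     * pwm[pos][aa] and background[aa] are probabilities indexed as in ALPHABET.
     * Returns NEGATIVE_INFINITY if the sequence is shorter than the PWM.
     */
    static double bestScore(String seq, double[][] pwm, double[] background) {
        int width = pwm.length;
        double best = Double.NEGATIVE_INFINITY;
        for (int start = 0; start + width <= seq.length(); start++) {
            double score = 0.0;
            for (int pos = 0; pos < width; pos++) {
                int aa = ALPHABET.indexOf(seq.charAt(start + pos));
                score += Math.log(pwm[pos][aa] / background[aa]);
            }
            if (score > best) best = score;
        }
        return best;
    }

    /** Score such that roughly the top fraction of sequences lies at or above it. */
    static double cutoff(double[] scores, double fraction) {
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int keep = Math.max(1, (int) Math.floor(sorted.length * fraction));
        return sorted[sorted.length - keep];
    }
}
```

With this reading, the slider value plays the role of fraction, and changing the background (for example, taking it from a branch of the species tree) shifts every log-odds term.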

Figure 5.7: dieGOS: Input Advanced Options for PWM component

5.5.3 Species Tree

This component is divided into two parts: search and tree. The search section enables the user to find the species of interest. It includes a text box to specify the species or taxon number and a button to start the search. After the results are returned, the user can click on a species from the result list and the species will be selected in the species tree. The species tree includes all the species containing proteins in the uniprot-swissprot fasta file plus the species in the NCBI taxonomy database needed to complete the tree. It includes controls to delete the selection and collapse the tree, and a checkbox to propagate the selection down the tree.

Advanced Options for Species Tree

The user can specify the background species rank in order to limit the test to a specific level in the tree.

Figure 5.8: dieGOS: Species Tree

5.5.4 Statistical Test

This component allows the user to select the test to be used. It can be Fisher’s Exact Test or Mann Whitney Wilcoxon.

Gene Ontology Enrichment

This component is included in the Statistical Test component as Advanced Options. It lets the user include GO enrichment of transitive or non-transitive terms.

5.5.5 Results

The results are presented in a data table with many controls enabled. The user can order the results based on the p-value, see the enrichment of GO terms for each species and export the results in four different formats (xls, pdf, csv, xml).

5.5.6 Visualization

These options help the user to analyse and interpret the data. The first is a graph with each enriched species, where the size and colors are based on the p-value and species membership. It also includes the name of the parent species. The user can drag the species and change its place.

Figure 5.9: dieGOS: Statistical Significance Test

Figure 5.10: dieGOS: Results

The bundle is a powerful visualization that allows the user to identify the relation between species and GO terms. The user can drag the bundle to rotate the labels, and when the mouse pointer passes over a chord, it changes colour.

Figure 5.11: dieGOS: Tree Enriched Species

Fig. 5.13 shows the resource usage during the server startup. After the application is loaded, the used memory is 1364 MB, and about 150 MB extra are added for the daemons that manage the web user requests. The time spent starting up is 33825 ms (about 34 s).

Figure 5.12: dieGOS: Bundle Species and GO terms

Figure 5.13: visualVM: Startup process

Chapter 6

Results and Discussion

6.1 Results Comparison

In order to validate the predictions of the tool, we can make use of a specific study such as [6] and compare against its experimental results on whether a certain NLS is plant-specific. The paper states that two NLSs, previously described as plant-specific, bind to and are functional with plant, mammalian, and yeast importin-α proteins but interact with rice importin-α more strongly. Differences worth mentioning are:

• The threshold is not specified. The tool takes only the most enriched sequences; 0.2% is a representative sample (538925 × 0.002 ≈ 1077).

• The background for the PWM was also not provided. The tool uses the background from the Eukaryota category because the species of interest are part of it.

• The comparison is between Streptophyta phylum and the other species but the tool performs the test independently of the other species (against the background).

• The tool uses a two-tailed FET instead of the one-tailed test used in the paper.

The alignment from the study was provided for this comparison of results. A PWM was constructed from this file and was used to obtain the p-value for different species. The calculated values are:

• Mammalia P = 0.23

• Arabidopsis P = 0.05

• human P = 1.42e-11

• mouse P = 2.75e-06


• S. cerevisiae P = 0.03

Fig. 6.1 shows the output of the tool. There are differences between the results of the paper and those of the tool. Human and Arabidopsis are more enriched. Saccharomyces is particularly enriched in the results from the tool but not in the paper. However, this is because the number of proteins with this motif is significantly underrepresented in relation to the others. Table 6.1 shows the two contingency matrices.

Figure 6.1: Species enrichment of NLS motif using FET

The Molecular Function terms associated with the proteins that match the motif are shown in the inner table, which makes the analysis and comparison between species easy. It shows “protein domain specific binding” as the most enriched GO term for all the species except S. cerevisiae. If the motif is an NLS representation, then these results are consistent.

The results of the tool applying MWW are presented in Fig. 6.2 and show enrichment of all the species, with Saccharomyces cerevisiae showing significantly less enrichment. These results exhibit the same pattern as the former, but they are presented with p-value 0.0, indicating complete enrichment.

Cont. Mat. Human    Cont. Mat. S. cerevisiae

Table 6.1: Results: Comparison contingency matrices

Figure 6.2: Species enrichment of NLS motif using MWW

The results can be visualized in the graph (Species Flower), which shows the relations in the taxonomy (how far apart the species are from each other) and the enrichment as the size of the nodes.

6.2 Similar tools Comparison

A review in [23] gives an overview of Ontological analysis tools showing their strengths and limitations with interesting highlights in areas not covered until that date. This review was performed in 2005, therefore the analysed tools have changed and have been updated in their new releases. A review of the current tools was performed for this project but not at the same depth. After reviewing the functionality of the tools, most of them require as input a target set of genes and a background gene set, calculating the enrichment in the target set compared to the background set. Each tool addresses a specific problem related to GO enrichment and each has its specific approach. Table 6.2 displays each tool in the row and the capabilities in the columns. The capabilities are split as follows.

• Input Alternatives: gene list, regular expression, PWM.

• Enrichment Options: set background, select GO domain, select type of evidence for the existence of a protein, specify species.

• Output Choices: export results, visualize results.

Figure 6.3: Enrichment results graph

              Input             Options                   Output
Tool          List  RE  PWM     Back  Dom  Evid  Spec     Expt  Vis
AmiGO
BiNGO
DAVID
ermineJ
g:Profiler
GOrilla
GREAT
GOBU
GOFFA
GeneMANIA
TermFinder
GoMiner
NOA
OE
OE2GO
PiNGO
ProteInOn
STEM
StRAnGER
ToppGene
agriGO
dieGOS

Table 6.2: Comparison between GO enrichment tools.

Chapter 7

Conclusions

This new tool fills a gap left by similar tools by providing species-specific enrichment besides the usual GO enrichment, a wide range of inputs (not only a set of genes), facilities for searching species in the phylogenetic tree and selecting one or all the species below its rank, results exported in xls, pdf, csv or xml format, and interactive visualizations. The whole process to obtain the results is executed in a reasonable time frame.

7.1 Possible future work

The base to continue improving the tool has been established with this project. There are many improvements that can be implemented, for example (though not limited to):

• Multiple testing correction: Bonferroni (FWER) or Benjamini & Hochberg (FDR).

• Multi threading algorithms

• More options for enrichment analysis

• More detailed data about the protein sequences matching the motif

• Interaction with other tools
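As an indication of what the first item in the list above would involve, the sketch below adjusts a list of p-values with the Bonferroni and Benjamini & Hochberg procedures. It is not part of dieGOS; the class CorrectionSketch and its methods are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of two multiple testing corrections (not part of the tool).
class CorrectionSketch {
    /** Bonferroni: each p-value is multiplied by the number of tests, capped at 1. */
    static double[] bonferroni(double[] p) {
        double[] adjusted = new double[p.length];
        for (int i = 0; i < p.length; i++) adjusted[i] = Math.min(1.0, p[i] * p.length);
        return adjusted;
    }

    /** Benjamini & Hochberg adjusted p-values, returned in the original order. */
    static double[] benjaminiHochberg(final double[] p) {
        int m = p.length;
        Integer[] order = new Integer[m];
        for (int i = 0; i < m; i++) order[i] = i;
        // Sort the indices by ascending p-value.
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Double.compare(p[a], p[b]);
            }
        });
        double[] adjusted = new double[m];
        double running = 1.0;
        // Walk from the largest p-value down, enforcing monotonicity.
        for (int rank = m; rank >= 1; rank--) {
            int i = order[rank - 1];
            running = Math.min(running, p[i] * m / rank);
            adjusted[i] = running;
        }
        return adjusted;
    }
}
```

Bonferroni controls the family-wise error rate and is the more conservative of the two, while Benjamini & Hochberg controls the false discovery rate, which is usually preferred when many species and GO terms are tested at once.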

7.2 Comments

The development of the tool has been an interesting experience due to the interaction with scientists and researchers in the field, who shared their views in order to achieve a useful and practical tool. The product of this work is a functional tool that aims to be used by a wide range of biologists and professionals involved in Molecular Biosciences.


Even though the performance of the algorithms is acceptable, continuous monitoring of the server is necessary to ensure a good service under heavy workload. A lot of work has been invested in this tool, so I hope all the effort was worthwhile, and I foresee a long way to go for the tool from now on.

Appendix A

Similar Tools

A.1 AmiGO

AmiGO (http://amigo.geneontology.org/cgi-bin/amigo/go.cgi) is the official Web-based set of tools for searching and browsing the Gene Ontology database. Although searching and browsing are the main functionalities of AmiGO, it also provides GO Term Enrichment through GO-TermFinder (http://search.cpan.org/dist/GO-TermFinder/) (see Section A.15). The tool is used to discover what a set of genes may have in common by examining annotations and finding significant shared GO terms (see [5]). The web interface lets the user insert the list of genes and the background set, and provides 4 options: a database filter, whether to use electronically inferred data, and thresholds for the minimum p-value and minimum number of gene products.

dieGOS does not provide capabilities for searching or browsing the Gene Ontology database. However, like AmiGO, dieGOS provides enrichment functionality, and it also accepts a sequence motif as input. Furthermore, dieGOS provides advanced options to the user when dealing with PWMs and a simple interface for selecting species from the taxonomy tree.

A.2 BiNGO

The Biological Networks Gene Ontology (BiNGO) (http://www.psb.ugent.be/cbd/papers/BiNGO/Home.html) is a Java tool used to determine overrepresented GO Terms in a set of genes. BiNGO can be used either on a list of genes, pasted as text, or interactively on subgraphs of biological networks in Cytoscape (http://www.cytoscape.org/). BiNGO maps the predominant functional themes of the tested gene set, and uses Cytoscape’s visualization environment to produce a visual representation of the results ([27]).


The main difference between BiNGO and dieGOS is that the former is a desktop application that supports graphs as input, while the latter is a web tool that supports sequence motifs (Section 2.1) as input and multiple species enrichment.

A.3 Bioconductor

Bioconductor (http://www.bioconductor.org/) provides tools for the analysis of genomic data. It supplies packages developed in the R programming language. Bioconductor is also available as an Amazon Machine Image (AMI). Its packages for GO include software tools to query; join with diverse additional gene, microarray, and sequence annotations; incorporate GO into annotation, differential expression, and gene set enrichment work flows; and visualize [20]. In order to use Bioconductor packages, researchers need training in R and in computational and statistical methods.

A.4 DAVID

The Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.7 (http://david.abcc.ncifcrf.gov/) is an online tool that provides an extensive set of functional annotation tools for a given list of genes [10]. DAVID tools can:

• Identify enriched GO terms.

• Discover enriched functional-related gene groups.

• Cluster redundant annotation terms.

• Visualize genes on BioCarta & KEGG pathway maps.

• Search for other functionally related genes not in the list.

• List interacting proteins.

• Highlight protein functional domains and motifs.

• Convert gene identifiers from one type to another.

• And more.

Undoubtedly, DAVID is one of the most powerful GO term enrichment and statistical analysis tools. However, it only permits the user to input a list of gene names. Although dieGOS does not include all the capabilities of DAVID, it is designed for other types of analysis. It gives the user more flexibility for input when working with sequence motifs and supports multiple species enrichment analysis for determining their significance.

A.5 ermineJ

ermineJ (http://bioinformatics.ubc.ca/ermineJ/) is data analysis software developed in Java for rankings of genes. One application is to determine whether particular biological pathways are particularly significant in the data. The software is designed to be used by biologists with little or no informatics background. A command-line interface is also available for users with scripting skills. ermineJ is a desktop tool that can be used with any genome-wide analysis that yields rankings of genes. This tool is not limited to GO annotations ([25]).

A.6 FuSSiMeG

FuSSiMeG (http://xldb.fc.ul.pt/biotools/rebil/ssm/) is an online tool that is being discontinued; users are advised to use ProteinOn (Section A.24) instead. It measures the functional similarity between two proteins using the semantic similarity between their GO term annotations.

A.7 g:Profiler

g:Profiler (http://biit.cs.ut.ee/gprofiler/) is an online tool for characterizing and manipulating gene lists. g:Profiler provides a web interface with visualization capabilities ([30, 31]). g:Profiler has the following tools:

• g:GOSt retrieves significant GO terms, KEGG and REACTOME pathways, and TRANSFAC motifs for a user-specified group of genes, proteins or microarray probes.

• g:Convert allows conversion between gene or protein names, database IDs and microarray probes.

• g:Orth retrieves orthologs for a given set of genes, proteins or probes in a selected organism. Graphical representation also shows orthologs present in all g:Profiler organisms.

• g:Sorter searches for similar expression profiles to a given gene, protein or probe in a large set of public microarray datasets from the Gene Expression Omnibus (GEO) database.

• g:Cocoa provides a compact interface for comparing enrichments of multiple gene lists.

g:Profiler offers different tools with many options for dealing with gene lists. dieGOS does not provide all the options that g:Profiler does, but it gives the user the option to compare the significance of a sequence motif between different species, which is not present in g:Profiler.

A.8 GOrilla

GOrilla (http://cbl-gorilla.cs.technion.ac.il/) is a web-based application that identifies enriched GO terms in ranked lists of genes, without requiring the user to provide explicit target and background sets. GOrilla employs a flexible threshold statistical approach to discover GO terms that are significantly enriched at the top of a ranked gene list (see [14]).

dieGOS allows the user to select the “Mann Whitney Wilcoxon” (Section 2.6.2) statistical significance test to perform the analysis of a ranked list without requiring explicit target and background sets. dieGOS also provides advanced options for this type of analysis. Additionally, GOrilla is limited to 7 organisms, whereas dieGOS provides the complete list of species extracted from the protein sequences of the Uniprot-Swissprot (Section 2.3) database.

A.9 GREAT

Genomic Regions Enrichment of Annotations Tool (GREAT) (http://great.stanford.edu/public/html/splash.php) is an online tool to analyze the functional significance of cis-regulatory regions identified by localized measurements of DNA binding events across an entire genome ([29]).

GREAT operates on whole genomes, finding significant functional regions. dieGOS, unlike GREAT, is used to find the significance of a specific motif (Section 2.1) between species.

A.10 GOBU

Gene Ontology Browsing Utility (GOBU) (http://bc02.iis.sinica.edu.tw/gobu/manual/index.html) is a program written in Java for integrating biological annotation catalogs. Besides its exploration capabilities, it provides GO enrichment analysis for sets of genes and compares the results through a “parallel display” ([26]).

The concept of “comparing different sets of genes” is applied in dieGOS, but the list of genes can be inferred directly from the input motif and the selected species. Another difference is that dieGOS is a web tool and GOBU is a desktop application.

A.11 GOFFA

Gene Ontology For Functional Analysis (GOFFA) ArrayTrack (http://www.fda.gov/ScienceResearch/BioinformaticsTools/Arraytrack/default.htm) is a tool developed for the ArrayTrack™ database that takes a list of genes and identifies terms in Gene Ontology associated with those genes. ArrayTrack™ is a DNA microarray database and uses R/Bioconductor (see Section A.3). GOFFA needs Java 6 and runs as a Java Web Start application ([35]).

GOFFA is a very powerful tool to analyze and interpret omics data, and it provides GO enrichment functionality for ArrayTrack™. dieGOS is focused on finding whether a user-specified motif (Section 2.1) is statistically associated with a species and which GO Terms are enriched for that species.

A.12 GeneMANIA

GeneMANIA (http://genemania.org/) predicts the function of a list of genes. It finds other related genes using functional association data that include protein and genetic interactions, pathways, co-expression, co-localization and protein domain similarity. Based on the gene list, it also returns the connections between genes ([36]). GeneMANIA performs the prediction for a list of genes based on a network of interactions. dieGOS does not use network interaction data for its calculations; instead it relies on matching and scoring a motif against all the proteins in the UniProtKB/Swiss-Prot database. Although GeneMANIA is a powerful predictor, it lacks features that dieGOS provides, such as motif input and statistical significance tests.

A.13 GeneMerge

GeneMerge (http://genemerge.cbcb.umd.edu/) is a web-based and standalone application that takes a set of genes as input and provides rank scores for over-representation of particular functions or categories in the data.

Unfortunately this tool was not available during the writing of this thesis for comparison purposes.

A.14 GeneTools

GeneTools (http://www.genetools.microarray.ntnu.no/common/intro.php) is a collection of web-based tools that integrate information from different resources. The site contains two tools: the NMC Annotation Database V2.0 and eGOn V2.0 (explore Gene Ontology). Unfortunately this tool was not available during the writing of this thesis for comparison purposes.

A.15 TermFinder

TermFinder (http://go.princeton.edu/cgi-bin/GOTermFinder) is a tool for finding significant GO terms shared among a list of genes from a specific organism, helping to determine what those genes may have in common. It is written in Perl and can be run either as a command line application, in single or batch mode, or as a web-based CGI script ([4]). TermFinder's aims are similar to those of dieGOS; however, the latter provides more input options, the option to run enrichment for multiple species, and visualisation capabilities.

A.16 GOTermMapper

GO Term Mapper (http://go.princeton.edu/cgi-bin/GOTermMapper) is a tool that bins the genes of a list from a given organism into a static set of ancestor GO terms, in order to find the GO terms those genes share.

A.17 GoBean

GoBean (http://neon.gachon.ac.kr/GoBean/) is a Java application for GO enrichment analysis built on the NetBeans platform framework. Unfortunately this tool was not available during the writing of this thesis for comparison purposes.

A.18 GraphWeb

GraphWeb (http://biit.cs.ut.ee/graphweb/) is an online tool for graph-based analysis of biological networks that allows the detection of modules from biological, heterogeneous and multi-species networks. It interprets discovered modules using Gene Ontology, pathways, and cis-regulatory motifs ([32]). GraphWeb was developed by the same people as g:Profiler (see Section A.7), and its functionality is not comparable with that of dieGOS.

A.19 GoMiner

GoMiner (http://discover.nci.nih.gov/gominer/GoCommandWebInterface.jsp) is an online tool that organizes lists of under- and overexpressed genes according to their GO terms. High-Throughput GoMiner is an enhancement of GoMiner that performs automated batch processing and is implemented both as a command line interface and as a web interface. Its aim is to automate the analysis of multiple microarrays and then to integrate the results. GoMiner's capabilities include:

• Automated batch processing of microarrays.

• Produces a report that rank-orders the multiple microarray results according to the number of significant GO categories.

• Integrates the multiple results in a global clustered image of the relationships of significant GO categories.

• Provides a ‘false discovery rate’ multiple comparisons calculation.

• Provides annotations and visualizations for relating transcription factor binding sites to genes and GO categories.

Just as High-Throughput GoMiner tries to automate the analysis of multiple microarrays, dieGOS tries to automate the analysis of a specific motif across multiple species.

A.20 NOA

Network Ontology Analysis (NOA) (http://app.aporc.org/NOA/) is a collection of Gene Ontology tools that analyze the functions of gene networks instead of gene lists. NOA can be used from Cytoscape (http://www.cytoscape.org/) as a plugin that performs ontology over-representation analysis based on the network connections among annotated nodes. NOA differs from dieGOS in that it operates on gene networks. Although it is a very powerful tool, its aims are different from those of dieGOS.

A.21 Onto-Express

Onto-Express (OE) (http://vortex.cs.wayne.edu/projects.htm#Onto-Express) is a web-based tool in the Onto-Tools suite that translates lists of genes into functional profiles. The functional profiles are constructed using GO terms. Statistical significance values are calculated for each domain. A user account is required ([24]). Even though OE is a Java applet, its GO enrichment functionality is basic and lacks visualisation capabilities.

A.22 OE2GO

Onto-Express To Go (OE2GO) (http://vortex.cs.wayne.edu/projects.htm#OE2GO) is a tool in the Onto-Tools suite that addresses the problem of organisms that do not have annotations. OE2GO is built on top of OE, but users have the option to use either the Onto-Tools database as a source of functional annotations or provide their own annotations in a separate file. OE2GO has the same functionality as OE plus an alternative GO annotation database.

A.23 PiNGO

PiNGO (http://www.psb.ugent.be/esb/PiNGO/Home.html) is a Java-based tool that predicts the categorization of a gene based on the annotations of its neighbors in a network, using the enrichment statistics of BiNGO (Section A.2). PiNGO and BiNGO are implemented as plugins for Cytoscape (http://www.cytoscape.org/).

A.24 ProteInOn

ProteInOn (http://lasige.di.fc.ul.pt/webtools/proteinon/) is an online tool that calculates semantic similarity between GO terms or proteins annotated with GO terms ([16]).

This tool provides specific capabilities for calculating the semantic similarity between GO terms; however, it does not provide advanced options for GO enrichment.

A.25 STEM

The Short Time-series Expression Miner (STEM) (http://www.cs.cmu.edu/~jernst/stem/) is a program developed in Java for clustering, comparing, and visualizing short time series gene expression data from microarray experiments. STEM recognizes significant temporal expression profiles and the genes associated with these profiles ([15]). STEM is available for free to academic and non-profit users. STEM also determines and visualizes the behavior of genes belonging to a given GO category or user-defined gene set, which can naturally be ordered sequentially.

A.26 StRAnGER

StRAnGER (Statistical Ranking of ANotated Genomic Experimental Results) (http://grissom.gr/stranger/) is an online tool that performs statistical analysis of annotated lists of differentially expressed genes, using biological vocabularies such as the Gene Ontology or the KEGG pathways terms ([7]). StRAnGER uses not only GO terms but also KEGG pathways for its analysis. dieGOS does not offer this option but provides other input and analysis options.

A.27 ToppGene

ToppGene Suite (http://toppgene.cchmc.org/) is an online tool for gene list enrichment analysis and candidate gene prioritization based on functional annotations and protein interaction networks ([8]). It is composed of:

• ToppFun is a tool that detects functional enrichment of a gene list based on Transcriptome, Proteome, Regulome (TFBS and miRNA), Ontologies (GO, Pathway), Phenotype (human disease and mouse phenotype), Pharmacome (Drug-Gene associations), literature co-citation, and other features.

• ToppGene is a tool that prioritizes or ranks candidate genes based on a fuzzy-based functional similarity measure.

• ToppNet is a tool that prioritizes or ranks candidate genes based on topological features in a protein-protein interaction network.

• ToppGenet is a tool that identifies and prioritizes the neighboring genes of the seeds in a protein-protein interaction network based on functional similarity to the “seed” list (ToppGene) or on network topology (ToppNet).

ToppGene Suite is a complete set of tools for enrichment analysis, but it does not have the options provided by dieGOS, such as motif input (see Section 2.1).

A.28 agriGO

agriGO (http://bioinfo.cau.edu.cn/agriGO/) is an online tool and database for gene ontology analysis. It currently supports 45 species and 292 datatypes, with a special focus on agricultural species. It provides PAGE (Parametric Analysis of Gene set Enrichment), BLAST4ID (Transfer IDs by BLAST) and SEACOMPARE (Cross comparison of SEA). These tools open possibilities for data mining and systematic result exploration. PAGE and SEACOMPARE enable cross-comparisons of results ([13]). Finally, agriGO presents a functional analysis tool for GO enrichment in which the user can choose the species of interest, but it lacks the input options that dieGOS offers.

Appendix B

Package Documentation

This appendix presents the structure of the packages used in this project and the main methods and classes referenced in the body of the thesis. For more information, see the JavaDoc package on the companion CD.

B.1 Sequence

This package contains the classes used for working with protein sequences. The classes include functionality for:

• Reading FASTA files and extracting information from the header, such as the organism name and the evidence of protein existence.

• Matching sequences against a regular expression.

• Scoring sequences against a PWM.
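The header extraction mentioned in the first item can be sketched as follows. This is only an illustration, not the code of the Sequence class: it assumes the standard UniProtKB FASTA header layout, in which the organism name follows `OS=` and the protein existence level (1 to 5) follows `PE=`, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FastaHeaderSketch {

    // Organism name: everything after "OS=" up to the next two-letter "XX=" field or the end of the line.
    private static final Pattern ORGANISM = Pattern.compile("OS=(.+?)(?=\\s+[A-Z]{2}=|$)");
    // Protein existence level: a single digit after "PE=".
    private static final Pattern EXISTENCE = Pattern.compile("PE=(\\d)");

    /** Returns the organism name, or null when the header has no OS field. */
    static String organism(String header) {
        Matcher m = ORGANISM.matcher(header);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Returns the protein existence level, or -1 when the header has no PE field. */
    static int proteinExistence(String header) {
        Matcher m = EXISTENCE.matcher(header);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        String header = "sp|P04637|P53_HUMAN Cellular tumor antigen p53 "
                + "OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4";
        System.out.println(organism(header) + " / PE " + proteinExistence(header));
    }
}
```

Grouping proteins by the extracted organism name is what allows a motif to be tested species by species.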

B.1.1 Sequence.java

readFastaFile

Listing B.1: Definition of the method readFastaFile

public static List readFastaFile(InputStream is, Alphabet alphabet) {
    return seqlist;
}
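Since the listing above only shows the method signature, the following self-contained sketch illustrates how FASTA records could be read from an InputStream. It is an assumption-laden illustration rather than the thesis implementation: the `Entry` class is a hypothetical stand-in for Sequence, and the Alphabet argument (presumably used to validate symbols) is left out.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class FastaReaderSketch {

    /** Hypothetical minimal record: header line (without '>') and concatenated residues. */
    static final class Entry {
        final String header;
        final String residues;

        Entry(String header, String residues) {
            this.header = header;
            this.residues = residues;
        }
    }

    /** Reads all FASTA records from the stream; I/O failures are rethrown unchecked. */
    static List<Entry> read(InputStream is) {
        List<Entry> entries = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8))) {
            String header = null;
            StringBuilder residues = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // blank lines carry no information
                }
                if (line.startsWith(">")) {
                    if (header != null) {
                        entries.add(new Entry(header, residues.toString()));
                    }
                    header = line.substring(1);
                    residues.setLength(0);
                } else if (header != null) {
                    residues.append(line); // sequence data may span several lines
                }
            }
            if (header != null) {
                entries.add(new Entry(header, residues.toString()));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    public static void main(String[] args) {
        String fasta = ">seq1 demo\nMKT\nAYI\n>seq2 demo\nGGS\n";
        InputStream in = new ByteArrayInputStream(fasta.getBytes(StandardCharsets.UTF_8));
        for (Entry e : read(in)) {
            System.out.println(e.header + " -> " + e.residues);
        }
    }
}
```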

B.1.2 PWMScore.java

Class used to obtain the score of a sequence against a PWM. It implements Callable so that many sequences can be scored concurrently.

Figure B.1: Class diagram of the sequence package.

Listing B.2: Definition of the Class PWMScore

package sequence;

import java.util.Map;
import java.util.concurrent.Callable;

import util.Triple;

/**
 * Used to concurrently obtain the PWM Score
 * @author Diego Moncayo
 */
public class PWMScore implements Callable<Triple<String, Integer, Double>> {
    Sequence seq;
    int pwmlen;
    Map mapAlphabetPositions;
    double[][] pwmMatrix;

    /**
     * Initializes variables
     * @param sequence
     * @param pwmlen
     * @param mapAlphabetPositions
     * @param pwmMatrix
     */
    public PWMScore(Sequence sequence, int pwmlen, Map mapAlphabetPositions, double[][] pwmMatrix) {
        this.seq = sequence;
        this.pwmlen = pwmlen;
        this.mapAlphabetPositions = mapAlphabetPositions;
        this.pwmMatrix = pwmMatrix.clone();
    }

    /**
     * Compares a PWM against a Sequence
     * Returns the name, maximum score and the position of the sequence:
     * @return seqName, position and maxScore
     */
    public synchronized Triple<String, Integer, Double> call() {
        if (seq.length < pwmlen) {
            // System.out.println("Size of seq is less than pwm length");
            return new Triple<String, Integer, Double>(seq.name, 0, 0d);
        }
        double maxScore = 0d;
        int pos = 0;
        int w = pwmlen; // width of PSSM
        double[] score = new double[seq.length - w + 1];
        char[] sym = seq.sequence.toCharArray();

        int[] indices = new int[sym.length];
        for (int i = 0; i [...]

        for (int i = 0; i [...] maxScore) {
                maxScore = score[i];
                pos = i;
            }
        }
        return new Triple<String, Integer, Double>(seq.name, pos, maxScore);
    }
}
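Because the loop bodies of Listing B.2 are incomplete in this copy, the sketch below shows one way a PWM scan and its concurrent execution could look. It is not the thesis code. The assumptions are: the matrix is indexed as `pwm[symbolRow][position]`, symbols missing from the map contribute zero, and the best score starts at negative infinity (Listing B.2 starts from 0). All names (`PwmScanSketch`, `bestWindow`, `scanAll`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PwmScanSketch {

    /** Returns {bestPosition, bestScore}; {-1, -Infinity} if the sequence is shorter than the PWM. */
    static double[] bestWindow(String seq, Map<Character, Integer> rowOf, double[][] pwm) {
        int w = pwm[0].length; // motif width
        int bestPos = -1;
        double best = Double.NEGATIVE_INFINITY;
        for (int i = 0; i + w <= seq.length(); i++) {
            double s = 0.0;
            for (int j = 0; j < w; j++) {
                Integer row = rowOf.get(seq.charAt(i + j));
                if (row != null) {
                    s += pwm[row][j]; // assumption: unknown symbols add nothing
                }
            }
            if (s > best) {
                best = s;
                bestPos = i;
            }
        }
        return new double[] {bestPos, best};
    }

    /** Scores every sequence on a fixed thread pool; results keep the input order. */
    static List<double[]> scanAll(List<String> seqs, Map<Character, Integer> rowOf,
                                  double[][] pwm, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<double[]>> tasks = new ArrayList<>();
            for (String s : seqs) {
                tasks.add(() -> bestWindow(s, rowOf, pwm));
            }
            List<double[]> results = new ArrayList<>();
            for (Future<double[]> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<Character, Integer> rowOf = new HashMap<>();
        rowOf.put('A', 0);
        rowOf.put('C', 1);
        double[][] pwm = { {1.0, -1.0}, {-1.0, 2.0} }; // toy 2-symbol, width-2 matrix
        for (double[] r : scanAll(Arrays.asList("CACAC", "CC"), rowOf, pwm, 2)) {
            System.out.println("position " + (int) r[0] + ", score " + r[1]);
        }
    }
}
```

Each task only reads the shared map and matrix, so no synchronization is needed in this sketch.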

B.1.3 RegExp.java

Class used for matching a regular expression against a sequence string.

prositeToRE

Listing B.3: Definition of the Method prositeToRE

/**
 * Converts a PROSITE notation pattern string to a regular expression.
 * @author Diego Moncayo
 * @return java regular expression
 */
public static String prositeToRE(String PROSITEPattern) {
    String rePattern;
    rePattern = PROSITEPattern;
    rePattern = rePattern.replace(" ", "");
    rePattern = rePattern.replace("-", "");
    rePattern = rePattern.replace("x", ".");
    rePattern = rePattern.replace("{", "[^");
    rePattern = rePattern.replace("}", "]");
    rePattern = rePattern.replace("(", "{");
    rePattern = rePattern.replace(")", "}");
    rePattern = rePattern.replace(">", "$");
    return rePattern;
}
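The conversion can be tried out with the self-contained sketch below, which mirrors the replacements of Listing B.3 and then uses the result with java.util.regex. Two details are assumptions that go beyond the listing: the trailing period that PROSITE patterns conventionally end with is removed first, and `<` is mapped to `^` as the N-terminal anchor. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class PrositeSketch {

    /** Translates PROSITE pattern syntax into a Java regular expression. */
    static String prositeToRegex(String prosite) {
        String re = prosite.trim();
        if (re.endsWith(".")) {
            // Assumption: strip the terminating period before 'x' is turned into '.'.
            re = re.substring(0, re.length() - 1);
        }
        return re.replace(" ", "")
                 .replace("-", "")    // element separators
                 .replace("x", ".")   // any amino acid
                 .replace("{", "[^")  // {P}   -> [^P]  (excluded residues)
                 .replace("}", "]")
                 .replace("(", "{")   // x(2,4) -> .{2,4} (repetition)
                 .replace(")", "}")
                 .replace("<", "^")   // assumption: N-terminal anchor
                 .replace(">", "$");  // C-terminal anchor
    }

    /** True if the PROSITE pattern occurs anywhere in the sequence. */
    static boolean matches(String prosite, String sequence) {
        return Pattern.compile(prositeToRegex(prosite)).matcher(sequence).find();
    }

    public static void main(String[] args) {
        String glycosylation = "N-{P}-[ST]-{P}."; // N-glycosylation site pattern
        System.out.println(prositeToRegex(glycosylation));
        System.out.println(matches(glycosylation, "ANGTA"));
    }
}
```

The order of the replacements matters: curly braces must be rewritten as character classes before parentheses are rewritten as curly braces, otherwise repetition counts would be turned into negated classes.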

B.2 stats

Provides classes for performing the Fisher's exact test (FET) and Mann-Whitney-Wilcoxon (MWW) statistical significance tests.
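As an illustration of the first of these tests, the sketch below computes a one-sided (enrichment) Fisher's exact p-value for a 2×2 table from the hypergeometric distribution. It is not the code of the stats package, and whether dieGOS uses a one-sided or two-sided test is not stated here; the class name, the method names and the interpretation of the cell counts a, b, c, d are assumptions.

```java
final class FisherSketch {

    /** ln(n!) by direct summation; adequate for illustration-sized counts. */
    static double logFactorial(int n) {
        double s = 0.0;
        for (int i = 2; i <= n; i++) {
            s += Math.log(i);
        }
        return s;
    }

    /** ln of the binomial coefficient C(n, k). */
    static double logChoose(int n, int k) {
        return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
    }

    /**
     * Right-tail p-value P(X >= a) for the table
     *
     *                    annotated   not annotated
     *   motif match          a             b
     *   no motif match       c             d
     *
     * with all row and column totals held fixed.
     */
    static double rightTail(int a, int b, int c, int d) {
        int row1 = a + b;
        int col1 = a + c;
        int n = a + b + c + d;
        int kMax = Math.min(row1, col1);
        double logDenominator = logChoose(n, row1);
        double p = 0.0;
        for (int k = a; k <= kMax; k++) {
            // P(X = k) = C(col1, k) * C(n - col1, row1 - k) / C(n, row1)
            p += Math.exp(logChoose(col1, k) + logChoose(n - col1, row1 - k) - logDenominator);
        }
        return Math.min(1.0, p);
    }

    public static void main(String[] args) {
        // Classic 3/1/1/3 table: p = 17/70.
        System.out.println(rightTail(3, 1, 1, 3));
    }
}
```

Working with logarithms of factorials avoids overflow when the counts are in the thousands, as they are for whole-proteome tables.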

B.3 go

Provides classes for loading and accessing GO terms and annotations.

B.4 net

Provides classes and interfaces for loading, saving and processing network data.

B.5 taxonomy

Provides classes to read the common vocabulary list of organisms used by UniProt and the taxonomy file provided by NCBI.

Figure B.2: Class diagram of the stats package.

B.6 data

Provides classes to read data used by the web application.

B.7 bean

Provides the artifacts to connect the user interface with Java objects.

B.8 controller

Provides classes to control the user interface.

Figure B.3: Class diagram of the go package.

Figure B.4: Class diagram of the net package.

Figure B.5: Class diagram of the taxonomy package.

Figure B.6: Class diagram of the data package.

Figure B.7: Class diagram of the bean package.

Figure B.8: Class diagram of the controller package.

Bibliography

[1] R Apweiler, C O'Donovan, M Magrane, Y Alam-Faruque, R Antunes, B Bely, M Bingley, L Bower, B Bursteinas, G Chavali, et al. Reorganizing the protein space at the universal protein resource (uniprot). Nucleic acids research, 40(Database issue):D71–5, 2012.

[2] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25–29, 2000.

[3] Timothy L Bailey, Nadya Williams, Chris Misleh, and Wilfred W Li. Meme: discovering and analyzing dna and protein sequence motifs. Nucleic acids research, 34(suppl 2):W369–W373, 2006.

[4] Elizabeth I Boyle, Shuai Weng, Jeremy Gollub, Heng Jin, David Botstein, J Michael Cherry, and Gavin Sherlock. Go::termfinder: open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics, 20(18):3710–3715, 2004.

[5] Seth Carbon, Amelia Ireland, Christopher J Mungall, ShengQiang Shu, Brad Marshall, Suzanna Lewis, et al. Amigo: online access to ontology and annotation data. Bioinformatics, 25(2):288–289, 2009.

[6] Chiung-Wen Chang, Rafael Lemos Miguez Couñago, Simon J Williams, Mikael Bodén, and Boštjan Kobe. Crystal structure of rice importin-α and structural basis of its interaction with plant-specific nuclear localization signals. The Plant Cell Online, 24(12):5074–5088, 2012.

[7] Aristotelis A Chatziioannou and Panagiotis Moulos. Exploiting statistical methodologies and controlled vocabularies for prioritized functional analysis of genomic experiments: the stranger web application. Frontiers in neuroscience, 5, 2011.


[8] Jing Chen, Eric E Bardes, Bruce J Aronow, and Anil G Jegga. Toppgene suite for gene list enrichment analysis and candidate gene prioritization. Nucleic acids research, 37(suppl 2):W305–W311, 2009.

[9] IUPAC-IUB Comm. A one-letter notation for amino acid sequences. tentative rules. Biochemistry, 7(8):2703–2705, 1968.

[10] Glynn Dennis Jr, Brad T Sherman, Douglas A Hosack, Jun Yang, Wei Gao, H Clifford Lane, Richard A Lempicki, et al. David: database for annotation, visualization, and integrated discovery. Genome Biol, 4(5):P3, 2003.

[11] Patrik D’haeseleer. What are dna sequence motifs? Nature biotechnology, 24(4):423–425, 2006.

[12] Yadolah Dodge. The Oxford dictionary of statistical terms. Oxford University Press, 2006.

[13] Zhou Du, Xin Zhou, Yi Ling, Zhenhai Zhang, and Zhen Su. agrigo: a go analysis toolkit for the agricultural community. Nucleic acids research, 38(suppl 2):W64–W70, 2010.

[14] Eran Eden, Roy Navon, Israel Steinfeld, Doron Lipson, and Zohar Yakhini. Gorilla: a tool for discovery and visualization of enriched go terms in ranked gene lists. BMC bioinformatics, 10(1):48, 2009.

[15] Jason Ernst and Ziv Bar-Joseph. Stem: a tool for the analysis of short time series gene expression data. BMC bioinformatics, 7(1):191, 2006.

[16] Daniel Faria, Catia Pesquita, Francisco M Couto, and André Falcão. Proteinon: A web tool for protein semantic similarity. 2007.

[17] Scott Federhen. The ncbi taxonomy database. Nucleic acids research, 40(D1):D136–D143, 2012.

[18] Ronald A Fisher. On the interpretation of χ² from contingency tables, and the calculation of p. Journal of the Royal Statistical Society, 85(1):87–94, 1922.

[19] Nicholas Furnham, John S Garavelli, Rolf Apweiler, and Janet M Thornton. Missing in action: enzyme functional annotations in biological databases. Nature chemical biology, 5(8):521–525, 2009.

[20] Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10):R80, 2004.

[21] Glenn R Hicks and Natasha V Raikhel. Protein import into the nucleus: an integrated view. Annual review of cell and developmental biology, 11(1):155–188, 1995.

[22] Desmond G Higgins, Alan J Bleasby, and Rainer Fuchs. Clustal v: improved software for multiple sequence alignment. Computer applications in the biosciences: CABIOS, 8(2):189–191, 1992.

[23] Purvesh Khatri and Sorin Drăghici. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, 21(18):3587–3595, 2005.

[24] Purvesh Khatri, Sorin Draghici, G Charles Ostermeier, and Stephen A Krawetz. Profiling gene expression using onto-express. Genomics, 79(2):266–270, 2002.

[25] Homin K Lee, William Braynen, Kiran Keshav, and Paul Pavlidis. Erminej: tool for functional analysis of gene expression data sets. BMC bioinformatics, 6(1):269, 2005.

[26] Wen-Dar Lin, Yun-Ching Chen, Jan-Ming Ho, and Chung-Der Hsiao. Gobu: toward an integration interface for biological objects. Journal of information science and engineering, 22(1):19–29, 2006.

[27] Steven Maere, Karel Heymans, and Martin Kuiper. Bingo: a cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics, 21(16):3448–3449, 2005.

[28] Henry B Mann and Donald R Whitney. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, 18(1):50–60, 1947.

[29] Cory Y McLean, Dave Bristor, Michael Hiller, Shoa L Clarke, Bruce T Schaar, Craig B Lowe, Aaron M Wenger, and Gill Bejerano. Great improves functional interpretation of cis-regulatory regions. Nature biotechnology, 28(5):495–501, 2010.

[30] Jüri Reimand, Tambet Arak, and Jaak Vilo. g:Profiler: a web server for functional interpretation of gene lists (2011 update). Nucleic acids research, 39(suppl 2):W307–W315, 2011.

[31] Jüri Reimand, Meelis Kull, Hedi Peterson, Jaanus Hansen, and Jaak Vilo. g:Profiler: a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic acids research, 35(suppl 2):W193–W200, 2007.

[32] Jüri Reimand, Laur Tooming, Hedi Peterson, Priit Adler, and Jaak Vilo. Graphweb: mining heterogeneous biological networks for gene modules with functional significance. Nucleic acids research, 36(suppl 2):W452–W459, 2008.

[33] Diana Marcela Sánchez, José María Cavero, and Esperanza Marcos Martínez. The road toward ontologies. In Ontologies, pages 3–20. Springer, 2007.

[34] Christian JA Sigrist, Lorenzo Cerutti, Nicolas Hulo, Alexandre Gattiker, Laurent Falquet, Marco Pagni, Amos Bairoch, and Philipp Bucher. Prosite: a documented database using patterns and profiles as motif descriptors. Briefings in bioinformatics, 3(3):265–274, 2002.

[35] Hongmei Sun, Hong Fang, Tao Chen, Roger Perkins, and Weida Tong. Goffa: gene ontology for functional analysis–a fda gene ontology tool for analysis of genomic and proteomic data. BMC bioinformatics, 7(Suppl 2):S23, 2006.

[36] David Warde-Farley, Sylva L Donaldson, Ovi Comes, Khalid Zuberi, Rashad Badrawi, Pauline Chao, Max Franz, Chris Grouios, Farzana Kazi, Christian Tannus Lopes, et al. The genemania prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic acids research, 38(suppl 2):W214–W220, 2010.

[37] Wikipedia. Big O notation. Wikipedia, the free encyclopedia, 2013. [Online; accessed 03-November-2013].

[38] Wikipedia. Hash table. Wikipedia, the free encyclopedia, 2013. [Online; accessed 03-November-2013].

[39] Wikipedia. Taxonomy (biology). Wikipedia, the free encyclopedia, 2013. [Online; accessed 25-October-2013].

[40] Marketa J Zvelebil and Jeremy O Baum. Understanding bioinformatics. Garland Science, 2008.