dieGOS GO & SPECIES ENRICHMENT ANALYSIS
by DIEGO MONCAYO M.
Supervisors: Dr Mikael Bodén, Dr Minh Duc Cao.
Department of Electrical and Computer Engineering, University of Queensland.
Submitted for the degree of Master of Computer Science. NOVEMBER 2013.
32/1 Mitre St. St. Lucia QLD 4072 Tel. (04)3102 6180 November 18, 2013
The Dean School of Engineering University of Queensland St Lucia, Q 4072
Dear Professor Paul Strooper,
In accordance with the requirements of the degree of Master of Computer Science in the School of Information Technology and Electrical Engineering, I present the following thesis entitled “dieGOS - GO & Species Enrichment Analysis”. This work was performed under the supervision of Dr Mikael Bodén and Dr Minh Duc Cao. I declare that the work submitted in this thesis is my own, except as acknowledged in the text and footnotes, and has not been previously submitted for a degree at the University of Queensland or any other institution.
Yours sincerely,
DIEGO MONCAYO M.
To my beloved family, especially to my parents.

Acknowledgments
I wish to express my sincere gratitude to the main supervisor of this thesis, Dr Mikael Bodén, for providing me the opportunity to do this project and for giving his valuable guidance, advice, and help. I would also like to thank him for his support and assistance during the recovery following my accident. I would also like to thank Dr Minh Duc Cao for his recommendations and comments during his supervision of this thesis work, and the whole Bioinformatics Research Group, especially Dr Yosephine Gumulya and PhD student Julian Zaugg, for their suggestions about the tool.
Abstract
This thesis addresses the lack of tools for determining the statistical significance of a user-specified sequence motif associated with the protein sequences of a species and with the annotations related to those sequences. The tool, called dieGOS, is implemented as a web service published on the Internet. It enables the user to input different types of sequence motif, to select the species of interest, to choose the preferred statistical significance test, and to obtain tabular and visual results. The web service helps researchers investigate the significance of a particular motif for a group of species and among the Gene Ontology terms associated with those species. Obtaining the results this tool provides usually requires developing or applying several different tools over large files. dieGOS integrates all of this functionality in one simple view, with advanced input and output visual components that simplify user interaction. Because the user experience depends on the algorithms that process the data, they have been optimized to be effective and efficient, and thus return results in an acceptable time frame. The results obtained by applying the tool to the data of a real study differed slightly from those of the study. However, they are consistent with the numbers involved in the statistical tests.
Contents

Acknowledgments
Abstract
List of Figures
List of Tables

1 Introduction
    1.1 Challenges

2 Background
    2.1 Sequence Motif
        2.1.1 PROSITE Pattern notation
        2.1.2 Matrices
    2.2 Protein Localisation Signal
    2.3 UniProtKB
    2.4 Taxonomy
        2.4.1 NCBI Taxonomy database
    2.5 Gene Ontology
        2.5.1 Ontology
        2.5.2 Gene Ontology Project
        2.5.3 Structure of GO
    2.6 Statistical Test
        2.6.1 Fisher’s exact test
        2.6.2 Mann-Whitney-Wilcoxon test

3 Previous Work
    3.1 Repository
    3.2 Gene Ontology enrichment tools

4 Materials
    4.1 Data
        4.1.1 Protein Sequences
        4.1.2 Gene Ontology
        4.1.3 Taxonomy
        4.1.4 Statistics of Database Files
    4.2 Technology
        4.2.1 Computer
        4.2.2 Programming Language
        4.2.3 Integrated Development Environment
        4.2.4 Libraries
        4.2.5 Server

5 Methods
    5.1 Motif to Protein algorithm
        5.1.1 Loading and accessing protein sequence data
        5.1.2 Sequence Matching and Scoring
    5.2 GO Associations
        5.2.1 Mapping sequences and annotations
        5.2.2 Extracting GO terms from DAG
    5.3 Taxonomy Tree
        5.3.1 Generating Taxonomy
    5.4 Statistical Test
        5.4.1 Species Enrichment
        5.4.2 Gene Ontology Enrichment
    5.5 Web Service
        5.5.1 dieGOS
        5.5.2 Input
        5.5.3 Species Tree
        5.5.4 Statistical Test
        5.5.5 Results
        5.5.6 Visualization

6 Results and Discussion
    6.1 Results Comparison
    6.2 Similar tools Comparison

7 Conclusions
    7.1 Possible future work
    7.2 Comments

Appendices

A Similar Tools
    A.1 AmiGO
    A.2 BiNGO
    A.3 Bioconductor
    A.4 DAVID
    A.5 ermineJ
    A.6 FuSSiMeG
    A.7 g:Profiler
    A.8 GOrilla
    A.9 GREAT
    A.10 GOBU
    A.11 GOFFA
    A.12 GeneMANIA
    A.13 GeneMerge
    A.14 GeneTools
    A.15 TermFinder
    A.16 GOTermMapper
    A.17 GoBean
    A.18 GraphWeb
    A.19 GoMiner
    A.20 NOA
    A.21 Onto-Express
    A.22 OE2GO
    A.23 PiNGO
    A.24 ProteInOn
    A.25 STEM
    A.26 StRAnGER
    A.27 ToppGene
    A.28 agriGO

B Package Documentation
    B.1 Sequence
        B.1.1 Sequence.java
        B.1.2 PWMScore.java
        B.1.3 RegExp.java
    B.2 stats
    B.3 go
    B.4 net
    B.5 taxonomy
    B.6 data
    B.7 bean
    B.8 controller

List of Figures
2.1 Flow diagram of the UniProtKB annotation process.
2.2 Screenshot from the software OBO-Edit, showing a small set of terms from the ontology.

5.1 Timeline for each stage.
5.2 10 runs readFastaFile.
5.3 Used heap after read fasta file.
5.4 10 runs RegExp.search.
5.5 dieGOS Logo
5.6 dieGOS: Input component
5.7 dieGOS: Input Advanced Options for PWM component
5.8 dieGOS: Species Tree
5.9 dieGOS: Statistical Significance Test
5.10 dieGOS: Results
5.11 dieGOS: Tree Enriched Species
5.12 dieGOS: Bundle Species and GO terms
5.13 visualVM: Startup process

6.1 Species enrichment of NLS motif using FET
6.2 Species enrichment of NLS motif using MWW
6.3 Enrichment results graph

B.1 Classes diagram of sequence package.
B.2 Classes diagram of stats package.
B.3 Classes diagram of go package.
B.4 Classes diagram of net package.
B.5 Classes diagram of taxonomy package.
B.6 Classes diagram of data package.
B.7 Classes diagram of bean package.
B.8 Classes diagram of controller package.

List of Tables
2.1 Position Frequency Matrix for Cytochrome P450
2.2 Position Weight Matrix for Cytochrome P450
2.3 Confusion Matrix for hypergeometric distribution

4.1 Database files and Statistics
4.2 Libraries and description

5.1 Comparison normal and concurrent PWM Scoring
5.2 Statistics GO Domain DAG.
5.3 GO files after processing
5.4 10 runs FET for species and GO terms

6.1 Results: Comparison contingency matrices
6.2 Comparison between GO enrichment tools.

Chapter 1
Introduction
This project aims to develop a tool that determines the statistical significance of a user-specified sequence motif associated with a species and identifies the most representative Gene Ontology (GO) terms of the matching sequences for that species. The tool should also be accessible to scientists worldwide. Beyond the tool’s applicability, the project aims to design and implement algorithms that are effective and efficient in terms of both time and space complexity.

The project evaluates the tool’s capability to find whether certain known protein localisation signals (PLS) are species-specific and what molecular functions are associated with the species of interest.

There are functional gaps in similar enrichment tools that this project addresses. It includes an analysis and comparison of the functionality of these GO enrichment tools and this tool. Since this project involves accessing and searching large databases and applying complex statistical tests, it considers the computational complexity of the implemented algorithms.

After obtaining a specific motif using motif discovery tools, scientists can use it as input to this application, which characterizes the specificity of the motif to different species. For example, we can test whether a particular PLS is plant-specific or not. Thus, the outcome of this project contributes to the wider scientific community.

It is important for the tool to support different representations of the sequence motif. The motif is mainly represented as a regular expression or a Position Weight Matrix (PWM), and there are many formats available for these representations. The tool accepts the formats currently available (PROSITE notation, regular expression, Clustal file format, MEME file format, Distrib file format).
The user also needs to select the species of interest from the taxonomy tree, but because of the number of species, a mechanism is needed to browse the tree easily and to search it for a particular species. The tool provides these capabilities: the user can browse an expandable taxonomy tree and search for a species by its name or taxon number. Data from NCBI and UniProt are used to build this tree.

In order to determine if the protein properties are shared between species, the tool looks for matches between the user-specified sequence motif and a database of well-annotated protein sequences. UniProt, and specifically SwissProt, is the protein database chosen for seeking motif matches because of its high-quality, manually added protein annotations. This ensures that the result of the analysis is based on reliable information. On the other hand, many proteins have only computational (unreviewed) annotations, and these may belong to species that are not present in SwissProt.

A statistical significance test such as Fisher’s exact test (FET) is necessary to determine if the matching and non-matching sequences for a species are significantly different from those of every other species. The Mann-Whitney-Wilcoxon (MWW) test can also be used if a PWM representation of a motif is applied and a threshold to split sequences is not known. In order to determine if the Molecular Function or another GO domain property is statistically associated with a species, FET is used with the GO terms (annotations) obtained from the matching sequences as input.

Tabular and visual results are necessary to analyse the generated information effectively. The tabular format includes functionality such as searching, pagination, ordering of results, exporting, and links to information about the enriched species and GO terms. The visualizations of results enable the analysis of the enrichment between species and between the GO terms of those species.

An easy access mechanism is necessary to make the tool openly available, and the Internet is the natural choice. The tool is therefore implemented as a web service that lets the user specify the input motif and the species to be analysed, and obtain useful results. Users can be expected to perform many different tests, so the data should be processed and the results returned in an acceptable time frame.
1.1 Challenges
The following list shows the main challenges faced during the development of the tool. The proposed solutions are explained in the Methods chapter.
• Protein access The number of proteins in the SwissProt database is around 540,000 to date. Quick access to the data is necessary to extract key protein information such as species and sequence.
• Protein scan As the motif can be represented as a regular expression (regexp) or a PWM, looking for matches against a large protein database has to be performed efficiently in both cases.
• GO annotations access Due to the relations between GO terms and the enormous number of GO annotations, a data structure capable of quickly extracting the direct and transitive annotations of the proteins is required.
• Statistical test The statistical significance test consumes a lot of computational time, especially for the hypergeometric distribution, whose factorials produce very large numbers. Computationally feasible alternative approximations for these calculations are needed.

Chapter 2
Background
2.1 Sequence Motif
A motif is a recurring nucleotide or amino acid pattern that is presumed to have a biologically significant function. It is usually composed of a short sequence of contiguous residues, although sometimes it shows a more distributed pattern. Functionally related sequences often share similar distribution patterns for specific functional residues [40]. Sequence motifs need to be described and represented in order to work with them.
2.1.1 PROSITE Pattern notation
The PROSITE notation uses a variant of regular expression (regexp) syntax for describing a motif, as well as other notations. Regexp pattern matching is a well-studied topic in Computer Science, and many algorithms and tools have been developed to match a regexp in a text string. Although this approach to representing motifs is widely used, a regexp is too rigid to represent highly divergent protein motifs. The limitation of this notation is its inability to represent the probability that a certain residue or nucleotide is present at a position in the sequence [34]. For example, the consensus pattern for the Cytochrome P450 cysteine heme-iron ligand signature is [FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD].
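To make the relation between the PROSITE notation and an ordinary regexp concrete, the following is a minimal sketch of a converter. It is not the RegExp.java class of the tool; the class name `PrositeToRegex` and its methods are hypothetical, and only the basic PROSITE elements are assumed: `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repetition, and `<` and `>` for the sequence ends.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, written for illustration only.
class PrositeToRegex {

    /** Converts a PROSITE pattern into a java.util.regex expression. */
    static String toRegex(String prosite) {
        String p = prosite.trim();
        if (p.endsWith(".")) {
            p = p.substring(0, p.length() - 1); // drop the terminating period
        }
        StringBuilder regex = new StringBuilder();
        for (String element : p.split("-")) {
            String e = element.trim();
            if (e.isEmpty()) {
                continue;
            }
            if (e.startsWith("<")) {            // N-terminal anchor
                regex.append('^');
                e = e.substring(1);
            }
            boolean cTerminal = e.endsWith(">"); // C-terminal anchor
            if (cTerminal) {
                e = e.substring(0, e.length() - 1);
            }
            // Split off an optional repetition suffix such as x(2) or x(2,4).
            String repeat = "";
            int paren = e.indexOf('(');
            if (paren >= 0) {
                repeat = "{" + e.substring(paren + 1, e.length() - 1) + "}";
                e = e.substring(0, paren);
            }
            if (e.equals("x")) {
                regex.append('.');               // any residue
            } else if (e.startsWith("{")) {
                // {ABC} means "any residue except A, B or C"
                regex.append("[^").append(e, 1, e.length() - 1).append(']');
            } else {
                regex.append(e);                 // single residue or [ABC] class
            }
            regex.append(repeat);
            if (cTerminal) {
                regex.append('$');
            }
        }
        return regex.toString();
    }

    /** Returns the start index of the first match in the sequence, or -1. */
    static int firstMatch(String prosite, String sequence) {
        Matcher m = Pattern.compile(toRegex(prosite)).matcher(sequence);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String p450 = "[FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD].";
        System.out.println(toRegex(p450));
        // The sequence below is made up for the example, not a real protein.
        System.out.println(firstMatch(p450, "MKFSAGARACLGTT"));
    }
}
```

The converter treats every element independently, which keeps it short but also shows the rigidity discussed above: a residue either matches a position or it does not, with no notion of how likely it is.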
2.1.2 Matrices
Another possible description for a motif is a Position Frequency Matrix (PFM). Rather than only keeping track of the most common base at each position, this matrix registers how often each base occurs in known sites [11]. Table 2.1 shows an example of PFM calculated from the Cytochrome P450 cysteine heme-iron ligand signature in Clustal format (Alignment).
Sym  Pos1   Pos2   Pos3   Pos4   Pos5   Pos6   Pos7   Pos8   Pos9   Pos10
A    0.001  0.001  0.050  0.001  0.050  0.001  0.071  0.001  0.072  0.057
C    0.001  0.001  0.050  0.001  0.005  0.001  0.004  0.981  0.001  0.001
D    0.001  0.001  0.050  0.032  0.006  0.001  0.007  0.001  0.001  0.013
E    0.001  0.001  0.050  0.001  0.009  0.001  0.011  0.001  0.001  0.001
F    0.964  0.001  0.050  0.001  0.001  0.001  0.032  0.001  0.019  0.001
G    0.001  0.680  0.050  0.949  0.005  0.001  0.043  0.001  0.001  0.913
H    0.001  0.007  0.050  0.001  0.005  0.132  0.018  0.001  0.001  0.001
I    0.001  0.001  0.050  0.001  0.028  0.001  0.165  0.001  0.291  0.001
K    0.001  0.001  0.050  0.001  0.162  0.018  0.030  0.001  0.001  0.001
L    0.001  0.001  0.050  0.001  0.024  0.001  0.041  0.001  0.228  0.001
M    0.001  0.001  0.050  0.001  0.015  0.001  0.069  0.001  0.015  0.001
N    0.001  0.029  0.050  0.001  0.007  0.001  0.147  0.001  0.001  0.001
P    0.001  0.001  0.050  0.001  0.273  0.013  0.001  0.001  0.247  0.001
Q    0.001  0.001  0.050  0.001  0.029  0.001  0.055  0.001  0.001  0.001
R    0.001  0.001  0.050  0.001  0.261  0.809  0.077  0.001  0.001  0.001
S    0.001  0.267  0.050  0.001  0.030  0.001  0.102  0.001  0.001  0.001
T    0.001  0.001  0.050  0.001  0.017  0.013  0.030  0.001  0.001  0.001
V    0.001  0.001  0.050  0.001  0.067  0.001  0.080  0.001  0.114  0.001
W    0.018  0.001  0.050  0.001  0.001  0.001  0.002  0.001  0.001  0.001
Y    0.001  0.001  0.050  0.001  0.003  0.001  0.014  0.001  0.001  0.001
Table 2.1: Position Frequency Matrix for Cytochrome P450
A PWM contains log-odds weights for computing a match score. To construct a PWM, a set of sequences must be aligned and the most conserved columns extracted. We define the probability of residue type $a$ occurring in column $u$ of the PWM as $q_{u,a}$, and the probability of residue type $a$ occurring at any position in any sequence as $p_a$. The probability $p_a$ can be extracted directly from the PFM, but it is not restricted to the aligned sequences; a background model of sequences can also be used [40]. The log-odds form for a PWM element can be written as

$$ m_{u,a} = \log \frac{q_{u,a}}{p_a} \tag{2.1} $$

The probability $q_{u,a}$ can be determined using

$$ q_{u,a} = \frac{n_{u,a} + b\,p_a}{N + b} \tag{2.2} $$

where $n_{u,a}$ is the number of times that amino acid $a$ is observed in column $u$ of the $N$ sequences, $p_a$ is the prior probability of $a$, and $b$ is the so-called pseudo-count, whose value is set to $N^{1/2}$. Table 2.2 presents the log odds obtained from the PFM presented above and a uniformly distributed background ($1/20 = 0.05$).
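Equations 2.1 and 2.2 can be sketched in a few lines of code. The example below is an illustration under stated assumptions, not the PWMScore.java class of the tool: the class `PwmSketch`, its methods and the tiny two-column alignment are hypothetical, the natural logarithm is assumed (the weights in Table 2.2 are consistent with it), and the match score of a window is taken as the sum of the weights of its residues.

```java
// Hypothetical sketch of Equations 2.1 and 2.2; names are illustrative.
class PwmSketch {
    static final String ALPHABET = "ACDEFGHIKLMNPQRSTVWY";

    /** Uniform background: p_a = 1/20 for every amino acid. */
    static double[] uniformBackground() {
        double[] p = new double[ALPHABET.length()];
        java.util.Arrays.fill(p, 1.0 / ALPHABET.length());
        return p;
    }

    /** Builds the log-odds matrix m[u][a] from aligned, equal-length sequences. */
    static double[][] logOdds(String[] aligned, double[] background) {
        int n = aligned.length;               // N, the number of sequences
        int width = aligned[0].length();      // number of columns u
        double b = Math.sqrt(n);              // pseudo-count b = N^(1/2)
        double[][] m = new double[width][ALPHABET.length()];
        for (int u = 0; u < width; u++) {
            int[] counts = new int[ALPHABET.length()];
            for (String s : aligned) {
                int a = ALPHABET.indexOf(s.charAt(u));
                if (a >= 0) {
                    counts[a]++;              // n_{u,a}; gaps and unknown symbols are skipped
                }
            }
            for (int a = 0; a < ALPHABET.length(); a++) {
                double q = (counts[a] + b * background[a]) / (n + b); // Eq. 2.2
                m[u][a] = Math.log(q / background[a]);                // Eq. 2.1
            }
        }
        return m;
    }

    /** Sums the weights of a window that is as long as the PWM is wide. */
    static double score(double[][] m, String window) {
        double total = 0.0;
        for (int u = 0; u < m.length; u++) {
            int a = ALPHABET.indexOf(window.charAt(u));
            if (a >= 0) {
                total += m[u][a];             // unknown symbols contribute nothing
            }
        }
        return total;
    }

    /** Slides the PWM along a sequence and returns the best window score. */
    static double bestScore(double[][] m, String sequence) {
        double best = Double.NEGATIVE_INFINITY;
        for (int i = 0; i + m.length <= sequence.length(); i++) {
            best = Math.max(best, score(m, sequence.substring(i, i + m.length)));
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy alignment, made up for the example.
        String[] aligned = {"AC", "AC", "AC", "AG"};
        double[][] m = logOdds(aligned, uniformBackground());
        System.out.println(bestScore(m, "GGACGG"));
    }
}
```

Because of the pseudo-count, a residue that never occurs in a column still receives a finite negative weight instead of minus infinity, which is why the unobserved entries of Table 2.2 all share the same value.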
2.2 Protein Localisation Signal
Proteins are synthesized on ribosomes in the intracellular fluid. However, they usually need to be transported to their final destination. Typically, a Protein Localisation Signal (PLS) consists of one or more short amino acid sequences that help to direct the protein to its target subcellular location and enable the interaction with the cellular transport machinery [40]. A Nuclear Localisation Signal (NLS) is a specific type of PLS that marks a protein for import into the cell nucleus. As NLSs have been shown to be highly conserved among eukaryotes [21], the proposed tool can be used to examine whether the proteins that match a specific motif share the same properties across species.

Sym  Pos1   Pos2   Pos3   Pos4   Pos5   Pos6   Pos7   Pos8   Pos9   Pos10
A    -3.90  -3.90  +0.00  -3.90  -0.01  -3.90  +0.35  -3.90  +0.36  +0.13
C    -3.90  -3.90  +0.00  -3.90  -2.29  -3.90  -2.51  +2.98  -3.90  -3.90
D    -3.90  -3.90  +0.00  -0.43  -2.11  -3.90  -1.95  -3.90  -3.90  -1.34
E    -3.90  -3.90  +0.00  -3.90  -1.70  -3.90  -1.50  -3.90  -3.90  -3.90
F    +2.96  -3.90  +0.00  -3.90  -3.90  -3.90  -0.43  -3.90  -0.96  -3.90
G    -3.90  +2.61  +0.00  +2.94  -2.29  -3.90  -0.16  -3.90  -3.90  +2.90
H    -3.90  -1.95  +0.00  -3.90  -2.29  +0.97  -1.01  -3.90  -3.90  -3.90
I    -3.90  -3.90  +0.00  -3.90  -0.57  -3.90  +1.19  -3.90  +1.76  -3.90
K    -3.90  -3.90  +0.00  -3.90  +1.18  -1.01  -0.50  -3.90  -3.90  -3.90
L    -3.90  -3.90  +0.00  -3.90  -0.72  -3.90  -0.19  -3.90  +1.52  -3.90
M    -3.90  -3.90  +0.00  -3.90  -1.19  -3.90  +0.32  -3.90  -1.19  -3.90
N    -3.90  -0.53  +0.00  -3.90  -1.95  -3.90  +1.08  -3.90  -3.90  -3.90
P    -3.90  -3.90  +0.00  -3.90  +1.70  -1.34  -3.90  -3.90  +1.60  -3.90
Q    -3.90  -3.90  +0.00  -3.90  -0.53  -3.90  +0.09  -3.90  -3.90  -3.90
R    -3.90  -3.90  +0.00  -3.90  +1.65  +2.78  +0.43  -3.90  -3.90  -3.90
S    -3.90  +1.68  +0.00  -3.90  -0.50  -3.90  +0.72  -3.90  -3.90  -3.90
T    -3.90  -3.90  +0.00  -3.90  -1.07  -1.34  -0.50  -3.90  -3.90  -3.90
V    -3.90  -3.90  +0.00  -3.90  +0.29  -3.90  +0.47  -3.90  +0.83  -3.90
W    -1.01  -3.90  +0.00  -3.90  -3.90  -3.90  -3.21  -3.90  -3.90  -3.90
Y    -3.90  -3.90  +0.00  -3.90  -2.80  -3.90  -1.26  -3.90  -3.90  -3.90

Table 2.2: Position Weight Matrix for Cytochrome P450
2.3 UniProtKB
The Universal Protein Knowledgebase (UniProtKB) is one of the four databases of the Universal Protein Resource (UniProt). It is the central access point for extensively curated protein information and is composed of Swiss-Prot and TrEMBL. Swiss-Prot is manually annotated and reviewed, whereas TrEMBL is automatically annotated and is not reviewed. The annotation process used in UniProtKB is shown in Fig. 2.1 as a flow diagram with the manually curated entries of UniProtKB/Swiss-Prot and the automated entries of UniProtKB/TrEMBL [19]. UniProtKB includes complete proteome and reference proteome sets. The data is updated and distributed every 4 weeks and can be searched online or downloaded at http://www.uniprot.org (see [1]).
Figure 2.1: Flow diagram of the UniprotKB annotation process.
2.4 Taxonomy
Taxonomy (biology) is the science of defining groups of biological organisms based on shared characteristics and giving names to those groups [39]. Organisms are grouped together into taxa (singular: taxon) and these groups may be aggregated as well to form a super group and thus create a taxonomic hierarchy.
2.4.1 NCBI Taxonomy database
The NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy) includes organism names and taxonomic lineages for each of the sequences represented in the INSDC nucleotide and protein sequence databases (EMBL/GenBank/DDBJ). The taxonomy database is manually curated to maintain a phylogenetic taxonomy for the organisms represented in the sequence databases. UniProt uses this taxonomic lineage data of the source organism in its database. This project uses the same data to build a taxonomy tree that enables the user to browse the tree and find the species of interest [17].
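As an illustration of how such lineage data can support browsing and searching, the sketch below stores one parent pointer per taxon and walks it upwards. Everything in it is an assumption made for the example: the class `TaxonomySketch`, the made-up taxon identifiers and the heavily abbreviated lineages do not come from the NCBI files, and the convention that the root is its own parent is simply chosen here to terminate the walk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a parent-pointer taxonomy tree.
class TaxonomySketch {
    private final Map<Integer, Integer> parentOf = new HashMap<>();
    private final Map<Integer, String> nameOf = new HashMap<>();

    void addNode(int taxId, int parentId, String name) {
        parentOf.put(taxId, parentId);
        nameOf.put(taxId, name);
    }

    /** Names from the root down to the given taxon. */
    List<String> lineage(int taxId) {
        List<String> names = new ArrayList<>();
        Integer current = taxId;
        while (current != null && nameOf.containsKey(current)) {
            names.add(nameOf.get(current));
            Integer parent = parentOf.get(current);
            if (parent == null || parent.equals(current)) {
                break;                      // reached the root
            }
            current = parent;
        }
        Collections.reverse(names);
        return names;
    }

    /** True if ancestorId lies on the path from taxId to the root. */
    boolean isDescendant(int taxId, int ancestorId) {
        Integer current = parentOf.get(taxId);
        while (current != null) {
            if (current == ancestorId) {
                return true;
            }
            Integer parent = parentOf.get(current);
            if (parent == null || parent.equals(current)) {
                return false;               // root reached without finding it
            }
            current = parent;
        }
        return false;
    }

    /** Toy tree with made-up identifiers and abbreviated lineages. */
    static TaxonomySketch demo() {
        TaxonomySketch t = new TaxonomySketch();
        t.addNode(1, 1, "root");
        t.addNode(10, 1, "Eukaryota");
        t.addNode(20, 10, "Viridiplantae");
        t.addNode(21, 20, "Arabidopsis thaliana");
        t.addNode(30, 10, "Metazoa");
        t.addNode(31, 30, "Homo sapiens");
        return t;
    }

    public static void main(String[] args) {
        TaxonomySketch t = demo();
        System.out.println(t.lineage(21));
        System.out.println(t.isDescendant(31, 20));
    }
}
```

A check such as `isDescendant` is the kind of question the tool asks when it tests whether a signal is plant-specific: a protein counts towards a group if its source organism lies below the group's node.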
2.5 Gene Ontology
2.5.1 Ontology
At present, there is a huge amount of information in biological databases. The objective of this project is to acquire knowledge using this information. The problem is that this information is not always characterized by a uniformity of terms. In Computer Science, one of the methods that aims to satisfy this need for a common understanding of concepts is the creation of ontologies. An ontology can be defined as a way of representing a common understanding of a domain based on a consensus about the meaning of terms and the relationships between them [33].
2.5.2 Gene Ontology Project
The Gene Ontology project (http://www.geneontology.org/) is a collaborative effort to maintain and develop ontologies to support biologically meaningful annotation of genes and their products [2]. It provides a “common language” through ontology terms and their structure. The use of a consistent vocabulary allows genes from different species to be compared based on their GO annotations (terms). The Gene Ontology Annotation database (GOA) is the source used in this project.
2.5.3 Structure of GO
The structure of the GO is important for the project because it defines the relationships between ontology terms. They resemble a hierarchy, such that higher-level terms are more general and are assigned to more genes, and more specific descendant terms are related to their parents by either “is a” or “part of” relationships. Nevertheless, a term may have more than one parent term, so the relationships form a directed acyclic graph (DAG) [2]. The three GO domains (cellular component, biological process, and molecular function) are each represented by an ontology term. The domain terms are unrelated and do not have a common parent node. This design allows inferring the domain and relationships that will be evaluated for statistical significance (refer to Fig. 2.2). This process is called ‘GO enrichment analysis’.
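Because a term can have several parents, collecting all the more general terms of an annotation is a graph traversal rather than a walk along a single path. The following is a minimal sketch of that traversal; the class `GoDagSketch` and the term identifiers `T0` to `T3` are hypothetical and stand in for real GO terms.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of ancestor lookup in a term DAG.
class GoDagSketch {
    // child term -> its direct parents ("is a" or "part of" edges)
    private final Map<String, Set<String>> parents = new HashMap<>();

    void addRelation(String child, String parent) {
        parents.computeIfAbsent(child, k -> new HashSet<>()).add(parent);
    }

    /** All terms reachable by following parent edges, excluding the term itself. */
    Set<String> ancestors(String term) {
        Set<String> seen = new TreeSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(term);
        while (!stack.isEmpty()) {
            String t = stack.pop();
            for (String p : parents.getOrDefault(t, Collections.emptySet())) {
                if (seen.add(p)) {          // visit each ancestor only once
                    stack.push(p);
                }
            }
        }
        return seen;
    }

    /** Toy DAG: T3 has two parents, T1 and T2, which share the root T0. */
    static GoDagSketch demo() {
        GoDagSketch dag = new GoDagSketch();
        dag.addRelation("T1", "T0");
        dag.addRelation("T2", "T0");
        dag.addRelation("T3", "T1");
        dag.addRelation("T3", "T2");
        return dag;
    }

    public static void main(String[] args) {
        System.out.println(demo().ancestors("T3"));
    }
}
```

A protein annotated directly with `T3` would, under this sketch, also count as transitively annotated with `T0`, `T1` and `T2`, and the `seen` set guarantees that the shared root is counted once even though two paths lead to it.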
2.6 Statistical Test
2.6.1 Fisher’s exact test
Fisher’s exact test [18] is a statistical significance test applied in the analysis of contingency tables to examine the significance of the association between observations. In other words, the test determines whether the observations reflect a real pattern rather than arising by chance. In our case, we want to determine whether proteins that contain a specific motif share some properties between species. The test computes the probability of the observed data based on the hypergeometric distribution. The motif-matching proteins with a property to be tested (for example, species-specific) are part of the foreground and all the other proteins that have the property are part of the background. Table 2.3 represents the confusion matrix for species enrichment, where a, b, c and d are the counts of proteins grouped by matching or not matching the motif, and by being or not being part of the species.

Figure 2.2: Screenshot from the software OBO-Edit, showing a small set of terms from the ontology.

              Match   NoMatch   Row Total
Species       a       b         a + b
Non-Species   c       d         c + d
Column Total  a + c   b + d     a + b + c + d

Table 2.3: Confusion Matrix for hypergeometric distribution

The same principle applies to determining the most enriched GO terms. For each GO term in the motif-matching proteins, we count the numbers of proteins in the foreground set and the background set with and without this term. The test returns the p-value of the term; the smaller the p-value, the greater the evidence of statistical significance [12]. The following is the equation to calculate the p-value.