dieGOS - GO & Species Enrichment Analysis

by DIEGO MONCAYO M.

Supervisors: Dr Mikael Bodén, Dr Minh Duc Cao

Department of Electrical and Computer Engineering, University of Queensland.
Submitted for the degree of Master of Computer Science.
NOVEMBER 2013

32/1 Mitre St.
St. Lucia QLD 4072
Tel. (04)3102 6180

November 18, 2013

The Dean
School of Engineering
University of Queensland
St Lucia, Q 4072

Dear Professor Paul Strooper,

In accordance with the requirements of the degree of Master of Computer Science in the School of Information Technology and Electrical Engineering, I present the following thesis entitled "dieGOS - GO & Species Enrichment Analysis". This work was performed under the supervision of Dr Mikael Bodén and Dr Minh Duc Cao.

I declare that the work submitted in this thesis is my own, except as acknowledged in the text and footnotes, and has not been previously submitted for a degree at the University of Queensland or any other institution.

Yours sincerely,
DIEGO MONCAYO M.

To my beloved family, especially to my parents.

Acknowledgments

I wish to express my sincere gratitude to the main supervisor of this thesis, Dr Mikael Bodén, for giving me the opportunity to do this project and for his valuable guidance, advice, and help. I would also like to thank him for his support and assistance during the recovery following my accident. I would also like to thank Dr Minh Duc Cao for his recommendations and comments during his supervision of this thesis, and the whole Bioinformatics Research Group, especially Dr Yosephine Gumulya and PhD student Julian Zaugg, for their suggestions about the tool.

Abstract

This thesis addresses the lack of tools for determining the statistical significance of a user-specified sequence motif in relation to the protein sequences of a species and to the annotations associated with those sequences. The tool, called dieGOS, is implemented as a web service published on the Internet.
It enables the user to input different types of sequence motif, select the species of interest, choose the preferred statistical significance test, and obtain tabular and visual results. The web service helps researchers investigate the significance of a particular motif for a group of species and across the Gene Ontology terms associated with those species. Obtaining the results this tool provides usually requires developing or applying several different tools over large files. The tool integrates all of this functionality in one simple view, with advanced input and output visual components that simplify user interaction. Because the user experience depends on the algorithms that process the data, these have been optimized for effectiveness and efficiency so that results are returned in an acceptable time frame. The results obtained by applying the tool to the data of a real study differed slightly from those of the study. However, they are consistent with the numbers involved in the statistical tests.

Contents

Acknowledgments
Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Challenges
2 Background
  2.1 Sequence Motif
    2.1.1 PROSITE Pattern notation
    2.1.2 Matrices
  2.2 Protein Localisation Signal
  2.3 UniProtKB
  2.4 Taxonomy
    2.4.1 NCBI Taxonomy database
  2.5 Gene Ontology
    2.5.1 Ontology
    2.5.2 Gene Ontology Project
    2.5.3 Structure of GO
  2.6 Statistical Test
    2.6.1 Fisher's exact test
    2.6.2 Mann-Whitney-Wilcoxon test
3 Previous Work
  3.1 Repository
  3.2 Gene Ontology enrichment tools
4 Materials
  4.1 Data
    4.1.1 Protein Sequences
    4.1.2 Gene Ontology
    4.1.3 Taxonomy
    4.1.4 Statistics of Database Files
  4.2 Technology
    4.2.1 Computer
    4.2.2 Programming Language
    4.2.3 Integrated Development Environment
    4.2.4 Libraries
    4.2.5 Server
5 Methods
  5.1 Motif to Protein algorithm
    5.1.1 Loading and accessing protein sequence data
    5.1.2 Sequence Matching and Scoring
  5.2 GO Associations
    5.2.1 Mapping sequences and annotations
    5.2.2 Extracting GO terms from DAG
  5.3 Taxonomy Tree
    5.3.1 Generating Taxonomy
  5.4 Statistical Test
    5.4.1 Species Enrichment
    5.4.2 Gene Ontology Enrichment
  5.5 Web Service
    5.5.1 dieGOS
    5.5.2 Input
    5.5.3 Species Tree
    5.5.4 Statistical Test
    5.5.5 Results
    5.5.6 Visualization
6 Results and Discussion
  6.1 Results Comparison
  6.2 Similar tools Comparison
7 Conclusions
  7.1 Possible future work
  7.2 Comments
Appendices
A Similar Tools
  A.1 AmiGO
  A.2 BiNGO
  A.3 Bioconductor
  A.4 DAVID
  A.5 ermineJ
  A.6 FuSSiMeG
  A.7 g:Profiler
  A.8 GOrilla
  A.9 GREAT
  A.10 GOBU
  A.11 GOFFA
  A.12 GeneMANIA
  A.13 GeneMerge
  A.14 GeneTools
  A.15 TermFinder
  A.16 GOTermMapper
  A.17 GoBean
  A.18 GraphWeb
  A.19 GoMiner
  A.20 NOA
  A.21 Onto-Express
  A.22 OE2GO
  A.23 PiNGO
  A.24 ProteInOn
  A.25 STEM
  A.26 StRAnGER
  A.27 ToppGene
  A.28 agriGO
B Package Documentation
  B.1 Sequence
    B.1.1 Sequence.java
    B.1.2 PWMScore.java
    B.1.3 RegExp.java
  B.2 stats
  B.3 go
  B.4 net
  B.5 taxonomy
  B.6 data
  B.7 bean
  B.8 controller

List of Figures

2.1 Flow diagram of the UniProtKB annotation process.
2.2 Screenshot from the software OBO-Edit, showing a small set of terms from the ontology.
5.1 Timeline for each stage.
5.2 10 runs readFastaFile.
5.3 Used heap after read fasta file.
5.4 10 runs RegExp.search.
5.5 dieGOS Logo
...
