dieGOS - GO & Species Enrichment Analysis

by DIEGO MONCAYO M.

Supervisors: Dr Mikael Bodén, Dr Minh Duc Cao.
Department of Electrical and Computer Engineering, University of Queensland.
Submitted for the degree of Master of Computer Science.
NOVEMBER 2013.

32/1 Mitre St.
St. Lucia QLD 4072
Tel. (04)3102 6180

November 18, 2013

The Dean
School of Engineering
University of Queensland
St Lucia, Q 4072

Dear Professor Paul Strooper,

In accordance with the requirements of the degree of Master of Computer Science in the School of Information Technology and Electrical Engineering, I present the following thesis entitled "dieGOS - GO & Species Enrichment Analysis". This work was performed under the supervision of Dr Mikael Bodén and Dr Minh Duc Cao.

I declare that the work submitted in this thesis is my own, except as acknowledged in the text and footnotes, and has not been previously submitted for a degree at the University of Queensland or any other institution.

Yours sincerely,
DIEGO MONCAYO M.

To my beloved family, especially to my parents.

Acknowledgments

I wish to express my sincere gratitude to the main supervisor of this thesis, Dr Mikael Bodén, for giving me the opportunity to do this project and for his valuable guidance, advice, and help. I would also like to thank him for his support and assistance during the recovery following my accident. I would also like to thank Dr Minh Duc Cao for his recommendations and comments during his supervision of this thesis work, and the whole Bioinformatics Research Group, especially Dr Yosephine Gumulya and PhD student Julian Zaugg, for their suggestions about the tool.

Abstract

This thesis addresses the lack of tools for determining the statistical significance of a user-specified sequence motif associated with the protein sequences of a species and with the annotations related to those sequences. The tool, called dieGOS, is implemented as a web service published on the Internet.
It enables the user to input different types of sequence motif, to select the species of interest, to choose the preferred statistical significance test, and to obtain tabular and visual results. The web service helps researchers investigate the significance of a particular motif for a group of species and among the Gene Ontology terms associated with those species. Obtaining the results this tool provides usually requires developing or applying several separate tools to large files. dieGOS integrates all of this functionality in one simple view, with powerful options provided by advanced visual input and output components that simplify user interaction. Because the user experience depends on the algorithms that process the data, these have been optimized for effectiveness and efficiency so that results are returned in an acceptable time frame. The results obtained by applying the tool to the data of a real study were slightly different from those of the study; however, they are consistent with the numbers involved in the statistical tests.

Contents

Acknowledgments
Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Challenges
2 Background
  2.1 Sequence Motif
    2.1.1 PROSITE Pattern notation
    2.1.2 Matrices
  2.2 Protein Localisation Signal
  2.3 UniProtKB
  2.4 Taxonomy
    2.4.1 NCBI Taxonomy database
  2.5 Gene Ontology
    2.5.1 Ontology
    2.5.2 Gene Ontology Project
    2.5.3 Structure of GO
  2.6 Statistical Test
    2.6.1 Fisher's exact test
    2.6.2 Mann-Whitney-Wilcoxon test
3 Previous Work
  3.1 Repository
  3.2 Gene Ontology enrichment tools
4 Materials
  4.1 Data
    4.1.1 Protein Sequences
    4.1.2 Gene Ontology
    4.1.3 Taxonomy
    4.1.4 Statistics of Database Files
  4.2 Technology
    4.2.1 Computer
    4.2.2 Programming Language
    4.2.3 Integrated Development Environment
    4.2.4 Libraries
    4.2.5 Server
5 Methods
  5.1 Motif to Protein algorithm
    5.1.1 Loading and accessing protein sequence data
    5.1.2 Sequence Matching and Scoring
  5.2 GO Associations
    5.2.1 Mapping sequences and annotations
    5.2.2 Extracting GO terms from DAG
  5.3 Taxonomy Tree
    5.3.1 Generating Taxonomy
  5.4 Statistical Test
    5.4.1 Species Enrichment
    5.4.2 Gene Ontology Enrichment
  5.5 Web Service
    5.5.1 dieGOS
    5.5.2 Input
    5.5.3 Species Tree
    5.5.4 Statistical Test
    5.5.5 Results
    5.5.6 Visualization
6 Results and Discussion
  6.1 Results Comparison
  6.2 Similar tools Comparison
7 Conclusions
  7.1 Possible future work
  7.2 Comments
Appendices
A Similar Tools
  A.1 AmiGO
  A.2 BiNGO
  A.3 Bioconductor
  A.4 DAVID
  A.5 ermineJ
  A.6 FuSSiMeG
  A.7 g:Profiler
  A.8 GOrilla
  A.9 GREAT
  A.10 GOBU
  A.11 GOFFA
  A.12 GeneMANIA
  A.13 GeneMerge
  A.14 GeneTools
  A.15 TermFinder
  A.16 GOTermMapper
  A.17 GoBean
  A.18 GraphWeb
  A.19 GoMiner
  A.20 NOA
  A.21 Onto-Express
  A.22 OE2GO
  A.23 PiNGO
  A.24 ProteInOn
  A.25 STEM
  A.26 StRAnGER
  A.27 ToppGene
  A.28 agriGO
B Package Documentation
  B.1 Sequence
    B.1.1 Sequence.java
    B.1.2 PWMScore.java
    B.1.3 RegExp.java
  B.2 stats
  B.3 go
  B.4 net
  B.5 taxonomy
  B.6 data
  B.7 bean
  B.8 controller

List of Figures

2.1 Flow diagram of the UniProtKB annotation process.
2.2 Screenshot from the software OBO-Edit, showing a small set of terms from the ontology.
5.1 Timeline for each stage.
5.2 10 runs readFastaFile.
5.3 Used heap after read fasta file.
5.4 10 runs RegExp.search.
5.5 dieGOS Logo
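
Illustrative sketch (not taken from the thesis): the abstract describes testing whether a motif is significantly associated with a species or a Gene Ontology term, and the outline names Fisher's exact test as one of the available tests. The Java fragment below shows one way a one-sided (enrichment) Fisher's exact test could be computed from a 2x2 contingency table. The class name FisherEnrichment, its method names, and the table layout are assumptions made for this illustration and are not the dieGOS code, which is not shown in this excerpt.

```java
// Hypothetical helper, not part of dieGOS: one-sided Fisher's exact test.
// Table layout (an assumption for this sketch):
//   a = proteins with the motif and in the category (species or GO term)
//   b = proteins with the motif, not in the category
//   c = proteins without the motif, in the category
//   d = proteins without the motif, not in the category
final class FisherEnrichment {
    private FisherEnrichment() {}

    // log(n!) by direct summation; adequate for a sketch, slow for very large n.
    public static double logFactorial(int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log(i);
        }
        return sum;
    }

    // log of the binomial coefficient C(n, k), for 0 <= k <= n.
    public static double logChoose(int n, int k) {
        return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
    }

    // Hypergeometric probability of exactly k category members among the
    // 'drawn' motif-containing proteins, given 'marked' category members
    // in a population of size 'total'.
    public static double hypergeometric(int k, int total, int marked, int drawn) {
        return Math.exp(logChoose(marked, k)
                + logChoose(total - marked, drawn - k)
                - logChoose(total, drawn));
    }

    // P(X >= a): probability of seeing at least 'a' category members among
    // the motif-containing proteins if motif and category were independent.
    public static double pValue(int a, int b, int c, int d) {
        int total = a + b + c + d;
        int marked = a + c;   // proteins in the category
        int drawn = a + b;    // proteins containing the motif
        int maxK = Math.min(marked, drawn);
        double p = 0.0;
        for (int k = a; k <= maxK; k++) {
            p += hypergeometric(k, total, marked, drawn);
        }
        return Math.min(1.0, p); // guard against rounding slightly above 1
    }

    public static void main(String[] args) {
        // Small worked table: a=3, b=1, c=1, d=3.
        System.out.println(FisherEnrichment.pValue(3, 1, 1, 3));
    }
}
```

For the table a=3, b=1, c=1, d=3 the tail sum is (C(4,3)·C(4,1) + C(4,4)·C(4,0)) / C(8,4) = 17/70, which gives a quick check of the arithmetic. Working with logarithms of factorials avoids overflow for the large protein counts a real proteome would involve; a production version would likely cache the log-factorials rather than recompute them.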
