Integration and Visualisation of Data in Bioinformatics
Total Page:16
File Type:pdf, Size:1020Kb
Integration and Visualisation of Data in Bioinformatics a dissertation presented by Gustavo Adolfo Salazar Orejuela to The Computational Biology Group Supervised By Prof. Nicola Mulder in partial fulfillment of the requirements for the degree of Doctor of Philosophy Universityin the of subject Cape of Town Bioinformatics University of Cape Town Cape Town, Western Cape, South Africa December 2015 The copyright of this thesis vests in the author. No quotation from it or information derived from it is to be published without full acknowledgement of the source. The thesis is to be used for private study or non- commercial research purposes only. Published by the University of Cape Town (UCT) in terms of the non-exclusive license granted to UCT by the author. University of Cape Town © 2015 - Gustavo Adolfo Salazar Orejuela All rights reserved. Thesis Supervisor: Prof. Nicola Mulder Gustavo Adolfo Salazar Orejuela Integration and Visualisation of Data in Bioinformatics Abstract The most recent advances in laboratory techniques aimed at observing and measuring biological processes are characterised by their ability to generate large amounts of data. The more data we gather, the greater the chance of finding clues to understand the systems of life. This, however, is only true if the methods that analyse the generated data are efficient, effective, and robust enough to overcome the challenges intrinsic to the management of big data. The computational tools designed to overcome these challenges should also take into account the requirements of current research. Science demands specialised knowledge for understanding the particularities of each study; in addition, it is seldom possible to describe a single observation without considering its relationship with other processes, entities or systems. This thesis explores two closely related fields: the integration and visualisation of biological data. We believe that these two branches of study are fundamental in the creation of scientific software tools that respond to the ever increasing needs of researchers. The distributed annotation system (DAS) is a community project that supports the integration of data from federated sources and its visualisation on web and stand-alone clients. We have extended the DAS protocol to improve its search capabilities and also to support feature annotation by the community. We have also collaborated on the implementation of MyDAS, a server to facilitate the publication of biological data following the DAS protocol, and contributed in the design of the protein DAS client called DASty. Furthermore, we have developed a tool called probeSearcher, which uses the DAS technology to facilitate the identification of microarray chips that include probes for regions on proteins of interest. Another community project in which we participated is BioJS, an open source library of visualisation components for biological data. This thesis includes a description of the project, our contributions to it and some developed components that are part of it. Finally, and most importantly, we combined several BioJS components over a modular architecture to create PINV, a web based visualiser of protein-protein interaction (PPI) networks, that takes advantage of the features of modern web technologies in order to explore PPI datasets on an almost ubiquitous platform (the web) and facilitates collaboration between scientific peers. This thesis includes a description of the design and development processes of PINV, as well as current use cases that have benefited from the tool and whose feedback has been the source of several improvements to PINV. Collectively, this thesis describes novel software tools that, by using modern web technologies, facilitates the integration, exploration and visualisation of biological data, which has the potential to contribute to our understanding of the systems of life. iii Contents 1 Introduction 1 1.1 Integration of information . 2 1.1.1 State of the art ................................... 2 Data warehousing ................................. 3 BioWarehouse............................... 3 BioDWH ................................. 4 BioMart.................................. 5 Federated databasing ............................... 7 The HUPOProteomics Standards Initiative . 7 Service oriented integration . 9 Sequence Retrieval System . 9 Taverna .................................. 10 Galaxy................................... 11 Semantic integration................................ 12 BioMOBY................................. 12 Bio2RDF ................................. 13 SADI.................................... 14 BioPAX .................................. 15 Wiki-based integration. 15 WikiPathways ............................... 15 WikiGenes................................. 16 1.1.2 The Distributed Annotation System. 17 DASProtocol ................................... 18 1.1.3 Discussion..................................... 21 1.2 Visualisation ........................................ 22 iv 1.2.1 Genomics ..................................... 24 Sequencing Process ................................ 24 Genome Browsing................................. 27 Comparative Genomics .............................. 30 1.2.2 Proteomics..................................... 33 Gel Based Proteomics ............................... 34 Mass Spectrometry Techniques . 36 PRIDEInspector ............................. 38 Rover ................................... 40 Microarrays .................................... 41 Bioconductor ............................... 41 Chipster.................................. 42 Protein Annotations................................ 44 PFAM................................... 44 InterPro.................................. 45 1.2.3 Protein Interactions ................................ 47 STRING...................................... 48 Cytoscape ..................................... 49 Web based PPItools................................ 52 1.2.4 Discussion..................................... 53 2 Integration of Information in Bioinformatics 55 2.1 MyDas ........................................... 57 2.1.1 Overview...................................... 57 2.1.2 Design and Implementation. 58 2.1.3 Instances...................................... 58 2.1.4 Tutorials...................................... 60 2.1.5 Other DASServers................................. 60 2.2 DASWriteback....................................... 63 2.2.1 Overview...................................... 63 2.2.2 DASas RESTful services. 64 2.2.3 Architecture .................................... 64 2.2.4 Protocol Extension................................. 66 2.2.5 Implementation .................................. 67 Server ....................................... 67 v Client ....................................... 68 CRUDfunctions.................................. 68 Create................................... 68 Read.................................... 69 Update................................... 69 Delete................................... 70 Other writeback functions. 70 2.3 DASClients......................................... 70 2.3.1 DASty ....................................... 71 Writeback plugin.................................. 73 2.3.2 probeSearcher ................................... 73 Implementation Details . 77 DASextension for advanced search . 77 2.4 Discussion ......................................... 78 3 Visualisation in Bioinformatics 81 3.1 BioJS: AJavaScript framework for Biological Web 2.0 Applications . 83 3.1.1 BioJS2.0...................................... 88 Develop ...................................... 90 Discover...................................... 90 Test ........................................ 91 Use......................................... 91 Combine...................................... 92 Customize..................................... 92 Extend....................................... 92 Maintain...................................... 93 3.1.2 Developed Components. 93 Chromosome Viewer ............................... 93 Protein-Protein Interactions Force Layout . 95 Protein-Protein Interactions Circle Layout . 98 Protein-Protein Interactions:Heat Map . 100 3.1.3 Discussion..................................... 101 3.2 PINV, a web-based Protein Interaction Network Visualiser . 102 3.2.1 Description of the application . 102 3.2.2 Architecture .................................... 108 vi Server ....................................... 109 Client ....................................... 110 3.2.3 Comparison with Similar Tools . 114 3.2.4 Spatial Clustering ................................. 116 High level algorithm. 116 Clustering with quadtrees . 118 Spatial Clustering in PINV. 120 3.3 Discussion ......................................... 122 4 Use Cases 124 4.1 Predicting and Analyzing Interactions between Mycobacterium tuberculosis and Its Human Host............................................. 125 4.1.1 Research description. 125 4.1.2 Impact on PINV.................................. 125 4.2 Orthologs in 3 Mycobacterial Organisms . 128 4.2.1 Research description. 128 4.2.2 Impact on PINV.................................. 129 4.3 Shotgun Analysis of Platelets From Patients Infected with the Dengue virus . 132 4.3.1 Research description. 132 4.3.2 Impact on PINV.................................. 135 4.4 Analysis of potential impact of SNPs on the human PPI network and gene expression . 136 4.4.1 Research description. 136 4.4.2 Impact on PINV.................................