Design of a Flexible, User Friendly Feature Matrix Generation System and Its Application on Biomedical Datasets
J Grid Computing (2020) 18:507–527
https://doi.org/10.1007/s10723-020-09518-y

M. Ghorbani · S. Swift · S. J. E. Taylor · A. M. Payne

Received: 26 June 2018 / Accepted: 23 March 2020 / Published online: 27 April 2020
© The Author(s) 2020

M. Ghorbani · S. Swift · S. J. E. Taylor · A. M. Payne (*)
Department of Computer Science, College of Engineering, Design and Physical Sciences, Brunel University London, Kingston Lane, Uxbridge, Middx UB8 3PH, UK
e-mail: [email protected]

Abstract The generation of a feature matrix is the first step in conducting machine learning analyses on complex data sets such as those containing DNA, RNA or protein sequences. These matrices contain information for each object which has to be identified using complex algorithms to interrogate the data. They are normally generated by combining the results of running such algorithms across various datasets from different and distributed data sources. Thus for non-computing experts the generation of such matrices proves a barrier to employing machine learning techniques. Further, since datasets are becoming larger, this barrier is augmented by the limitations of the single personal computer most often used by investigators to carry out such analyses. Here we propose a user friendly system to generate feature matrices in a way that is flexible, scalable and extendable. Additionally, by making use of the Berkeley Open Infrastructure for Network Computing (BOINC) software, the process can be speeded up using the distributed volunteer computing possible in most institutions. The system makes use of a combination of the Grid and Cloud User Support Environment (gUSE), combined with the Web Services Parallel Grid Runtime and Developer Environment Portal (WS-PGRADE), to create workflow-based science gateways that allow users to submit work to distributed computing resources. This report demonstrates the use of our proposed WS-PGRADE/gUSE BOINC system to identify features to populate matrices from very large DNA sequence data repositories; however, we propose that this system could be used to analyse a wide variety of feature sets including image, numerical and text data.

Keywords BOINC · Desktop grid · DNA sequence · Feature subset selection · Machine learning · High performance computing · WS-PGRADE · gUSE · DNA feature identification · Speedup

1 INTRODUCTION

Machine learning techniques have proved to be important tools in many research areas to aid knowledge discovery from complex data sets. Examples of their far-reaching impact and methods have been extensively reported [1–3]. Machine learning analysis, however, is preceded by the important stage of feature matrix generation, which selects the features to be analyzed from these data sets. In some cases these features can be simply a chosen subset of features in the data set, chosen using expert knowledge of the subject arena the data was collected from. Often, however, the features are generated by running algorithms across the data to draw out derived features or values not in the original data set. It follows therefore that the successful outcome of machine learning techniques is highly dependent upon the feature generation stage [4]. This adds an additional layer of complexity to an already difficult analysis for the non-expert.
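As an illustration of what a feature matrix of derived features looks like, the following is a minimal Python sketch and not the system described in this paper. It assumes each feature is produced by an interchangeable function that maps one DNA sequence to one number. The names gc_content, count_motif and build_feature_matrix, the TATA motif and the two toy sequences are all hypothetical and chosen only for the example.

```python
from typing import Callable, Dict, List

# Hypothetical feature functions: each maps one DNA sequence to one number.

def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def count_motif(motif: str) -> Callable[[str], float]:
    """Build a feature function counting non-overlapping occurrences of motif."""
    def feature(seq: str) -> float:
        return float(seq.upper().count(motif))
    return feature


def build_feature_matrix(
    sequences: Dict[str, str],
    features: Dict[str, Callable[[str], float]],
) -> List[List[float]]:
    """Return one row per sequence and one column per feature, in insertion order."""
    return [[fn(seq) for fn in features.values()] for seq in sequences.values()]


# Toy input data (illustrative only).
sequences = {"seq1": "ATGCGC", "seq2": "TATAAA"}
features = {"gc_content": gc_content, "tata_count": count_motif("TATA")}

matrix = build_feature_matrix(sequences, features)
for name, row in zip(sequences, matrix):
    print(name, row)
```

In this sketch, adding or removing a column of the matrix only requires adding or removing an entry in the features dictionary, which is the kind of interchangeability of feature generation steps that the text argues for.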
Table 1 Methods for grid enabling in BOINC

Method                                     BOINC API             DC-API             BOINC Wrapper                    GenWrapper             GBAC
Supported programming languages            C/C++/FORTRAN/Python  C/C++/Java/Python  Control-flow description in XML  POSIX shell scripting  None required
Legacy application                         No                    No                 Yes                              Yes                    Yes
Native application                         Yes                   Yes                Partial                          Partial                Partial
Application level checkpointing            Yes                   Yes                Partial                          Partial                Partial
Type                                       Native                Native             Native                           Native                 Virtualized
Requires client-side third party software  No                    No                 No                               No                     Yes (VirtualBox predeployed)

A significant amount of work has been achieved [5, 6] focusing on the algorithms for feature generation, ensuring that the features in the matrix are of high quality; however, we can find no work describing a system designed to place these algorithms into a kind of workflow that makes this stage flexible. Thus we noted a gap in the design of systems that allow the interchange of feature generation algorithms, so that if new or different features need to be added to the matrix this can be done with little impact on the original scheme. Currently feature generation is prepared de novo from first principles for each particular analysis in a project and there is no general-purpose flexible design for this stage of machine learning. We concluded that one of the best ways to do the feature generation is the use of workflow systems [7]. This enables the workflow design to be later reused or merged with other workflows. Also, if there is a need to focus on a particular class of features, the workflow system could be enabled to turn that part of the workflow off or on and see the resultant outcome. Current workflow systems include Nextflow (https://www.nextflow.io/), Snakemake [8] and Galaxy [9]; however, these systems are either complex to use, not as flexible as we required or, if they are easy to use, have not been implemented on a distributed system, which is important for the analysis of large datasets. To test our system we decided to use some of the most complex data sets currently available, namely those in the field of biology and biomedicine, with particular reference to molecular genetics. In choosing this field we were also driven to provide not only a flexible system but a very user friendly one, since most experts in this field have little computational expertise.

Today’s studies in biology and biomedicine involve ever more complex and larger data sets which the investigators wish to data mine for new discoveries and novel associations. This has been described as the fourth paradigm of data-intensive science [10, 11] or data-driven science, after the first three paradigms of computational, experimental and theoretical science. This data-driven science complements existing paradigms attempting to link knowledge with observations [12]. This new discipline requires new data management systems which are distributed in many locations. Distributed and cloud computing capacity are needed to cope with the enormity of the data [11]. Further, the data may be in different formats and the experimenter may wish to make use of many very different software packages to achieve the analysis required. Cross-architectural implementation of algorithms and systems is being attempted to analyse this data (e.g., Grid/Cloud [13–22]) but these efforts are only achievable with the help of computer specialists, e.g. the HUBzero [17] cyber grid infrastructure, or they are limited to workflows within a particular topic area, e.g. WeNMR for structural biology with access to a computing grid offered by the project partners [16]. This poses issues for the data domain experts as they try to grapple with user-unfriendly software and the limited computer processing capacity of their single computers if they choose to undertake bespoke analyses. The workflows we have developed to assist in these analyses try to map “routes” through the steps needed to analyze a particular data set in the way desired by the experimenter and assist in creating bespoke successions of processes that need to be carried out on a data set to analyze it. Further, they allow users to use volunteered distributed systems to harness greater processing power and to achieve practical run times [19, 20].

Fig. 1 Interaction between different BOINC processes, database processes and the server side components (shown in the rectangles) [29].

Many bioinformatics software packages and algorithms exist for the identification of biological features. Some of them are available as user friendly web-based software with a GUI, such as MEME, which identifies DNA motifs (meme-suite.org/), or those on the NCBI hub (https://www.ncbi.nlm.nih.gov/). However, the vast majority are only available as command line tools and so are inaccessible to the non-computationally trained biologist. This work details the design of a system that uses volunteer distributed computing to query and explore biological data using a user-friendly and form-based workflow as a way to make these packages and algorithms more accessible for the non-computer coding expert. The user can easily interact with the application like any other web-based application without the need to know the details of the workflow, making it easy to use by non-programmers.

The gUSE workflow portal was selected as the workflow management service as it can be connected to a diverse range of distributed computing infrastructures. An advantage of using gUSE is its application specific API (Application Program Interface), which enables the developer to develop a form-based interface for workflows; the user can therefore easily interact with the application without the need to know the details of the workflow. WS-PGRADE/gUSE [18] was chosen as it offers a complete, customizable web-based generic portal framework system to run applications. Different applications, each representing a function to be executed on the data, can be joined to create customized workflows. Details of the WS-PGRADE/gUSE workflow concept are described in Balasko (2014) [23].

Fig. 2 gUSE server architecture (diagram labels: stataggregator, wfs, storage, wspgrade, information, dci_bridge_service, Liferay-portal-6.1.0, gUSE database, Liferay database).

Fig. 3 The different software used in the client and server side of the system showing their interaction with each other (diagram labels: virtual machine, BOINC, Debian server, GenWrapper, gUSE, 3G Bridge, Banana, btwisted).

In our experience many biologists do not have access