Shared Bioinformatics Databases Within the Unipro UGENE Platform
Total Page:16
File Type:pdf, Size:1020Kb
Journal of Integrative Bioinformatics, 12(1):257, 2015 http://journal.imbio.de Shared Bioinformatics Databases within the Unipro UGENE Platform Ivan V. Protsyuk 1,2*, German A. Grekhov 1, Alexey V. Tiunov 1 and Mikhail Yu. Fursov 1 1Unipro Center for Information Technologies, 6/1 Lavrentieva Ave., 630090 Novosibirsk, Russia 2Department of Physics, Novosibirsk State University, 2 Pirogova St., 630090 Novosibirsk, Russia Summary Unipro UGENE is an open-source bioinformatics toolkit that integrates popular tools along with original instruments for molecular biologists within a unified user interface. Nowadays, most bioinformatics desktop applications, including UGENE, make use of a local data model while processing different types of data. Such an approach causes an (http://creativecommons.org/licenses/by-nc-nd/3.0/). inconvenience for scientists working cooperatively and relying on the same data. This refers to the need of making multiple copies of certain files for every workplace and License maintaining synchronization between them in case of modifications. Therefore, we focused on delivering a collaborative work into the UGENE user experience. Currently, several UGENE installations can be connected to a designated shared database and users can interact with it simultaneously. Such databases can be created by UGENE users and be used at their discretion. Objects of each data type, supported by UGENE such as sequences, annotations, multiple alignments, etc., can now be easily imported from or exported to a remote storage. One of the main advantages of this system, compared to existing ones, is the almost simultaneous access of client applications to shared data regardless of their volume. Moreover, the system is capable of storing millions of objects. The storage itself is a regular database server so even an inexpert user is able to deploy it. Thus, UGENE may provide access to shared data for users located, for example, in the same laboratory or institution. UGENE is available at: http://ugene.net/download.html. 1 Introduction One of the major challenges in bioinformatics is the development of methods and algorithms for processing large data arrays that appear as a result of biological sequencing experiments. Average information flows produced by sequencing constitute hundreds of gigabytes per day; therefore, the issue of their reasonable storing is especially problematic. In order to solve the major part of biological data processing tasks, researchers utilize desktop bioinformatics applications offering required means and algorithms that provide an automated analysis of this information. However, these applications normally use a local data model, i.e. allow the processing of the information that is located immediately on a user’s computer only. At the same time, today’s scientific activity often requires many scientists to analyse the same data The Author(s). Published by Journal of Integrative Bioinformatics. simultaneously. The use of desktop applications causes certain problems in this case. Firstly, 2015 the information needed is to be copied to every workplace, which can be a time-consuming task in itself due to large data volumes. Secondly, if one of several users modifies shared data, these changes have to be distributed to the other users, and this, again, requires multiple copy Copyright This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported operations. However, several proprietary bioinformatics applications offer some solutions for the collaborative biological data access problem, and they will be discussed in further detail in Section 2. * To whom correspondence should be addressed. Email: [email protected] doi:10.2390/biecoll-jib-2015-257 1 Journal of Integrative Bioinformatics, 12(1):257, 2015 http://journal.imbio.de The UGENE platform [1], within which this study was performed, represents one of the most commonly used software packages for carrying out various types of biological data analyses. UGENE provides a wide range of features including different types of visualization and workflow data processing. It is available on different operating systems since UGENE is being developed mainly with the use of the cross-platform Qt framework [1]. It is also provided free of charge, and its source code is distributed under the terms of GNU General Public License, v. 2.0 [3]. The aim of this study is to develop the shared biological data storage system within the UGENE platform. This will allow regular UGENE users to create their own shared databases for providing access to their genomic datasets for other scientists. Reaching this goal implies not only solving the collaborative data access problem, but also integrating the storage with visualization capabilities and algorithms available in UGENE. Eventually, it will help users save time. In this work, the general architecture of the shared biological data storage, offered by the Geneious package [4], was used along with the concept of a public database server. At the same time, the approach to storing and arranging heterogeneous data was completely rethought. Certain data models were developed for each type of biological information (http://creativecommons.org/licenses/by-nc-nd/3.0/). supported by the UGENE platform such as sequences, multiple sequence alignments, genome assemblies, etc. In analogous systems, these pieces of data are serialized and stored uniformly License (in file systems or databases) and loaded entirely to the client side, which is not applicable in case of large data volumes. Instead, the common idea of the data models, developed in this study for different types of biological data, involves the deep separation of the data and their loading to the client side piecemeal. Such an approach results in a shortened response time of the client application as well as a reduced network traffic in the system. The paper is organized as follows. Section 2 gives an overview of related work. Details of the methods and approaches used in this work are outlined in Section 3. Results and discussion are addressed in Section 4. Finally, Section 5 summarizes the achievements of this study, the work currently in progress, and future plans. 2 Related work Existing desktop systems for shared biological information storage generally realize two approaches. The shared database, and the shared data server. The first approach requires a public database server to be deployed by a user. Then, client applications can connect to it and store the information that, afterwards, becomes available for all the users of the system who are connected to that database. The shared database approach is then implemented, for example, in the Geneious application, developed by the BioMatters Company [4]. Then there is the shared data server approach which, besides database server, also includes the deployment of a web-server offered by the vendor of the system. In this case, all the requests to the database pass through the web-server. Thereby, the whole storage, in contrast to a stand-alone database server, becomes active, i.e. not only can it respond to client requests, but The Author(s). Published by Journal of Integrative Bioinformatics. it also can send notifications to clients. Besides, it is not necessary to use a database as a 2015 storage engine. An ordinary file system suits this purpose well, since all incoming requests are processed by the web-server anyway. Another advantage is that the system can be accessed via a web-browser, as well as the client desktop application. This point increases the usability Copyright This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported of the storage system. On the other hand, a web-server installation is a significantly more complicated procedure than a database server installation and it requires a qualified IT specialist to get involved. The shared data server technique is used by the CLC Bioinformatics Database application, offered by the CLC bio Company [5]. doi:10.2390/biecoll-jib-2015-257 2 Journal of Integrative Bioinformatics, 12(1):257, 2015 http://journal.imbio.de However, existing software leverages a unified approach to the data storage. Namely, all the biological objects (sequences, multiple alignments, etc.) are immediately stored in database records or in a server’s file system as serialized data. Such an implementation requires a client to completely download data to the local host in case of any need of access to them, because the serialized state does not enable a fast information search. Therefore, this approach increases system response times when large objects, occupying several gigabytes, are involved. That is why within this study, we decided to implement the data storage in a database, using the structured data separation according to the constitution of each data type and typical operations applied to it. Such a method increases the upload time of information to the database, since additional ordering needs to be done, but reduces the time needed for subsequent access to it. Thereby, each client, accessing the storage, does not have to download a DNA sequence with the size of one gigabyte to his/her computer in order to visualize its region with a length of 1,000 nucleotides. Instead, one has only to download 1 KB of the information that describes that region. This simple example shows that, in case of large data arrays, the method of their separation and internal structuring can generally be more effective from the user’s point of view than the unified storage approach. 3 Methods and approaches (http://creativecommons.org/licenses/by-nc-nd/3.0/). 3.1 Initial assumptions License In this work, the shared system was implemented using the approach of the shared database, described in the previous part. It enables even a non-technically educated user to deploy the storage for one’s own purposes within a laboratory. Besides the simplicity from the user’s point of view, the shared database approach has no essential drawbacks, compared to the shared data server, when it comes to the problem of collaborative data access.