
Journal of Integrative Bioinformatics, 12(1):257, 2015. http://journal.imbio.de
doi:10.2390/biecoll-jib-2015-257

Shared Bioinformatics Databases within the Unipro UGENE Platform

Ivan V. Protsyuk*, German A. Grekhov, Alexey V. Tiunov, and Mikhail Yu. Fursov

1 Unipro Center for Information Technologies, Lavrentieva Ave. 6/1, 630090 Novosibirsk, Russia
2 Department of Physics, Novosibirsk State University, Pirogova St. 2, 630090 Novosibirsk, Russia

* To whom correspondence should be addressed. Email: [email protected]

Copyright 2015 The Author(s). Published by Journal of Integrative Bioinformatics. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).

Summary

Unipro UGENE is an open-source bioinformatics toolkit that integrates popular tools along with original instruments for molecular biologists within a unified user interface. Nowadays most bioinformatics desktop applications, including UGENE, make use of a local data model while processing different types of data. Such an approach causes an inconvenience for scientists working cooperatively and relying on the same data. This refers to the need of making multiple copies of certain files for every workplace and maintaining synchronization between them in case of modifications. Therefore, we focused on delivering collaborative work into the UGENE user experience. Currently, several UGENE installations can be connected to a designated shared database and users can interact with it simultaneously. Such databases can be created by UGENE users and be used at their discretion. Objects of each data type supported by UGENE, such as sequences, annotations, multiple alignments, etc., can now be easily imported from or exported to a remote storage. One of the main advantages of this system, compared to existing ones, is the almost simultaneous access of client applications to shared data regardless of their volume. Moreover, the system is capable of storing millions of objects. The storage itself is a regular database server, so even an inexpert user is able to deploy it. Thus, UGENE may provide access to shared data for users located, for example, in the same laboratory or institution. UGENE is available at: http://ugene.net/download.html

1 Introduction

One of the major challenges in bioinformatics is the development of methods and algorithms for processing large data arrays that appear as a result of biological experiments. Average information flows produced by sequencing constitute hundreds of gigabytes per day; therefore, a major part of biological data processing tasks requires applications offering the required algorithms and means that provide an automated analysis of this information. At the same time, today's scientific activity often requires many scientists to analyse the same data simultaneously.

Nowadays researchers utilize desktop bioinformatics applications. However, these applications normally use a local data model, i.e. allow processing of the information that is located immediately on a user's computer only. The use of desktop applications causes certain problems in this case. Firstly, the information needed is to be copied to every workplace, which can be a time-consuming task in itself due to large data volumes. Secondly, if one of several users modifies shared data, these changes have to be distributed to the other users, and this, again, requires multiple copy operations. Several proprietary bioinformatics applications offer some solutions for the problem of collaborative biological data access, and they will be discussed in further detail in Section 2.

The UGENE platform [1], within which this study was performed, represents one of the most commonly used packages for carrying out various types of biological data analyses. UGENE includes a wide range of features including different types of visualization. It is cross-platform, i.e. available on different operating systems, since UGENE relies on the Qt framework. Its source code is provided free of charge and distributed under the terms of the GNU General Public License v2.0 [3].

The aim of this study is to develop the shared biological data storage within the UGENE platform, not only solving the collaborative data access problem, but also integrating the storage with the visualization capabilities and algorithms available in UGENE. This will allow regular UGENE users to create their own shared databases for providing access to their datasets for other scientists. Eventually, it will help users save time.

The paper is organized as follows. Section 2 gives an overview of related work. Details of the approaches and methods used in this work are outlined in Section 3. Results and discussion are addressed in Section 4. Finally, Section 5 summarizes the achievements of this study, the work currently in progress, and future plans.

2 Related work

Existing desktop systems for shared biological information storage generally realize two approaches: the shared database and the shared data server. The first approach involves a public database server to be deployed by a user. Then, client applications can connect to it and store their data there. The shared database approach is implemented, for example, in the Geneious package [4] developed by the BioMatters Company.

The shared data server approach, besides the database server, also requires the deployment of a stand-alone server through which all the requests to the storage pass. In this case it is not necessary to use a database as a storage engine, since all incoming requests are processed by the server; ordinary files suit this purpose as well. Such a server becomes active, i.e. it can not only respond to client requests but also send notifications to clients. Another advantage is that the system can be accessed via a web-browser. On the other hand, server installation is a significantly more complicated procedure than database deployment, and it requires a qualified IT specialist to get the whole system running. The shared data server technique is used by the CLC Bioinformatics Database application, offered by the CLC bio Company [5].

In this work, the shared database approach was used along with the concept of a public database server. At the same time, the approach to storing and arranging heterogeneous data was completely rethought. Certain data models were developed for each type of biological information supported by the UGENE platform (sequences, multiple alignments, genome assemblies, etc.). In analogous systems, these pieces of data are serialized and stored uniformly (in file systems or databases) and entirely loaded to the client side, which is not applicable in case of large data volumes. Instead, the common idea of the developed data models is the deep separation of the data and their piecemeal loading to the client side. Such an approach results in a reduced network traffic in the system and a shortened response time of the client application as well as of the whole system.

From the user's point of view, the shared database approach has no essential drawbacks. If necessary, this system may be extended with a web server in the future in order to provide access to it via a web-browser. That is why the shared storage was implemented using the approach of the shared database.

However, existing software leverages a unified approach to the data storage. Namely, all the biological objects (sequences, multiple alignments, etc.) are stored in the database or in a server's file system as serialized records and are downloaded to the local host in case of any need of access to them. The serialized state does not enable a fast information search. Therefore, this approach increases response times when large objects, occupying several gigabytes, are involved. For example, a user needs to download a DNA sequence with the size of one gigabyte to his/her computer in order to visualize its region with the length of 1,000 nucleotides. With a structured storage, each client accessing the storage does not have to download the whole sequence: it receives only the 1 KB of data referenced by that region. This simple example shows that effective processing of large data arrays requires their internal structuring and separation according to the constitution of each data type and the typical operations applied to it. Such a method increases the time needed to upload information to the database, since additional ordering needs to be done, but reduces the response time when the data are accessed. For these reasons, the data storage devised within this study is more effective than the unified storage approach when it comes to the problem of collaborative data access.

3 Methods and approaches

3.1 Initial assumptions

To enable the structured storage described in the previous part, we chose a relational database as an underlying infrastructure of the shared storage. The most obvious reason for this is that relational databases are well known and widely used, compared to NoSQL ones, in all areas of database applications. Therefore, using this approach, our shared storage is assumed to pose fewer difficulties during deployment and maintenance, including tuning and administration of the database server itself.

On the other hand, it is evident that modern NoSQL solutions can prove better horizontal scalability, fault tolerance and lower costs of the hardware nodes constituting the computational infrastructure than their conventional SQL counterparts. Horizontal scalability, in turn, can significantly reduce response times. However, the use of such database providers is still quite rare in bioinformatics systems, so it can be connected with additional challenges and development costs. Besides, the considerable performance growth, compared to relational databases, is usually achieved only when several computational nodes are involved into a NoSQL system, whereas a single computer equipped with large capacity hard disks is usually sufficient for the data storage in the majority of laboratories. For these reasons, we decided to develop relational data models instead of NoSQL ones.

Nonetheless, the architecture of the client application abstracts from the implementation of a particular underlying database. Instead, we have designed certain interfaces that describe the interaction between a client and the shared storage. Thereby, if necessary, it is possible to develop in the future an implementation of these interfaces that is based on a NoSQL database. The first step towards the relational approach is done in this work: the shared storage was implemented using the MySQL DBMS [6].

3.2 Storage of different data types

In this section we consider the software architecture of the system in detail. Originally, the UGENE package includes the entities of a document and an object. The document represents some data resource and contains a list of objects. Objects, in turn, represent biological data of a certain type. A diagram describing relations between these entities is depicted on the Figure 1. The key entity here is "Document" containing a set of "Object" instances that can belong to different data types. For illustration purposes, Figure 1 shows a few implementations of the object entity for biological sequences, multiple sequence alignments and annotations. In addition, there is a "DocumentFormat" entity that determines a document format (see "DocumentFormat" on the Figure 1), i.e. it initializes the document and fills it with objects. It encapsulates low-level input/output operations. Different document formats exist for various file formats and databases.

Figure 1: UGENE data sources diagram.

We decided to use these abstractions within the shared storage, reusing the infrastructure developed for document handling. In case of the shared database, we implement one more specialization of "DocumentFormat" (see "DatabaseConnectionFormat" on the Figure 1) that creates a document based on an existing database. Thus, from the user's point of view, a shared database looks like a usual file from a local computer, containing a number of objects. At the same time, it is not a resource where the objects are stored entirely. Each object belonging to a document has some reference (see "EntityRef" on the Figure 1) to the database where its content is stored. In fact, it is a numerical identifier, so each object has a reference to particular records in the database.

Therefore, during the "DocumentFormat" initialization procedure, a client application can download a shallow copy of a shared database that includes only the text name and numerical identifier of each object. This approach allows creating a document of a small size that does not depend on the amount of physical space occupied by the database. It is also important that such a procedure is not time-consuming even for a large set of objects. The content of an object is downloaded from the database only upon a user's request to the corresponding object, such as visualization or saving to a local file. All further data are downloaded only to be displayed for a user.
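The shallow-copy idea above (a document holds only names and numerical identifiers, and content is fetched on demand) can be illustrated with a minimal sketch. It is not UGENE code: the class names `EntityRef` and `SharedObject`, the function `open_shallow_document` and the in-memory "server" are hypothetical stand-ins chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class EntityRef:
    """Hypothetical stand-in for the "EntityRef" entity: a numerical record id."""
    object_id: int


@dataclass
class SharedObject:
    name: str
    ref: EntityRef
    _content: Optional[bytes] = None  # filled in lazily, on first access

    def content(self, fetch: Callable[[int], bytes]) -> bytes:
        # Download the payload only when it is actually requested
        # (e.g. for visualization or saving to a local file).
        if self._content is None:
            self._content = fetch(self.ref.object_id)
        return self._content


def open_shallow_document(listing: Dict[int, str]) -> List[SharedObject]:
    """Build a document from (id, name) pairs only; no payload is transferred."""
    return [SharedObject(name, EntityRef(oid)) for oid, name in listing.items()]


# Toy "server": payloads keyed by object id, plus a log of performed fetches.
PAYLOADS = {1: b"ACGT" * 4, 2: b"TTGACA"}
fetch_log: List[int] = []


def fetch_payload(object_id: int) -> bytes:
    fetch_log.append(object_id)
    return PAYLOADS[object_id]


doc = open_shallow_document({1: "chr1", 2: "promoter"})
```

In this sketch, opening the document costs only the listing of names and identifiers; a payload crosses the "network" the first time `content()` is called on an object and is reused afterwards.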

Let us now consider the data models used for storing different types of biological information. Within this study, all the types of biological information supported by the UGENE platform were separated into two categories: "simple" and "complex" data types. The following types were referred to the simple ones: phylogenetic trees, chromatograms, tertiary structures, weight matrices, frequency matrices and text files. Here, "text files" refer to comments or remarks made by biologists or generated by sequencing hardware. Objects of these types have relatively small sizes in the overwhelming majority of cases; they are called "simple" in further. The complex data types include sequences, annotations, multiple sequence alignments, genome assemblies and genetic variations. Any of these data types has a substantial size (~1 GB) or a complex internal structure.

Now let us consider directly the methods of storing different types of biological data in the database. First of all, it is substantial that not all of them may principally have large sizes. Some of them do not exceed several megabytes and are usually used entirely. Therefore, they can be stored similarly to how it is done in analogous software, namely, in a single relation in serialized form. We decided to store all the simple data types as serialized data in BLOB fields of database relations. In case of access from the client side, such small pieces of data can be transferred over the network entirely, and this approach does not cause problems with the performance of the system. Serialization is achieved by conversion of an object's data to its binary representation. Particular techniques of serialization are out of the scope of this paper; however, they can be found in the public source code repository of the UGENE project.

For each of the other object types a special data model needs to be developed that permits their separation into relatively small pieces for downloading them to the client side upon request.

For such a complex type as a sequence, BLOB fields have also been used; however, an entire sequence is not stored in a single field. Instead, it is uniformly separated into parts of a bounded size, each of them represented by a single record in the database. On the Figure 2, the relations "Sequence" and "SequenceData" are intended for the sequence storage. "SequenceData" is the very relation that contains the sequence chunks along with their coordinates within a sequence. The "Sequence" relation, in turn, keeps general information about a sequence, such as its length (for caching purposes), an alphabet and a flag indicating whether the sequence is circular or not. Accordingly, when a client accesses a sequence, it specifies the particular region of the sequence, and a selection is done in the database that includes the records intersecting that region.

Large chunk sizes can bring about additional memory consumption of a database server, while small ones increase the average time of data import to the database, since more queries have to be issued to the database. We decided to keep a relatively small chunk size in order not to deteriorate the performance of databases run on commodity computers. Additional research is needed to determine an optimal chunk size for the sequence storage. Obviously, it depends on the hardware configuration of the database server where the partition is located. Probably, future UGENE versions will allow advanced users to tune the sequence partitioning or will offer certain means for automatic detection of the optimal partitioning.
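The chunked sequence storage and the region query described above can be sketched as follows. This is an illustration under stated assumptions, not the UGENE schema: SQLite from the Python standard library stands in for MySQL, the column names `sstart`, `send` and `data` are hypothetical, and `CHUNK_SIZE` is a deliberately tiny value chosen so the example is easy to follow.

```python
import sqlite3

CHUNK_SIZE = 8  # illustrative only; the real chunk size is a tuning parameter

db = sqlite3.connect(":memory:")  # SQLite stands in for the MySQL server
db.executescript("""
CREATE TABLE Sequence(id INTEGER PRIMARY KEY, length INTEGER,
                      alphabet TEXT, circular INTEGER);
CREATE TABLE SequenceData(sequence INTEGER, sstart INTEGER,
                          send INTEGER, data BLOB);
CREATE INDEX SequenceData_idx ON SequenceData(sequence, sstart);
""")


def store_sequence(seq_id: int, data: bytes) -> None:
    """Split the sequence into fixed-size chunks, one record per chunk."""
    db.execute("INSERT INTO Sequence VALUES (?, ?, ?, ?)",
               (seq_id, len(data), "DNA", 0))
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        db.execute("INSERT INTO SequenceData VALUES (?, ?, ?, ?)",
                   (seq_id, start, start + len(chunk), chunk))


def read_region(seq_id: int, start: int, end: int) -> bytes:
    """Fetch only the chunks intersecting [start, end) and trim them."""
    rows = db.execute(
        "SELECT sstart, data FROM SequenceData "
        "WHERE sequence = ? AND sstart < ? AND send > ? ORDER BY sstart",
        (seq_id, end, start),
    ).fetchall()
    if not rows:
        return b""
    joined = b"".join(bytes(chunk) for _, chunk in rows)
    offset = start - rows[0][0]  # position of `start` inside the first chunk
    return joined[offset:offset + (end - start)]


sequence = b"ACGTACGTTTGGCCAAGATTACA"
store_sequence(1, sequence)
region = read_region(1, 5, 18)
```

The trade-off mentioned in the text is visible here: a smaller `CHUNK_SIZE` means more `INSERT` statements on import, while a larger one means more bytes read and trimmed per region query.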

Figure 2: ER-diagram of the shared data storage for "complex" data types. The central entity here is the "Object" relation. It serves for the storage of common attributes of biological objects including name and data type. Depending on a particular object type, its data are stored in different relations. For example, sequence objects are stored in one way, whereas alignments or genome assemblies are considered to be separate objects and are stored in another way. In addition, there are relations for arbitrary key-value attributes of any object type (for example, an accession number for a sequence): a common "Attribute" relation and a few value type-dependent relations that keep values for each record from the "Attribute" relation.

Such compound data types as multiple alignments or genome assemblies are effectively represented by a set of sequences, so the approach described above was leveraged for them. However, there are additional data structures, and the storage of sequences was specific for each of these types. Namely, sequences of a multiple alignment are stored the same way as the independent sequences described above. Nevertheless, alignments usually contain gaps inside, so supplementary data structures, describing relations between sequences and gaps, were devised. The database relations "Msa", "MsaRow" and "MsaRowGap" are involved in the multiple alignment storage. A multiple alignment is represented by an ordered list of sequences. A single record in the "Msa" relation holds general information about the alignment, such as its total length, the number of sequences and their alphabet. Aligned sequences, in turn, are described by records in the "MsaRow" relation, where each record references both the "Msa" and "Sequence" relations and determines absolute coordinates of a sequence in an alignment both vertically (the "pos" field), i.e. the number of a sequence in the list, and horizontally (the "gstart" and "gend" fields), i.e. the sequence start and end positions within an alignment. Finally, alignment gaps are stored in the "MsaRowGap" relation, where each record references the "MsaRow" relation and keeps absolute gap start and gap end positions for a particular alignment row.

On the other hand, genome assemblies usually occupy much more space than multiple alignments, so another method is required for their storage that could support fast lookups for visualization. In this work, we decided to divide the short reads composing an assembly into groups according to their length. The boundaries of the short read length are the following: 50, 100, 200, 400, 800, 4∙10^3, 2.5∙10^4, 10^5, 5∙10^5 and 2∙10^6.
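The grouping of short reads by length can be sketched with a binary search over the boundary list given in the text. Two details are assumptions made for this illustration only: each boundary is treated as an inclusive upper limit of its group, and reads longer than the last boundary fall into one extra group.

```python
from bisect import bisect_left
from typing import Dict, Iterable, List

# Length boundaries quoted in the text.
BOUNDARIES = [50, 100, 200, 400, 800, 4_000, 25_000, 100_000, 500_000, 2_000_000]


def read_group(read_length: int) -> int:
    """Index of the first boundary >= read_length.

    Assumption: boundaries are inclusive upper limits; the value
    len(BOUNDARIES) denotes reads longer than every boundary.
    """
    return bisect_left(BOUNDARIES, read_length)


def group_reads(lengths: Iterable[int]) -> Dict[int, List[int]]:
    """Distribute read lengths over groups; each group maps to its own relation."""
    groups: Dict[int, List[int]] = {}
    for n in lengths:
        groups.setdefault(read_group(n), []).append(n)
    return groups


groups = group_reads([36, 50, 51, 150, 900, 3_000_000])
```

Each group index would correspond to one isolated relation in the database, so a query for a visualized region only has to touch the relations whose length range can intersect it.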

Copyright 2015 The Author(s). Published by Journal of Integrative Bioinformatics. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/). aaae tables tableare annotation an (this joinedinto termnot connected is withdatabase attr emphasized region c now us Let Inregion. a addition, sizerelation total of visualization relation database single a within storage read short than rendering assembly of case in performance better proves reads. group stored Each is an within is overall of part asmall constitute that reads long relatively than scale a lower at separated of small sizes, are reads, during sequencing, caused short obtained is overwhelming of majority bythatthe decided the in database store to annotations as n- site annotations thatrefer annotations are incorporatedgroups into according their to biological meaning. example, For UGENE of architecture software the Within sequence. certain a annotate all they that means symbol. The relation “FolderContent” absolute path “Folder” the in 2). record Each locations of different objectsstored relation is in first introduced. computers. objects a in shared database folders in from the UGENE GUI, is as it done ondesktop ordering. developed We native of data a mean work provides this additionally the The developedwithin data model respectively. one amutated and sequence reference a containing columns “obsData” and “refData” variation Variation sequence. particular from therecords Figure 2holds of variation representing tracks, a variant set i.e. one that are annotations beca Genomic variations usuallygroups lessthan constitute 0.1% of allrelation because acceptable only for annotation storage for and annotation used aregroups. 
not This lengthrespectively, and its sense make annotated start position an region or “len”representing table aan within particular annotation annotation selection athat contains referencesroot node for to performance each of node. tree improves This the other group or a top- annotation has an annotation group as a parent annotation groups.Its “parent”references the columncontains to Figure annotation qualifiers 2) stores table doi:10.2390/biecoll-jib-2015-257 Journal ibutes represented by a set of key of a set by represented ibutes

Consider the annotation storage in the database. We treat an annotation as a region or a group of regions of a biological sequence, used to describe, for example, the regions coding a gene or containing a transcription factor binding site. Certain annotations carry key-value pairs, so called annotation qualifiers. Annotations can belong to different groups. Groups can be nested as well. Due to such a structure, annotations are stored as trees. Each tree has a dummy root node that is unique for each tree. An annotation or an annotation group is a node that has a parent node in the tree. The parent for an annotation group is either another group or the root group. For caching purposes, the “root” column is introduced in the “Feature” relation. The “FeatureKey” relation comprises annotation qualifiers. Links between an object and the “Feature” relation are provided as well.

Variations describe mutations in a genome. Basically, they are simplified annotations: they always correspond to a single sequence region. Additionally, they are not incorporated into groups and lack qualifiers. Accordingly, an approach similar to the one that was advised for annotations was applied to store them. The “VariantTrack” relation accumulates variations, and the variations themselves are stored in the “Variant” relation. For variation data, a few indexes are built on the columns of coordinates (the “startPos” and “endPos” columns).

For short reads, the overhead for the relation is smaller. The reason is that we index only short read start positions and, during processing, use only the reads from the visualized assembly region. Normally, short reads are no more than a couple of hundred base pairs long.

The information about folders is stored in the relations “Folder” and “FolderContent” (see Figure). Folders are an analog of a conventional file system directory tree for arranging the objects of the database. Each record of the “Folder” relation corresponds to a single folder and contains its path that is, in fact, a merged set of parent folder names separated by a special character. For the convenience of users, a special predefined folder “Recycle bin” was introduced. Folders and other objects deleted by a user appear in “Recycle bin”, and only after that can they be removed permanently.
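The folder model described above can be illustrated with a minimal sketch. It assumes SQLite as the storage engine, `/` as the special separator, and hypothetical column names (`path`, `folder`, `object`) and helper functions; only the relation names “Folder” and “FolderContent” and the “Recycle bin” folder come from the text, so this is not the actual UGENE schema.

```python
import sqlite3

SEP = "/"                          # assumed separator; the paper only says "special"
RECYCLE_BIN = SEP + "Recycle bin"  # predefined folder named in the text


def open_db():
    """Create an in-memory database with a hypothetical folder schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Folder (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
        CREATE TABLE Object (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE FolderContent (
            folder INTEGER NOT NULL REFERENCES Folder(id),
            object INTEGER NOT NULL REFERENCES Object(id));
    """)
    db.execute("INSERT INTO Folder (path) VALUES (?)", (RECYCLE_BIN,))
    return db


def folder_id(db, path):
    return db.execute("SELECT id FROM Folder WHERE path = ?", (path,)).fetchone()[0]


def add_folder(db, path):
    # The path already holds all parent folder names merged with SEP.
    return db.execute("INSERT INTO Folder (path) VALUES (?)", (path,)).lastrowid


def add_object(db, path, name):
    obj = db.execute("INSERT INTO Object (name) VALUES (?)", (name,)).lastrowid
    db.execute("INSERT INTO FolderContent (folder, object) VALUES (?, ?)",
               (folder_id(db, path), obj))
    return obj


def list_folder(db, path):
    rows = db.execute(
        "SELECT o.name FROM Object o "
        "JOIN FolderContent fc ON fc.object = o.id "
        "WHERE fc.folder = ? ORDER BY o.name", (folder_id(db, path),))
    return [r[0] for r in rows]


def subfolders(db, path):
    # All descendants share the parent's path as a prefix: no recursion needed.
    prefix = path + SEP
    rows = db.execute(
        "SELECT path FROM Folder WHERE substr(path, 1, ?) = ? ORDER BY path",
        (len(prefix), prefix))
    return [r[0] for r in rows]


def delete_object(db, obj):
    """First deletion moves the object to the recycle bin; the second removes it."""
    bin_id = folder_id(db, RECYCLE_BIN)
    current = db.execute("SELECT folder FROM FolderContent WHERE object = ?",
                         (obj,)).fetchone()[0]
    if current != bin_id:
        db.execute("UPDATE FolderContent SET folder = ? WHERE object = ?",
                   (bin_id, obj))
    else:
        db.execute("DELETE FROM FolderContent WHERE object = ?", (obj,))
        db.execute("DELETE FROM Object WHERE id = ?", (obj,))
```

Storing the merged path in every folder record lets a client find all nested folders with a single prefix comparison, at the cost of rewriting the paths of descendants when a folder is renamed.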

Copyright 2015 The Author(s). Published by Journal of Integrative Bioinformatics. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).

The graphical representation for exploring and filtering shared database content has been developed and integrated into the existing GUI. All the data uploaded to the system become available for reading and writing to other clients connected to it via the UGENE package. Besides, client applications connected to a particular database check periodically if other users have changed the database content somehow, and update the view appropriately.

Additionally, shared database capabilities were integrated with the UGENE Workflow Designer, a workflow management system that is a part of the UGENE platform [1]. Namely, a workflow can use shared data as input, and computation results can be stored to a shared database. Since the Workflow Designer is a desktop tool, it worked only with local files before shared databases were introduced. Currently workflows are more versatile: they can process both local files and remote data from shared databases simultaneously. Different writing elements within the same workflow can store workflow results to a local file system as well as to shared databases.

4 Results

During initial tests of the developed system, large datasets were uploaded to it. The first one has a size of about 40 GB in plain-file form and contains objects that include sequences and corresponding annotations, and protein tertiary structures. The second one is about 280 GB. Single objects of large size have been supported as well: for example, a genome assembly of the Denisovan [9], comprising about 40 GB of short reads, has been successfully uploaded to the database.

As an example of a shared database instance, we have deployed a public database server with read-only access available for every UGENE user [10]. This public shared database instance was installed and made available for the life science community with predefined authentication data. It contains objects from such public databases, widely used in the scientific community, as PDB [7] and GenBank [8], to name a few. There are chromosome sequences of mouse (mm9), human (hg19) and drosophila (dm3) genome assemblies, as well as sequences downloaded from the UniVec database [11]. All sequences have attached text files with particular links to their original locations in public sources. A detailed use case for connection and interaction with the public database can be found in the UGENE documentation [10].

Note that since the public storage server is physically located in Germany, the interaction with the public database goes on slightly slower than it does with a local database instance. A delay during data visualization is about 0.5 seconds, depending on a client's geographical location and the network path used for the connection. The delay totals tenths of a second in a local environment. This is mainly caused by a longer round-trip time between the client and the server, especially in case of fragmentation of server response packages, i.e. when their size exceeds the MTU (maximum transmission unit), which deteriorates transmission speed.

5 Conclusion

As a result of this study, a shared system for storing biological information has been designed and implemented. The shared database functionality is included into regular UGENE packages, and this allows every UGENE user to deploy one’s own instance of a shared database and use it for data storage and for sharing data with other users. A user manual describing the procedures of deployment and configuring of the storage, and of storing and retrieving data, has been published online [12].

The first version of UGENE supporting shared databases is 1.14.0, and it was released on August 1, 2014. The bottom line is that the system is capable of storing millions of objects belonging to various data types. This makes it possible to use the UGENE package for shared data management within a biological laboratory or research institute. The data models devised in this study cover all the data types and manipulations supported by the UGENE platform.

However, other projects aimed at the development of a relational data model for genomic data exist, for example, GMOD Chado [13], BioSQL [14] and BioMart [15]. They may provide access to specific data types that are not supported by UGENE, and interfaces to such external systems can be integrated into future UGENE versions in order to achieve interoperability with existing data storages. Two main issues need to be solved first for this. The first one is the identification of entities that are the same in the platforms considered in Section 2. Since certain biological data sources may provide their data in a completely different view, certain workarounds and trade-offs have to be devised for each of the external databases. The second problem is that existing systems do not directly support partial sequence data fetching, unlike the data storage described in this paper. The main reason for this is that the whole sequence is stored in a single database record in the existing schemas. The database streaming technique can help overcome this impediment, but it is not supported natively by all database providers, for example, MySQL. Moreover, the use of streaming makes database server performance a matter of concern in case of large sequences, depending on the number of objects involved and the amount of memory of the server, because streamed data can be completely loaded to the server memory. This is a potential bottleneck for a shared system with several users working simultaneously.

Presently, certain objects in a shared database are read-only, i.e. they cannot be modified in the same way as local data. For example, changing a multiple alignment by inserting gaps cannot be done for a shared multiple alignment object. The support of shared data modification is currently postponed since the major part of data objects in bioinformatics is rarely altered, but used for the production of some derived information instead. If necessary, this functionality can be added in future UGENE versions. Meanwhile, a possible workaround is the following: shared data can be exported to a local computer, modified locally and then uploaded again. The original objects may be deleted after that.

Our current work is focused on a series of optimizations for the work with a large number of objects. In this regard, the main disadvantage of the system is that the client application constantly monitors the state of all the objects in the database that the client is connected to. As a result, the memory consumed by the client application grows linearly depending on the amount of shared data. It is required to develop an approach implying tracking of only those objects that are in a user’s field of view.

A major drawback of the shared system in its current implementation is that it has lower fault tolerance compared with local files opened by UGENE. Namely, if a shared database was incorporated with a workflow and a fault occurred during computations, the computed results can be lost: since objects are never downloaded completely to the client side, an unexpected database server failure can lead to the loss of the computed results. The solution for such situations is shared data caching by a client application. For example, local relational DBMSs, such as SQLite [16], can be used to recreate the shared storage data model. Applying this technique, computations can be accomplished even after the database server is down.
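The caching idea above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the UGENE implementation: the class names `CachingClient` and `FakeServer`, the single-table cache layout, and the use of `ConnectionError` to signal a server fault are all hypothetical, and `FakeServer` merely stands in for a remote shared database.

```python
import sqlite3


class FakeServer:
    """Hypothetical stand-in for a remote shared database server."""

    def __init__(self, objects):
        self.objects = objects
        self.online = True

    def fetch(self, object_id):
        if not self.online:
            raise ConnectionError("database server is down")
        return self.objects[object_id]


class CachingClient:
    """Mirrors every fetched object into a local SQLite cache."""

    def __init__(self, fetch_remote, cache_path=":memory:"):
        self._fetch_remote = fetch_remote
        self._cache = sqlite3.connect(cache_path)
        self._cache.execute(
            "CREATE TABLE IF NOT EXISTS Object "
            "(id INTEGER PRIMARY KEY, data BLOB NOT NULL)")

    def get(self, object_id):
        try:
            data = self._fetch_remote(object_id)
        except ConnectionError:
            # Server fault: fall back to the local copy if one exists.
            row = self._cache.execute(
                "SELECT data FROM Object WHERE id = ?", (object_id,)).fetchone()
            if row is None:
                raise
            return row[0]
        # Server answered: refresh the local copy before returning.
        self._cache.execute(
            "INSERT OR REPLACE INTO Object (id, data) VALUES (?, ?)",
            (object_id, data))
        self._cache.commit()
        return data
```

With such a wrapper, a computation that has already read its input once can continue from the local copy after the server becomes unavailable; objects that were never fetched still fail, which is why a workflow would have to cache its inputs before starting.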

As for our long-term goals, the emergence of shared databases opens new opportunities. On a high-performance computational cluster, the system would principally allow distributed processing of the shared data using workflows. Being accessible via the Internet, such a system could provide a cloud service for biological data storage and processing. Thus, the development of the shared storage is our first step to that.

The software is freely available (license: GPL) at: http://ugene.net/download.html. The shared storage does not require any other software except the UGENE platform.

Acknowledgements

This work was supported by the UNIPRO Center for Information Technologies (Novosibirsk, Russia). Our appreciation goes to other members of the UGENE project team for their ideas, discussions and help: Olga Golosova, Yuriy Vaskin, Denis Kandrov, Artyom Savchenko, Eugenia Pushkova, Kirill Rasputin, Vladimir Malinovsky, Yuliya Algaer, Alexey Varlamov, Timur Tleukenov, and Konstantin Okonechnikov.

References

[1] K. Okonechnikov, O. Golosova, M. Fursov, the UGENE team. Unipro UGENE: a unified bioinformatics toolkit. Bioinformatics, 28(8): 1166-1167, 2012.

[2] M. Kearse, R. Moir, A. Wilson, S. Stones-Havas, M. Cheung, S. Sturrock, S. Buxton, A. Cooper, S. Markowitz, Ch. Duran, T. Thierer, B. Ashton, P. Meintjes, A. Drummond. Geneious Basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics, 28(12): 1647-1649, 2012.

[3] CLC bio, a QIAGEN Company. CLC Bioinformatics Database. [Online]. URL: http://www.clcbio.com/products/clc-bioinformatics-database. Accessed: 01-Jun-2015.

[4] The Qt Company. Qt. Cross Platform Application & UI Development Framework. [Online]. URL: http://www.qt.io. Accessed: 01-Jun-2015.

[5] GNU Project, Free Software Foundation, Inc. GNU General Public License v2.0. [Online]. URL: http://www.gnu.org/licenses/old-licenses/gpl-2.0.html. Accessed: 01-Jun-2015.

[6] MySQL. The world’s most popular open source database. [Online]. URL: http://www.mysql.com. Accessed: 01-Jun-2015.

[7] Research Collaboratory for Structural Bioinformatics. RCSB Protein Data Bank. [Online]. URL: http://www.rcsb.org/pdb/home/home.do. Accessed: 01-Jun-2015.

[8] National Center for Biotechnology Information, U.S. National Library of Medicine. GenBank Home. [Online]. URL: http://www.ncbi.nlm.nih.gov/genbank. Accessed: 01-Jun-2015.

[9] J. Krause, Q. Fu, J. Good, B. Viola, M. Shunkov, A. Derevianko, S. Pääbo. The complete mitochondrial DNA genome of an unknown hominin from southern Siberia. Nature, 464(7290): 894–897, 2010.

[10] UGENE Public Storage – Unipro UGENE Online User Manual v.1.14.0 – WIKI. [Online]. URL: http://ugene.net/wiki/display/UUOUM/UGENE+Public+Storage. Accessed: 01-Jun-2015.

[11] National Center for Biotechnology Information, U.S. National Library of Medicine. The UniVec Database. [Online]. URL: http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/. Accessed: 01-Jun-2015.

[12] Shared Database – Unipro UGENE Online User Manual v.1.14.0 – WIKI. [Online]. URL: http://ugene.net/wiki/display/UUOUM/Shared+Database. Accessed: 01-Jun-2015.

[13] C. Mungall, D. Emmert, the FlyBase Consortium. A Chado case study: an ontology-based modular schema for representing genome-associated biological information. Bioinformatics, 23(13): 337-346, 2007.

[14] Open Bioinformatics Foundation. BioSQL. [Online]. URL: https://www.biosql.org/wiki/Main_Page. Accessed: 01-Jun-2015.

[15] A. Kasprzyk. BioMart: driving a paradigm change in biological data management. Database: The Journal of Biological Databases and Curation, 2011: bar049, 2011.

[16] SQLite Home Page. [Online]. URL: http://sqlite.org. Accessed: 01-Jun-2015.
