FORUM

Big Data in Biology and Medicine Based on material from a joint workshop with representatives of the international Data-Enabled Life Science Alliance, July 4, 2013, ,

O. P. Trifonova1*, V. A. Il’in2,3, E. V. Kolker4,5, A. V. Lisitsa1 1 Orekhovich Research Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Pogodinskaya Str. 10, Bld. 8, Moscow, Russia, 119121 2 Scientific Research Center “Kurchatov Institute,” Academician Kurchatov Sq. 1, Moscow, Russia 123182 3 Skobel’tsyn Research Institute of Nuclear Physics, Lomonosov Moscow State University, Leninskie Gory 1, Bld. 58, Moscow, Russia, 119992 4 DELSA Global, USA 5 Seattle Children’s Research Institute, 1900 9th Ave Seattle, WA 98101, USA *E-mail: [email protected]

he task of extracting new dealing with the analysis of large Big Data world: either molecular knowledge from large data experimental data sets and com- biology in the “omics” format, or in- Tsets is designated by the mercial companies developing in- tegrative biology in brain modeling, term “Big Data.” To put it simply, formation systems. The workshop or social sciences? the Big Data phenomenon is when participants delivered 16 short re- The tasks of working with large the results of your experiments can- ports that were aimed at discussing data sets can be subdivided into two not be imported into an Excel file. how manipulating large data sets groups: (1) when data are obtained Estimated, the volume of is related to the issues of medicine interactively and need to be proc- chats throughout a year is several and health care. essed immediately and (2) when orders of magnitude larger than The workshop was opened by there is a large body of accumulat- the volume of a person’s memory Prof. Eugene Kolker, who pre- ed data requiring comprehensive accumulated during his/her entire sented a report on the behalf of interpretation. The former category life. As compared to Twitter, all the the Data-Enabled Life Science of data is related to commercial sys- data on human genomes constitute Alliance (DELSA, www.delsaglo- tems, such as Google, Twitter, and a negligibly small amount [1]. The bal.org). The alliance supports the . Repositories of genomic problem of converting data sets into globalization of bioinformatics ap- and proteomic data exemplify the knowledge brought up by the U.S. proaches in life sciences and the latter type of data. National Institutes of Health in 2013 establishment of scientific commu- Systems for handling large data is the primary area of interest of the nities in the field of “omics.” The arrays are being developed at the Data-Enabled Life Science Alliance main idea is to accelerate transla- Institute for System Programming, (DELSA, www.delsaglobal.org) [2]. tion of the results of biomedical re- Russian Academy of Sciences, with Why have the issues of compu- search to satisfy the needs of the special attention on poorly struc- ter-aided collection of Big Data cre- community. tured and ambiguous data that are ated incentives for the formation Large data sets that need to be typical of the medical and biological of the DELSA community, which stored, processed, and analyzed fields. Collections of utili- includes over 80 world-leading re- are accumulated in many scientific ties and packages, as well as dis- searchers focused on the areas of fields, in addition to biology; there tributed programming frameworks medicine, health care, and applied is nothing surprising about this fact. running on clusters consisting of information science? This new Large data sets in the field of high- several hundreds and thousands of trend was discussed by the partici- energy physics imply several dozen nodes, are employed to implement pants of the workshop “Convergent petabytes; in biology, this number is smart methods for data search, Technologies: Big Data in Biology lower by an order of magnitude, al- storage, and analysis. Such projects and Medicine.” though it also approaches petabyte as Hadoop (http://hadoop.apache. The total number of workshop scale. The question discussed dur- org/), Data-Intensive Computing, participants was 35, including rep- ing the workshop was what Russian and NoSQL are used to run search- resentatives of research institutes researchers should focus on in the es and context mechanisms when

VOL. 5 № 3 (18) 2013 | Acta naturae | 13 FORUM

From problem to solution: the experts in the field of data processing, Data-Enabled Life Science Alliance-DELSA, are ready to beat back challenge of NIH

handling data sets on a number of the analysis of the resulting data derived using genome, transcrip- modern web sites. will become the highest priority. tome and/or proteome profiling Prof. Konstantin Anokhin (Sci- Extraction and interpretation techniques; composite, integrative entific Research Center “Kur- of information from existing data- measures may be used to quantify chatov Institute”), talked on the bases using novel analytical algo- the distance that separate any two fundamentally novel discipline of rithms will play a key role in science samples. However, as each human connectomics, which is focused on in future. The existence of a large organism has both individual genet- handling data sets by integrating number of open information sourc- ic predispositions and a history of data obtained at various organiza- es, including various and environmentalal exposure, the tra- tional levels. Large bodies of data search systems, often impedes the ditional concept of averaged norm will accumulate in the field of neu- search for the desired data. Accord- would not be appropriate for per- roscience because of the merging ing to Andrey Lisitsa (Research In- sonalized medicine applications in of two fundamental factors. First, stitute of Biomedical Chemistry, its true sense. Instead, Prof. Ancha an enormous amount of results ob- Russian Academy of Medical Sci- Baranova introduced the concept of tained using high-resolution ana- ences), existing interactomics data- a multidimensional space occupied lytical methods has been accumu- bases coincide to no more than 55% by set of normal sample tissue and lated in the field of neurosciences. [3]. The goal in handling large data the tissue-specific centers within Second, the main concern of scien- sets is to obtain a noncontradictory this space (“the ideal state of the tists is whole-brain functioning and picture when integrating data tak- tissue”). The diseased tissues will be how its function is projected onto en from different sources. located at a greater distance from the system (mind, thought, action), The concept of dynamic profiling the center as compared to healthy rather than the function of individ- of a person’s health or the state of ones. The proposed approach allows ual synapses. Obtaining data on the an underlying chronic disease us- one to abandon binary (yes/no) pre- functioning of the brain as a system ing entire sets of high throughput dictions and to show the departure includes visualization techniques: data without reducing the dataset of a given tissue sample as a point in high-resolution computed tomog- to the size of the diagnostic biomar- an easily understandable line graph raphy, light microscopy, and elec- ker panels is being developed at the that places each sample in the con- tron microscopy. Megaprojects on Research Center of Medical Ge- text of other samples collected from brain simulation have already been netics of the Russian Academy of patients with the same condition launched (e.g., the Human Brain Medical Sciences. The description and associated with survival and Project in Europe); the investments of a normal human tissue requires other post-hoc measures. to obtaining new experimental data one to integrate several thousand Prof. Vsevolod Makeev (Institute will be devalued with time, while quantifiable variables that may be of General Genetics, Russian Acad-

14 | Acta naturae | VOL. 5 № 3 (18) 2013 FORUM

emy of Sciences) asserted in his re- Medical data need to be proc- Watson supercomputer in various port that we will be dealing with essed interactively so that a pre- contexts. This supercomputer was large data sets more frequently in liminary diagnosis can be made no designed by IBM to provide an- the near future. There will be two later than several minutes after swers to questions (theoretically types of data: data pertaining to the the data have been obtained. The any questions!) formulated using individual genome (the 1000 Ge- “Progress” company is currently the natural language. It is one of nomes Project), which are obtained developing a system for remote the first examples of expert sys- once and subsequently stored in monitoring of medical indicators tems utilizing the Big Data princi- databases to be downloaded when using mobile devices and the cel- ple. In 2011, it was announced that required. The second type of data lular network for data transfer the supercomputer will be used to pertains to the transcriptome or (Telehealth, report by Oleg Gash- process poorly structured data sets proteome analysis, which is con- nikov). This method allows one to in order to solve medical and health ducted on a regular basis in order provide 24-hour out-of-hospital care problems [7]. to obtain an integrative personal monitoring of a patient, which is When analyzing the problem of omics profile [4]. There are several supposed to reduce medical serv- Big Data in biology and medicine, providers of such data in the case of ices costs in future. At this stage, one should note that the disciplines genomes; Russian laboratories can techniques for forming alarm pat- have been characterized by the ac- use these repositories and employ terns are to be developed based on cumulation of large data sets that their own bioinformatics approach- accumulated data; algorithms are describe the results of observations es to arrive at new results [5]. to be modified for each patient. since the natural philosophy era. The flow of dynamic data for in- The report on the problem of During the genomic era, the aim dividuals (results of monitoring the collecting and processing the geo of data accumulation seemed to be parameters of the organism) will in- location data that are accumulated understandable. However, as the crease as modern analytical meth- by mobile network operators and technical aspect was solved and the ods are adopted. Researchers will collected by aggregators, such as genome deciphered, it turned out face the need for rapid processing Google, Facebook, and AlterGeo, that the data was poorly related to of continuously obtained data and appeared to lie beyond the work- the problems of health maintenance for transferring the information to shop’s topic on the face of it. The [8]. repositories for further annotation lecturer, Artem Wolftrub (lead- In the post-genomic era, bio- and automated decision-making. ing developer at Gramant Ltd.), medical science has returned back There emerges the need for modi- reported that a number of papers to the level of phenomenological fying the technology of data storage have been published by a group description oriented towards data and transfer to ensure a more rap- led by Alex Pentland and David collection only, without an under- id exchange of information. Cloud Laser (Massachusetts Institute of standing of the prospect of its fur- services for storing and transfer- Technology) since 2009, where it ther interpretation. The Human ring large sets of data exist already has been substantiated that the Proteome Project is such an exam- (e.g., AmazonS3). analysis of geo data can be no less ple: data for each protein are col- The development of more rapid informative for predicting socially lected; however, it is not always a methods of mathematical analysis important diseases than the genome given that these data can be used also plays a significant role. The is. Environmental factors (the so- in the applied problems of in-vitro report delivered by Ivan Oseledets called exposome) play a significant diagnostics. Another example is the (Institute of Computational Math- role in the pathogenesis of multi- Human Connectome Project, which ematics, Russian Academy of Sci- gene diseases. Data regarding the is aimed at accumulating data on ences) focused on the mathematical exposome can be obtained with a signal transduction between neu- apparatus for compact presenta- sufficient degree of detail by ana- rons in expectation of the fact that tion of multidimensional arrays lyzing the relocations of a person, having been accumulated to a cer- based on tensor trains (tensor train by comparing the general regulari- tain critical level, these data will format, TT-format). Multidimen- ties of population migrations, and allow one to simulate human brain sional tasks constantly emerge in by identifying the patterns that activity using a computer. biomedical applications; the TT- correlate with health risks (e.g., In summary, the workshop par- format allows one to identify the development of cardiovascular dis- ticipants noted that the Big Data key variables that are sufficient to eases or obesity [6]). phenomenon is related to the new- describe the system or process un- In their discussions, the work- ly available opportunity of mod- der study. shop participants mentioned the ern technogenic media to generate

VOL. 5 № 3 (18) 2013 | Acta naturae | 15 FORUM and store data; however, there is on analyzing Big Data so that the acquainted with the data accu- no clear understanding as to the data array can be converted into mulated within the “Connectome” reason and purpose for the ac- hypotheses applicable for verifica- Project is bound to be the main di- cumulation of such data. Russian tion using a point-wise biochemi- rection of development at the Rus- scientists should primarily focus cal experiment. The task of getting sian subgroup of DELSA.

REFERENCES 5. Tsoy O.V., Pyatnitskiy M.A., Kazanov M.D., Gelfand M.S. 1. Hesla L. Particle physics tames big data // Symmetry. Au- // BMC Evolutionary Biology. 2012. № 12. (doi: 10.1186/1471- gust 01, 2012. (http://www.symmetrymagazine.org/article/ 2148-12-200). august-2012/particle-physics-tames-big-data). 6. Pentland A., Lazer D., Brewer D., Heibeck T. // Studies in 2. Kolker E., Stewart E., Ozdemir V. // OMICS. 2012. V. 3. Health Technology and Informatics. 2009. № 149. Р. 93–102. № 16. P. 138–147. 7. Wakeman N. // IBM’s Watson heads to medical school. 3. Lehne B., Schlitt T. // Human Genomics. 2009. № 3. Washington Technology. February 17, 2011. (http://wash- Р. 291–297. ingtontechnology.com/articles/2011/02/17/ibm-watson- 4. Li-Pook-Than J., Snyder M. // Chemistry & Biology. 2013. next-steps.aspx). № 20. P. 660–666. 8. Bentley D.R. // Nature. 2004. V. 429. № 6990. Р. 440–445.

16 | Acta naturae | VOL. 5 № 3 (18) 2013