Big Data in Biology and Medicine
Total Page:16
File Type:pdf, Size:1020Kb
FOruM Big Data in Biology and Medicine Based on material from a joint workshop with representatives of the international Data-Enabled Life Science Alliance, July 4, 2013, Moscow, Russia O. P. Trifonova1*, V. A. Il’in2,3, E. V. Kolker4,5, A. V. Lisitsa1 1 Orekhovich Research Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Pogodinskaya Str. 10, Bld. 8, Moscow, Russia, 119121 2 Scientific Research Center “Kurchatov Institute,” Academician Kurchatov Sq. 1, Moscow, Russia 123182 3 Skobel’tsyn Research Institute of Nuclear Physics, Lomonosov Moscow State University, Leninskie Gory 1, Bld. 58, Moscow, Russia, 119992 4 DELSA Global, USA 5 Seattle Children’s Research Institute, 1900 9th Ave Seattle, WA 98101, USA *E-mail: [email protected] he task of extracting new dealing with the analysis of large Big Data world: either molecular knowledge from large data experimental data sets and com- biology in the “omics” format, or in- tsets is designated by the mercial companies developing in- tegrative biology in brain modeling, term “Big Data.” to put it simply, formation systems. the workshop or social sciences? the Big Data phenomenon is when participants delivered 16 short re- the tasks of working with large the results of your experiments can- ports that were aimed at discussing data sets can be subdivided into two not be imported into an excel file. how manipulating large data sets groups: (1) when data are obtained estimated, the volume of twitter is related to the issues of medicine interactively and need to be proc- chats throughout a year is several and health care. essed immediately and (2) when orders of magnitude larger than the workshop was opened by there is a large body of accumulat- the volume of a person’s memory Prof. eugene Kolker, who pre- ed data requiring comprehensive accumulated during his/her entire sented a report on the behalf of interpretation. the former category life. As compared to twitter, all the the Data-enabled Life Science of data is related to commercial sys- data on human genomes constitute Alliance (DeLSA, www.delsaglo- tems, such as Google, twitter, and a negligibly small amount [1]. the bal.org). the alliance supports the Facebook. repositories of genomic problem of converting data sets into globalization of bioinformatics ap- and proteomic data exemplify the knowledge brought up by the u.S. proaches in life sciences and the latter type of data. national Institutes of Health in 2013 establishment of scientific commu- Systems for handling large data is the primary area of interest of the nities in the field of “omics.” the arrays are being developed at the Data-enabled Life Science Alliance main idea is to accelerate transla- Institute for System Programming, (DeLSA, www.delsaglobal.org) [2]. tion of the results of biomedical re- russian Academy of Sciences, with Why have the issues of compu- search to satisfy the needs of the special attention on poorly struc- ter-aided collection of Big Data cre- community. tured and ambiguous data that are ated incentives for the formation Large data sets that need to be typical of the medical and biological of the DeLSA community, which stored, processed, and analyzed fields. collections of software utili- includes over 80 world-leading re- are accumulated in many scientific ties and packages, as well as dis- searchers focused on the areas of fields, in addition to biology; there tributed programming frameworks medicine, health care, and applied is nothing surprising about this fact. running on clusters consisting of information science? this new Large data sets in the field of high- several hundreds and thousands of trend was discussed by the partici- energy physics imply several dozen nodes, are employed to implement pants of the workshop “convergent petabytes; in biology, this number is smart methods for data search, technologies: Big Data in Biology lower by an order of magnitude, al- storage, and analysis. Such projects and Medicine.” though it also approaches petabyte as Hadoop (http://hadoop.apache. the total number of workshop scale. the question discussed dur- org/), Data-Intensive computing, participants was 35, including rep- ing the workshop was what russian and noSQL are used to run search- resentatives of research institutes researchers should focus on in the es and context mechanisms when VOL. 5 № 3 (18) 2013 | ActA nAturAe | 13 FOruM From problem to solution: the experts in the field of data processing, Data-Enabled Life Science Alliance-DELSA, are ready to beat back challenge of NIH handling data sets on a number of the analysis of the resulting data derived using genome, transcrip- modern web sites. will become the highest priority. tome and/or proteome profiling Prof. Konstantin Anokhin (Sci- extraction and interpretation techniques; composite, integrative entific research center “Kur- of information from existing data- measures may be used to quantify chatov Institute”), talked on the bases using novel analytical algo- the distance that separate any two fundamentally novel discipline of rithms will play a key role in science samples. However, as each human connectomics, which is focused on in future. the existence of a large organism has both individual genet- handling data sets by integrating number of open information sourc- ic predispositions and a history of data obtained at various organiza- es, including various databases and environmentalal exposure, the tra- tional levels. Large bodies of data search systems, often impedes the ditional concept of averaged norm will accumulate in the field of neu- search for the desired data. Accord- would not be appropriate for per- roscience because of the merging ing to Andrey Lisitsa (research In- sonalized medicine applications in of two fundamental factors. First, stitute of Biomedical chemistry, its true sense. Instead, Prof. Ancha an enormous amount of results ob- russian Academy of Medical Sci- Baranova introduced the concept of tained using high-resolution ana- ences), existing interactomics data- a multidimensional space occupied lytical methods has been accumu- bases coincide to no more than 55% by set of normal sample tissue and lated in the field of neurosciences. [3]. the goal in handling large data the tissue-specific centers within Second, the main concern of scien- sets is to obtain a noncontradictory this space (“the ideal state of the tists is whole-brain functioning and picture when integrating data tak- tissue”). the diseased tissues will be how its function is projected onto en from different sources. located at a greater distance from the system (mind, thought, action), the concept of dynamic profiling the center as compared to healthy rather than the function of individ- of a person’s health or the state of ones. the proposed approach allows ual synapses. Obtaining data on the an underlying chronic disease us- one to abandon binary (yes/no) pre- functioning of the brain as a system ing entire sets of high throughput dictions and to show the departure includes visualization techniques: data without reducing the dataset of a given tissue sample as a point in high-resolution computed tomog- to the size of the diagnostic biomar- an easily understandable line graph raphy, light microscopy, and elec- ker panels is being developed at the that places each sample in the con- tron microscopy. Megaprojects on research center of Medical Ge- text of other samples collected from brain simulation have already been netics of the russian Academy of patients with the same condition launched (e.g., the Human Brain Medical Sciences. the description and associated with survival and Project in europe); the investments of a normal human tissue requires other post-hoc measures. to obtaining new experimental data one to integrate several thousand Prof. Vsevolod Makeev (Institute will be devalued with time, while quantifiable variables that may be of General Genetics, russian Acad- 14 | ActA nAturAe | VOL. 5 № 3 (18) 2013 FOruM emy of Sciences) asserted in his re- Medical data need to be proc- Watson supercomputer in various port that we will be dealing with essed interactively so that a pre- contexts. this supercomputer was large data sets more frequently in liminary diagnosis can be made no designed by IBM to provide an- the near future. there will be two later than several minutes after swers to questions (theoretically types of data: data pertaining to the the data have been obtained. the any questions!) formulated using individual genome (the 1000 Ge- “Progress” company is currently the natural language. It is one of nomes Project), which are obtained developing a system for remote the first examples of expert sys- once and subsequently stored in monitoring of medical indicators tems utilizing the Big Data princi- databases to be downloaded when using mobile devices and the cel- ple. In 2011, it was announced that required. the second type of data lular network for data transfer the supercomputer will be used to pertains to the transcriptome or (telehealth, report by Oleg Gash- process poorly structured data sets proteome analysis, which is con- nikov). this method allows one to in order to solve medical and health ducted on a regular basis in order provide 24-hour out-of-hospital care problems [7]. to obtain an integrative personal monitoring of a patient, which is When analyzing the problem of omics profile [4]. there are several supposed to reduce medical serv- Big Data in biology and medicine, providers of such data in the case of ices costs in future. At this stage, one should note that the disciplines genomes; russian laboratories can techniques for forming alarm pat- have been characterized by the ac- use these repositories and employ terns are to be developed based on cumulation of large data sets that their own bioinformatics approach- accumulated data; algorithms are describe the results of observations es to arrive at new results [5].