New Data Access Challenges for Data Intensive Research in Russia
© Leonid Kalinichenko (I), Alexander Fazliev (II), Eugene Gordov (III), Nadezhda Kiselyova (IV), Dana Kovaleva (V), Oleg Malkov (V), I. Okladnikov (III), Nikolay Podkolodny (VI), Natalia Ponomareva (VII), Alexey Pozanenko (VIII), Sergey Stupnikov (I), Alina Volnova (VIII)

I Institute of Informatics Problems, FRC CSC RAS, Moscow
II Institute of Atmospheric Optics, Siberian Branch of RAS, Tomsk
III Institute of Monitoring of Climatic and Ecological Systems, Siberian Branch of RAS, Tomsk
IV Institute of Metallurgy and Materials Science of RAS, Moscow
V Institute of Astronomy of RAS, Moscow
VI Institute of Cytology and Genetics, Siberian Branch of RAS, Novosibirsk
VII Research Center of Neurology, Moscow
VIII Space Research Institute, RAS, Moscow

Abstract

The goal of this survey is to analyze global trends in the development of massive data collections and related infrastructures worldwide, in order to evaluate the opportunities for shared usage of such collections during research, decision making and problem solving in various data intensive domains (DIDs) in Russia. The representative set of DIDs selected for the survey includes astronomy, genomics and proteomics, neuroscience (human brain investigation), materials science and Earth sciences. For each such DID, the strategic initiatives (or large projects) in the USA and Europe aimed at creating big data collections and the respective infrastructures planned up to 2025 are briefly overviewed. The IT projects aimed at developing infrastructures that support access to and analysis of such data collections are also briefly overviewed. The paper concludes with the idea of organizing in Russia a targeted interdisciplinary program for the development of a pilot project of a distributed infrastructure and platform for access to various kinds of data in the world, storage of data and their analysis during research in various DIDs. As a part of such an infrastructure, the program should also include the development of a high performance interdisciplinary center for the support of data intensive applications in various DIDs. This survey is also intended to serve as a basis for the panel discussion at the International Conference DAMDID/RCDL’2015.

Proceedings of the XVII International Conference DAMDID/RCDL’2015 «Data Analysis and Management in Data Intensive Domains», Obninsk, October 13-16, 2015

1 Introduction

Nowadays scientific research and decision making in various areas of human activity are based on data analysis. The volume and variety of data accumulated in the respective domains grow exponentially.

In the book The Fourth Paradigm: Data-Intensive Scientific Discovery [30], “data intensive sciences” are introduced with reference to ideas formulated by Jim Gray in 2007. Gray distinguishes four paradigms of scientific research in the order of their historical appearance: Empirical Science (describing natural phenomena), Theoretical Science (using models to achieve generalizations), Computational Science (simulating complex phenomena) and Data Exploration (unifying theory, experiment and simulation). According to the Fourth Paradigm, data intensive research is an integral part of various areas of science, economics and business. These areas are designated below as data intensive domains (DIDs).

Research and development in DIDs are inconceivable without new data obtained through observations and measurements in nature and society. The process of data extraction, processing and analysis resembles mineral mining operations. Minerals are mined, processed and transformed into materials for the development of different products. Similarly, data are extracted from nature for observable phenomena and processes. Data extraction from nature becomes more and more complicated and sophisticated as knowledge develops. This complexity is motivated by the increasing scale of the micro and macro phenomena to be investigated. Global projects and missions (including space ones) are organized to extract and accumulate data with the help of the newest specialized high-technology instruments located on the Earth and in space. Data extraction is a very costly process requiring the development of specific technologies and huge investments. The process of data extraction during the investigation of some kind of phenomenon in a DID can take many years. The result of data extraction is raw data (“ore”) that have to be processed and analyzed. Alongside the development of data extraction, rapid advancement and expansion take place in the following areas:
• methods and tools for data accumulation, processing, analysis and management in different DIDs;
• the variety of problems to be solved on the basis of extracted data;
• accumulation of experience in solving such problems and interdisciplinary usage of their solutions.

The considerations mentioned above motivate the authors to analyze global trends in the development of massive data collections in the world and the opportunities for shared usage of such collections during research, decision making and problem solving in various DIDs in Russia.

The main motivation for this work is to initiate a systematic analysis of the following topics, which attract significant interest in the world:
• development of massive data collections in various DIDs;
• development of infrastructures for the accumulation and usage of massive data collections;
• systematization of the experience of problem solving in DIDs; etc.

Some of the pragmatic aims of this analysis include:
• revealing technical, legal and financial issues of access of Russian scientists in various DIDs to accumulated and expected data collections in the world¹;
• determining needs for specific hardware and software infrastructures for access to massive data collections from Russia;
• determining possibilities for Russia to contribute to the “world data treasury” and to the creation of infrastructures, methods and tools for data analysis and problem solving.

¹ Data extraction in many DIDs, as well as the acquisition of concrete data collections, involves high technological complexity and cost. “Import substitution” is out of the question in this area for at least 10 years.

Preliminary analysis shows that the Western world is highly concerned with the issues caused by the “flood” of big data in DIDs. The issues include data analysis (for instance, natural language text analysis as a part of cognition), data accumulation and interdisciplinary usage, and the design of specific infrastructures aimed at handling the “data flood” caused by installing new facilities and instruments for big data extraction and processing in nearly real time. Many activities are undertaken in this direction, including the organization of joint projects, working groups and their symposia and conferences, and discussion of possible solutions; design and development of new infrastructures; design and testing, right now, of fragments of infrastructures oriented toward access to and analysis of data collections planned to start functioning after 2020; and development of use cases for the problems to be solved.

The situation in Russia is such that without timely and effective access to data (the most important data collections are accumulated abroad), scientific research in various areas of many DIDs will become less and less efficient.

The data collections overviewed in this paper and the examples of their usage are time limited: large projects intended for data accumulation and usage in various DIDs carried out up to 2025 are considered. The representative set of DIDs, including astronomy, genomics and proteomics, neuroscience (human brain investigation), materials science and Earth sciences, was predefined² for our overview and analysis. This set also includes informatics, accompanied by a collection of existing data analysis methods (machine learning, data mining, statistics) and software and hardware platforms.

² This set was formed during preparation of the DAMDID/RCDL’2015 conference. The set can be enlarged in the future. Physics is intentionally not included in the set. For example, methods and tools for the analysis of data obtained from the hadron collider evolve very fast. The LHC project occupies a distinguished position in Russia; the data to be processed and the infrastructure are very specific.

Researchers from different institutions of RAS, SB RAS and RAMS participated in this work. The list of institutions includes Federal Research Center “Computer Science and Control”, Space Research Institute, Institute of Astronomy, Institute of Cytology and Genetics, Research Center of Neurology, Institute of Metallurgy and Materials Science, Institute of Monitoring of Climatic and Ecological Systems, and Institute of Atmospheric Optics.

The paper is structured as follows. For every DID considered in the next sections the following information is provided:
• large strategic initiatives in the USA and Europe;
• examples of massive data collections in the world up to 2025;
• known infrastructure projects and data centers;
• comparable projects in Russia.

In selecting initiatives, collections and infrastructures, the authors attempted to choose the projects collecting unique data that are important for scientific research in Russia. Volume of such collections should be at least rare events: each object will be observed