Issues in Integrating Scientific Databases

Project Number: MY12 Issues in Integrating Scientific Databases By Amira Tokatli-Apollon Mohamad El-Rifai An Interactive Qualifying Project submitted to the faculty of Worcester Polytechnic Institute in partial fulfillment of the requirements for the Degree of Bachelor of Science Degree of Bachelor of Science Submitted By: _____________________________ ____________________________ Mohamad El-Rifai Amira Tokatli-Apollon October 11, 2012 APPROVED: _______________________________ Advisor: Prof. Mohamed Y. Eltabakh !" " !!!!"#$%&'$" Scientific databases have emerged over the past two decades to store, analyze, and query science data such as in chemistry, biology, medicine, and earth sciences, among others. One of the major issues in scientific applications is that data are typically generated from the collaboration among many scientific labs and groups, which usually results in many challenging issues in integrating such data and managing their conflicts. These challenges represent a major hurdle to scientists that slowdown the progress in these fields. In this project, we study several of these challenges and highlight them through examples. We take biological databases as a test case in our project. We then focus on one major issue, which is “managing the conflicting data” and propose recommendations and potential solutions to address this issue. #" " Table of Contents: " 1. Abstract…………………………………………………………………….…………………..2 2. Introduction ............................................................................................................................... 5 2.1 Scientific Databases ............................................................................................................... 5 2.2 Example of Scientific Databases ........................................................................................... 6 2.3 Focus and Goals of the Project .............................................................................................. 7 2.4 Target Audience .................................................................................................................... 7 2.5 Roadmap ................................................................................................................................ 8 3. Background ................................................................................................................................ 9 3.1 Scientific Data Management Background ............................................................................. 9 3.1.1 Challenges Involved in Scientific Database ................................................................. 10 3.1.2 Centers for Storing Scientific Data ............................................................................... 12 3.1.3 Scientific Data Formats ................................................................................................. 19 3.1.4 Metadata ........................................................................................................................ 21 3.2 Biological Science Background ........................................................................................... 22 4. Issues in Integrating Scientific Databases ............................................................................. 31 4.1 Overview on Data Warehouse ............................................................................................. 31 4.2 Integrating Heterogeneous Data .......................................................................................... 32 4.3 Heterogeneity Problems ...................................................................................................... 33 4.4 Models of Data Integration .................................................................................................. 34 5. Data Collection ........................................................................................................................ 36 5.1 Preparation and Collection Process ..................................................................................... 36 5.2 Data Sources ........................................................................................................................ 37 5.3 Downloading the Data ......................................................................................................... 39 5.4 Building the Local Database ................................................................................................ 41 5.4.1 Description of Important Fields .................................................................................... 43 6. Data Analysis ........................................................................................................................... 45 7. Proposed Solutions for Conflict Management ..................................................................... 55 7.1 Taxonomy of Existing Model: LITCHI……………………………………………….…..55 7.1.1 Overview of Existing Model: LITCHI…………………………….………….………55 7.1.2 Exploitation and Extensions to LITCHI………………………..………….….………58 7.2 Extended Relational Database Model………………………………………....…………..59 8. Conclusion……………………………...…..……………………………………...…………62 9. References……………………….………………….……………………….....………….….63 $" " List of Figures 2.1 Shifting from file system to DBMSs for storing and querying the data..... 5 3.1 Gene Structure…………………………………………………………… 23 3.2 Gene Sequence…………………………………………………………... 24 3.3 DNA……………………………………………………………………... 24 3.4 DNA Molecule Structure…………………………………………...…… 24 3.5 Example of DNA Sequence…………………………………………...… 25 3.6 Genome Mutation……………………………………………………….. 29 4.1 Reasons of Data Heterogeneity.................................................................. 32 4.2 Federated Databases Diagram.................................................................... 34 4.3 Warehouse Database Diagram................................................................... 35 4.4 Mediation Mechanism Diagram................................................................. 35 5.1 &'()*+",-.')/(*'"/0)"1)02*0"343.'5............................................................... 37 5.2 6(.(3'.3"(7(,8(98'",-"10).2*0"343.'5............................................................. 38 5.3 2:(5;8'"<(.("/)05"10).2*0"343.'5................................................................ 38 5.4 &=>"3'()*+",-.')/(*'"/0)"?'-0@(3'................................................................ 39 5.5 A+'"(-3B')".0"(-"&=>"CD')4",-"?'-0@(3'.......................................................39 5.6 A(98'3")'8(.,0-(8"3*+'5(................................................................................44 7.1 LITCHI System………………………………………………..………… 57 %" " #!"$%&'()*+&$(%" #,-"./0123040/")53565717" Scientists in many fields, e.g., biology, chemistry, physics, medicine and healthcare, astronomy, and earth sciences, have been, for many years, collecting and analyzing data outside the database management systems (DBMSs). This is mostly because scientists used old-fashion methods such as file systems for storing their data, and on top of that most analysis tools have been designed to operate on files. For example, biologists used to store their data in flat files and process those using Perl scripts and other domain-specific tools such as BLAST (Basic Local Alignment Search Tool: http://blast.ncbi.nlm.nih.gov/Blast.cgi). However, with the scientific discoveries and the rate of collecting and generation data, the storage and analysis of scientific data becomes a major hurdle for scientists? They realized that they are of a great need to more complex systems for managing their data than flat files. That is where the database management systems (DBMSs) come into place (Refer to Figure 2.1). DBMSs offer many unique features such as structured schemas, data consistency, high-level query language, recovery and concurrency mechanisms, indexing techniques, and others. Therefore, over the past two decades, DBMSs have provided eminent support for various scientific applications, and since then scientific databases have become commonly used by the scientific community. Figure 2.1: Shifting from file system to DBMSs for storing and querying the data. E" " #,#"895:;<1"=4"./0123040/")53565717>"?0=<=@0/5<")53565717" In this project, we focus on one type of scientific data, which is biological data. In that domain, biologists need to store and analyze massive amounts of biological data generated from computer stimulations, instruments, and lab in database systems in order to perform their work. That is why scientific databases play significant role in computational biology. There are several hundred biological databases and it is increasing at a staggering rate, which makes it very difficult to keep an accurate track of all of the data collected throughout the years. Even though many biologists have pen-and-paper method, they found it very hard to keep up using a regular notebook fitting all of the collected data and comparing it with other data. Moreover, with these data accumulating at a high rate, they started to have bigger issues than just collecting the data. Therefore, biology applications were among the first scientific applications to use databases and share the data across the world. Examples of these databases are MetaBase1, DBcat 2 , and GenBank 3 that are among the biggest biological databases created through collaboration of many research labs. DBcat, for example, is a comprehensive catalog for 500 biological databases, while GenBank is one of the major databases which contains

Issues in Integrating Scientific Databases

BMC Genomics Biomed Central

Enzymes from Deep-Sea Microorganisms - Takami, Hideto

Genome Diversity of Spore-Forming Firmicutes MICHAEL Y

Complete Genome Sequence and Function Gene Identify of Prometryne-Degrading Strain Pseudomonas Sp

Patterns of Horizontal Gene Transfer Into the Geobacillus Clade

Isolation and Diversity of Sediment Bacteria in The

Nicotinamidase from Oceanobacillus Iheyensis HTE831

Reorganising the Order Bacillales Through Phylogenomics

The Life-Cycle of Operons

NOTE Oceanobacillus Kimchii Sp. Nov. Isolated from a Traditional

Phylogenetic Identification of Marine Bacteria Isolated from Deep-Sea

Microbial Culturomics to Map Halophilic Bacterium in Human Gut: Genome Sequence and Description of Oceanobacillus Jeddahense Sp