
A DATA MART FOR ANNOTATED PROTEIN SEQUENCE EXTRACTED FROM UNIPROT

Maulik Vyas B.E., C.I.T.C, India, 2007

PROJECT

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER SCIENCE

at

CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL 2011

A DATA MART FOR ANNOTATED PROTEIN SEQUENCE EXTRACTED FROM UNIPROT DATABASE

A Project

By

Maulik Vyas

Approved by:

______, Committee Chair Meiliu Lu, Ph.D.

______, Second Reader Ying Jin, Ph.D.

______ Date


Student: Maulik Vyas

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and that credit is to be awarded for the Project.

______, Graduate Coordinator Nikrouz Faroughi, Ph.D. ______ Date

Department of Computer Science


Abstract

of

A DATA MART FOR ANNOTATED PROTEIN SEQUENCE EXTRACTED FROM UNIPROT DATABASE

by

Maulik Vyas

Data warehouses are used by various organizations to organize, understand and use their data, with the help of provided tools and architectures, to make strategic decisions. A biological database such as the annotated protein sequence database is a subject-oriented, non-volatile collection of data related to protein synthesis used in bioinformatics. A data mart contains a subset of enterprise data from the data warehouse that is of value to a specific group of users. I implemented a data mart, based on data warehouse design principles and techniques, on the protein sequence database using data provided by the Swiss Institute of Bioinformatics. While the data warehouse contains information about many protein sequence areas, a data mart focuses on one or more subject areas. It brings together experimental results, computed features and scientific conclusions by implementing a data cube that supports the data warehouse, making it easier for organizations to distribute data within a unit. This enables them to deploy the data, manipulate it and develop the protein sequence data any way they see fit. The main goal of this project is to provide consistent, accurate annotated protein sequence data to a group of researchers working on protein sequences. I extracted a subset of this data, transformed it and loaded it into a staging area. I used HJSplit to split the XML protein sequence data into equal parts and extracted information using an XML editor. I populated the database tables in Microsoft Access 2010 from the XML file. Once the database was set up, I used MySQL Workbench 5.2 CE to generate queries related to the star schema. Finally, I implemented the star schema, OLAP operations, the data cube and drill up/down operations for strategic analysis of the protein sequence database based on SQL queries. This ensured explicit support for dimensions, aggregation and long-range analysis.

______, Committee Chair Meiliu Lu, Ph.D.

______ Date


DEDICATION

This project is dedicated to my beloved parents, Kirankumar Vyas and Jayshree Vyas, for their never-ending sacrifice, love, support and understanding. I would also like to dedicate this to my loving wife, Tanvi Desai, for encouraging me to pursue a Master's in Computer Science and for being a pillar of support for me throughout.


ACKNOWLEDGMENTS

I would like to extend my gratitude to my project advisor, Dr. Meiliu Lu, Professor of Computer Science, for guiding me throughout this project and helping me complete it successfully. I am also thankful to Dr. Ying Jin, Professor of Computer Science, for reviewing my report. I am grateful to Dr. Nikrouz Faroughi, Graduate Coordinator, Department of Computer Science, for reviewing my report and providing valuable feedback. In addition, I would like to thank the Department of Computer Science at California State University, Sacramento for extending this opportunity to me to pursue this program and guiding me all the way to become a successful student.

Lastly, I would like to thank my parents, Kirankumar Vyas and Jayshree Vyas, and my loving wife, Tanvi Desai, for providing me with moral support and encouragement throughout my life.


TABLE OF CONTENTS

Dedication

Acknowledgments

List of Figures

List of Abbreviations

Chapter

1. INTRODUCTION

1.1 Introduction to Data Warehousing

1.2 Introduction to Annotated Protein Sequence and UniProt

1.3 Goal of the Project

2. COLLECTION AND ANALYSIS OF UNIPROT IN PROTEIN SEQUENCE

2.1 Collecting UniProt in Protein Sequence

2.2 Extract, Transform and Load (ETL)

3. DESIGNING STAR SCHEMA FOR UNIPROT

3.1 Introduction to Star Schema

3.2 Designing a Star Schema

3.2.1 Mapping Dimensions into Tables

3.2.2 Dimensional Hierarchy

4. OLAP OPERATIONS IMPLEMENTED ON UNIPROT

4.1 Introduction to Online Analytical Processing

4.2 Types of OLAP Operations

4.3 Data Cube

4.4 OLAP Operations

5. TESTING

5.1 Test Cases

6. CONCLUSIONS

6.1 Summary

6.2 Strengths and Weaknesses

6.2.1 Strengths

6.2.2 Weaknesses

6.3 Future Work

Bibliography


LIST OF FIGURES

Figure 1-1 Data Warehouse Architecture

Figure 2-1 Structure of ETL and Data Warehouse

Figure 2-2 Sample XML File During Extraction

Figure 2-3 Implementing Transformation on UniProt

Figure 3-1 Dimension Table of Source

Figure 3-2 Sample Data from Source Table

Figure 3-3 Sample Data from Gene Table

Figure 3-4 Sample Data from Isoform Table

Figure 3-5 Star Schema Example 1

Figure 3-6 Sample Output of the SQL Query 1

Figure 3-7 Sample Data of Entry Table

Figure 3-8 Star Schema Example 2

Figure 4-1 Front View of Data Cube

Figure 4-2 Data Cube

Figure 4-3 Sample Output of the SQL Query 2


LIST OF ABBREVIATIONS

DW: Data Warehouse

OLAP: Online Analytical Processing

ETL: Extract, Transform and Load

XML: Extensible Markup Language

OLTP: Online Transaction Processing



Chapter 1

INTRODUCTION

1.1 Introduction to Data Warehousing

A data warehouse (DW) is a methodological approach to organizing and managing databases used for reporting and analysis while providing an organization with trustworthy, consistent data for the applications it runs. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations before it is used in the DW for reporting. A data warehouse depicts data and its relationships by drawing a distinction between data and information [1].

The data is cleaned to remove redundancy, transformed into compatible data and then made available to managers and professionals handling data mining, online analytical processing (OLAP), market research and decision support.

Essential components of data warehousing include analyzing data, extracting data from source databases, transforming data and managing the data warehouse [2].


Figure 1-1: Data Warehouse Architecture

1.2 Introduction to Annotated Protein Sequence and UniProt

The protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB. Protein sequences are the fundamental determinants of biological structure and function.

Bioinformatics has revolutionized the biological industry by applying computer technology to the management of biological information. Today's computers are able to gather, store, analyze and integrate biological information that can be applied to protein-based or gene-based drug discovery [3].

UniProt is a scientific community resource that provides high-quality, freely available protein sequence and functional information. I used the manually annotated protein sequence database maintained by the Swiss Institute of Bioinformatics [4].

1.3 Goal of the Project

The goal of this project is to understand the interconnection of data warehousing with UniProt, which is a part of bioinformatics, as well as to carry out research on the current and potential applications of data warehousing with the available database. This project covers the star schema that can be applied to the database. Research issues that still need to be explored are discussed at the end of the project report. This report is structured as follows: Chapter 2 discusses how the UniProt protein sequence data was collected and analyzed. Chapter 3 discusses the design of a star schema for UniProt. Chapter 4 discusses the OLAP operations and data cube implemented on UniProt to get stable, non-redundant data. Chapter 5 discusses testing. Chapter 6 gives a summary of implementing data mart and data warehouse concepts on annotated protein sequence data, the strengths and weaknesses of using the star schema and OLAP operations on that data, and future work.


Chapter 2

COLLECTION AND ANALYSIS OF UNIPROT IN PROTEIN SEQUENCE

This chapter discusses the procedure for collecting UniProt data from the Swiss Institute of Bioinformatics website. It furthermore discusses the analysis done on the protein sequence data extracted from its XML files.

2.1 Collecting UniProt in Protein Sequence

The Universal Protein Resource (UniProt) is the data bank for protein sequence and annotation data. It is accessible at www.uniprot.org and is a collaboration among the European Bioinformatics Institute, the Swiss Institute of Bioinformatics and the Protein Information Resource. I settled on the annotated protein sequence data from the Swiss Institute of Bioinformatics. UniProt is updated every four weeks. I opted for UniProtKB and came across three data sets in three different file formats: XML, FASTA and plain text.

To get to know more about FASTA, I downloaded the database with the .fasta extension and then looked around for supporting software, finding a program named FASTA 1.0.1. FASTA is a scientific data format used to store nucleic acid sequences (such as DNA sequences) or protein sequences. The format may contain multiple sequences and is therefore sometimes referred to as the FASTA database format. FASTA files often start with a header line that may contain comments or other information; the rest of the file contains sequence data. Each sequence starts with a '>' symbol followed by the name of the sequence, the rest of that line describes the sequence, and the remaining lines contain the sequence itself. In order to use the full FASTA data set, however, one has to have a computer with a high hardware configuration.
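For illustration, a FASTA record has the following shape (a made-up example for this report, not an excerpt from the UniProt data set):

>sp|P00001|EXMPL_HUMAN Example protein OS=Homo sapiens
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVK
ALPDAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVY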


Since it was not feasible to get a high-performance machine, I decided to go with XML, as it is a machine- and platform-friendly file format for most large data sets. Once I downloaded the data set in XML format, I extracted it to get a complete XML document. This XML file is over 2 GB in size, so I truncated it in order to use the data from the document. For this I used HJSplit to split the file into pre-defined parts (10 MB each). This enabled me to use the data more effectively, since I could generate more fields along with individual tables. Once the XML was opened, I analyzed the data in the document by creating the missing links in it, and I sorted the document for more clarity about the data flow. Each piece of data was categorized into a different type of table. For example, there was location data for both the subcellular entries and the genes, so it was put into a set of tables belonging to the location group; the same was done for gene names, isoforms, and organisms. I followed the Extract, Transform and Load (ETL) procedure, which is explained in detail below.

2.2 Extract, Transform and Load (ETL)

ETL is a process to extract data from different types of systems, transform it into a structure that can be used for analysis and reporting, and then load it into a database and/or cube.


Figure 2-1: Structure of ETL and Data Warehouse

Extract: I extracted data from an external source, the UniProt Swiss database website. This data is mostly structured and/or semi-structured. Since the data was in an XML document, it was hard to query it directly due to incompatibility, so I put the data in a staging area that is structured in the same way as the original data from the website. I then had to extract individual fields from the XML file, which I had split into multiple parts in order to get the database fields. Below is one of many XML files generated after extracting it from the Universal Protein website.


Figure 2-2: Sample XML File During Extraction.
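For orientation, an entry in the UniProtKB XML data set has roughly the following shape (the element names follow the public UniProt XML schema, but this is an abbreviated sketch rather than a verbatim excerpt):

<entry dataset="Swiss-Prot">
  <accession>P11171</accession>
  <name>41_HUMAN</name>
  <protein>...</protein>
  <gene>
    <name type="primary">EPB41</name>
  </gene>
  <organism>
    <lineage>...</lineage>
  </organism>
  <comment type="subcellular location">...</comment>
  <sequence length="..." mass="...">MTTE...</sequence>
</entry>

Fields such as accession, gene name, organism lineage, subcellular location and isoform were pulled out of elements like these and routed into the corresponding staging tables.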


Transform: Once the data was available in the staging area, I ensured that it was on one platform and in one database. This ensures that we can execute basic queries such as sorting, filtering and joining tables. I also checked the data and cleaned it by adding or modifying data as per requirements. As shown in the figure below, the source table had plenty of inconsistent and incomplete data. I ensured the corresponding data was completely filled in and unwanted data was cleared. After all the data was prepared, I implemented slowly changing dimensions, which are needed to keep track of attributes that change over time and which help in analysis and reports.

Figure 2-3: Implementing Transformation on UniProt
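To make the slowly-changing-dimension step concrete, below is a minimal Type-2 sketch in SQL; the table and column names (SourceDimension_SCD, start_date, end_date, is_current) are hypothetical, not the project's actual schema:

-- Close out the current version of a source row whose tissue value changed
UPDATE SourceDimension_SCD
SET end_date = CURDATE(), is_current = 0
WHERE source_id = 101 AND is_current = 1;

-- Insert the new version of the row; the old version is kept for history
INSERT INTO SourceDimension_SCD
  (source_id, source_strain, source_tissue, start_date, end_date, is_current)
VALUES (101, 'Example strain', 'Liver', CURDATE(), NULL, 1);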


Load: Finally, the above data is loaded into the data warehouse, usually into fact and dimension tables, so that we can combine the data, aggregate it and load it into data marts to generate star schemas and/or cubes as necessary. What really happens when generating a star schema is that we extract the primary keys of the participating tables, or dimensions, put them into the fact table, give the fact table its own primary key, and give it a name that distinguishes it from the dimension tables. Once this is done, the dimension tables are linked to the fact table through the corresponding primary keys.
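A minimal sketch of this load step in SQL (the column names echo the LocationFact examples of Chapter 3, but the exact definition shown here is an assumption):

-- Fact table with its own surrogate key plus the dimensions' primary keys
CREATE TABLE LocationFact (
  location_fact_id INT AUTO_INCREMENT PRIMARY KEY,
  gene_id INT NOT NULL,          -- primary key taken from GeneDimension
  ID INT NOT NULL,               -- primary key taken from IsoformDimension
  subcell_location VARCHAR(64)   -- attribute recorded at this grain
);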


Chapter 3

DESIGNING STAR SCHEMA FOR UNIPROT

This chapter discusses the fundamentals of the star schema. We start by introducing the concept of a star schema and how it is useful to bioinformatics, and then describe our implementation on annotated protein sequence data, mapping dimension and fact tables to generate a star schema that can be used for analysis.

3.1 Introduction to Star Schema

A star schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema and contains one or more dimension tables and a fact table. It is called a star schema because an entity-relationship diagram between the dimension and fact tables resembles a star, with one fact table connected to multiple dimensions. The center of the star schema consists of a large fact table, which points toward the dimension tables.

3.2 Designing a Star Schema

We start by raising a real-life question: how do we make logical sense of the data stored in the database?

For example, we can ask some questions like:

• Which protein sequence was affected by Organism host?

• Where are the proteins and genes located?

• In what gene location did the lineage and isoform affect which gene?

Some of the above questions are common in biotechnology. In order to discover answers to such questions, we first need to know the design procedure for a star schema. To analyze the protein sequence data, we first identify the business process, then identify the facts or measures, then identify the dimensions for the facts and list the columns that describe each dimension. We conclude by determining the lowest level of summary in the fact table.

Most of the above questions ask for aggregated data, such as counts or sums, and not individual transactions. Finally, these questions are looked at through 'by' conditions, which refer to the data using some condition. Figuring out the aggregated values to be shown, such as protein sequence, gene location and gene, and then figuring out the 'by' conditions, drives the design of the star schema.

It is important to note that in a star schema every dimension has a primary key and a dimension table does not have any parent table. The hierarchies for a dimension are stored in the dimension table itself. When we examine data, we usually want to see some sort of aggregated data. These aggregates are called measures: numeric values that are measurable and additive. An example is accession in the entry table. We also need to look at measures using 'by' conditions, which are called dimensions. In order to examine accessions, most scientists or analysts want to see what entry keywords and sequences are obtained periodically [6].

3.2.1 Mapping Dimensions into Tables

A dimension table should have a single-field primary key. This is typically a surrogate key and is often just an identity column with an auto-incrementing number. The real information is stored in the other fields, since the primary key's value is meaningless. The other fields are called attributes and contain a full description of the dimension record. Dimension tables often contain large text fields. One of the greatest challenges in a star schema is the problem of changing dimensional data [6].


3.2.2 Dimensional Hierarchy

We build dimension tables by implementing the OLAP hierarchy, usually within a single dimension table. Storing the hierarchy in a dimension table allows for the easiest browsing of dimensional data.

For example, we have a Source table. If we create a dimension table for it, it will look something like what is shown below:

SourceDimension

Source_Id

source_strain

Source_tissue

Subcelllocation_id

Subcelllocation

Subcell_topology

Figure 3-1: Dimension Table of Source
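In SQL, this dimension might be declared as follows (a sketch assuming MySQL; the column names follow Figure 3-1, while the column types are assumptions):

CREATE TABLE SourceDimension (
  source_id INT AUTO_INCREMENT PRIMARY KEY, -- single-field surrogate key
  source_strain VARCHAR(64),
  source_tissue VARCHAR(64),
  subcelllocation_id INT,        -- hierarchy kept inside the dimension itself
  subcelllocation VARCHAR(64),
  subcell_topology VARCHAR(64)
);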

The Source table consists of tissue and strain. Basically, it shows the source that a protein gene will affect, whether that source is a tissue or a cell, and what the strain on the source is. A typical Source table looks as shown below:


Figure 3-2: Sample Data from Source Table

Storing the hierarchy in a dimension table allows for the easiest browsing of dimensional data. In the above example, users can easily choose a category and list all subcellular locations as per the required data. The example shows how a hierarchy is built within the dimension tables of a star schema. Browsing it uses the OLAP drill-down operation to choose individual locations from within the same table; there is no need to join to an external table for any of the hierarchical information.

The Gene dimension table is generated from the Gene table, whose sample data is shown below:

Figure 3-3: Sample Data from Gene Table


The Isoform dimension table is generated from the Isoform table, which contains the several different forms of the same protein that are produced from related genes. Sample data from the Isoform table is shown below, together with its structure.

Figure 3-4: Sample Data from Isoform Table


In the overly-simplified example 1, there are two dimension tables joined to the fact table. For now, the examples will use only one measure: HostLocation.

Figure 3-5: Star Schema Example 1


In order to see the location for a particular isoform for a particular lineage, a SQL query would look something like this:

SELECT subcell_location, isoform_id, isoform_name

FROM IsoformDimension INNER JOIN (GeneDimension INNER JOIN

LocationFact ON GeneDimension.gene_id = LocationFact.gene_id)

ON IsoformDimension.ID = LocationFact.ID

WHERE GeneDimension.gene_name='HLAA' AND IsoformDimension.isoform_id='P11171'

AND IsoformDimension.lineage_id=6


The sample output of the above query is shown below:

Isoform Subcellular location

isoform_id   isoform_name      subcell_location
P11171-1     1                 Membrane
P11171-2     2                 Cytoplasm
P11171-3     3                 Nucleus
P11171-4     Erythroid         lamellipodium
P11171-5     Non-erythroid A   filopodium
P11171-6     Non-erythroid B   growth cone
P11171-7     7                 synaptosome

Figure 3-6: Sample Output of the SQL Query 1

The fact table contains measures, often called facts. The facts are numeric and additive across some or all of the dimensions. Fact tables are generally long and skinny, while dimension tables are fat. A fact table can hold a number of records given by the product of the counts in all the dimension tables. When building a star schema, we must decide the granularity of the fact table. The granularity, or frequency, of the data is determined by the lowest level of granularity of each dimension table. The lower the granularity, the more records exist in the fact table. The granularity also determines how far users can drill down without returning to the base, transaction-level data.
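As an illustration of grain, a query that summarizes the fact table to one row per gene, a coarser level than the isoform-level base data, might look like this (a sketch using the table names of this chapter):

SELECT GeneDimension.gene_name,
       COUNT(*) AS location_records  -- fact rows rolled up per gene
FROM LocationFact INNER JOIN GeneDimension
  ON GeneDimension.gene_id = LocationFact.gene_id
GROUP BY GeneDimension.gene_name;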

The Entry table that we will use in the next star schema example consists of the gene sequence, keyword, and organism host name and its location. Entry table data is shown below:


Figure 3-7: Sample Data of Entry Table

Now let us look at another star schema, with three dimension tables and one fact table. The measure here again is host location, but this schema uses the Entry dimension table to gain access to the organism where the host is located.


Figure 3-8: Star Schema Example 2


Chapter 4

OLAP OPERATIONS IMPLEMENTED ON UNIPROT

This chapter discusses the OLAP operations implemented on UniProt. It introduces the concept of OLAP operations and their types, and briefly discusses the data cube along with examples and queries.

4.1 Introduction to Online Analytical Processing

OLAP (online analytical processing) is computer processing that enables a user to easily and selectively extract and view data from different points of view. For example, a user can request that data be analyzed to display a spreadsheet showing all of a company's beach ball products sold in Florida in July, compare revenue figures with those for the same products in September, and then see a comparison of other product sales in Florida in the same time period. To facilitate this kind of analysis, OLAP data is stored in a multidimensional database. Whereas a relational database can be thought of as two-dimensional, a multidimensional database considers each data attribute (such as product, geographic sales region, and time period) as a separate "dimension." OLAP software can locate the intersection of dimensions (all products sold in the Eastern region above a certain price during a certain time period) and display them. Attributes such as time periods can be broken down into sub-attributes. The main goal of OLAP is to support ad hoc but complex querying performed by business analysts. Since data is explored and aggregated in various ways, it was important to introduce an interactive process of creating, managing, analyzing and reporting on data that includes spreadsheet-like analysis to work with the huge amount of data in the data warehouse.

4.2 Types of OLAP Operations

OLAP systems use the following taxonomy.

Multidimensional OLAP (MOLAP) is the 'classic' form of OLAP. MOLAP stores data in an optimized multi-dimensional array rather than in a relational database. Thus, it requires the pre-computation and storage of information in the cube, an operation known as processing.

Relational OLAP (ROLAP) works directly with a relational database. The base data and the dimension tables are stored as relational tables, and new tables are created to hold the aggregated information, depending on a specialized schema design. This method manipulates the data stored in the relational database to give it a traditional OLAP view through slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.
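For example, a slice over the Chapter 3 star schema is simply a query that fixes one dimension member with a WHERE clause (a sketch):

SELECT isoform_name, subcell_location
FROM IsoformDimension INNER JOIN (GeneDimension INNER JOIN
LocationFact ON GeneDimension.gene_id = LocationFact.gene_id)
ON IsoformDimension.ID = LocationFact.ID
WHERE GeneDimension.gene_name = 'HLAA';  -- the slicing condition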

Comparing the two OLAP types, we can see that each has certain benefits, although providers disagree about the specifics:

• Some MOLAP implementations are prone to database explosion, a phenomenon causing vast amounts of storage space to be used by MOLAP databases when certain common conditions are met: a high number of dimensions, pre-calculated results and sparse multidimensional data.

• MOLAP generally delivers better performance due to specialized indexing and storage optimizations. MOLAP also needs less storage space compared to ROLAP because the specialized storage typically includes compression techniques.

• ROLAP is generally more scalable. However, large-volume pre-processing is difficult to implement efficiently, so it is frequently skipped; ROLAP query performance can therefore suffer tremendously.

• Since ROLAP relies more on the database to perform calculations, it has more limitations in the specialized functions it can use.

4.3 Data Cube

A data cube (OLAP cube or multi-dimensional cube) is a data structure that allows faster analysis of data. It also has the capability to manipulate and analyze data from multiple perspectives. The cube consists of numeric facts called measures, which are categorized by dimensions. The cube structure may be created from a star schema or snowflake schema of tables in the database. Measures are derived from records in the fact table, and dimensions are derived from the dimension tables. For the current UniProt project, we will consider a cube created from the star schema.


Figure 4-1: Front View of Data Cube

The above cube is used to represent data along some measure of interest. Although it is called a 'cube', it can be 2-dimensional, 3-dimensional or of higher dimension. Each dimension represents some attribute in the database, and the cells in the data cube represent a measure of interest; for example, we can count the number of times that an attribute combination occurs in the database, or take the minimum, maximum or sum of some attribute. Queries are performed on the cube to retrieve decision support information.


In the above example, we have three tables related to the gene, the organism where it resides, and the source tissue. The data cube formed from this is a 3-dimensional representation, with each cell of the cube representing a combination of values from organism, source and gene. The content of each cell is the count of the number of times that specific combination of values occurs together in the database. Cells that appear blank in fact have a value of zero. The cube can then be used to retrieve information from the database about which gene affects which organism and which specific source tissue is affected.
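In ROLAP terms, the cells of such a cube can be computed with a single aggregate query. The sketch below assumes hypothetical organism and source keys on the fact table; MySQL's WITH ROLLUP adds the aggregated margin cells of the cube:

SELECT g.gene_name, o.organism_name, s.source_tissue,
       COUNT(*) AS cell_count   -- one cube cell per (gene, organism, tissue)
FROM LocationFact f
INNER JOIN GeneDimension g ON g.gene_id = f.gene_id
INNER JOIN OrganismDimension o ON o.organism_id = f.organism_id
INNER JOIN SourceDimension s ON s.source_id = f.source_id
GROUP BY g.gene_name, o.organism_name, s.source_tissue WITH ROLLUP;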

Now let us consider another data cube example, in which we show the maximum value of three attributes: isoform, lineage and interactant. This shows how many times an isoform interacts with a lineage, using the interactant as a label. Basically, we list the isoform names that interact with a taxon through the listed names, and use the interactant labels to find the maximum number of interactions between taxon and isoform names. Because too many cells in the cube are filled with no data, valuable processing time is taken up by effectively adding up the zeros in the empty cells. This condition is called sparsity, and to overcome it we use linked cubes. For example, gene data may be available for all organisms and sources, but the location may not be available at this level of analysis. So instead of creating a sparse cube, it is sometimes better to create a separate but linked cube in which a subset of the data can be analyzed in great detail. The linking ensures that the data in the cubes remain consistent.


Figure 4-2: Data Cube

4.4 OLAP Operations

Common operations include slice and dice, drill down, roll up, and pivot. With OLAP, we can analyze multidimensional data from multiple perspectives. OLAP consists of three basic analytical operations: consolidation, drill-down, and slicing and dicing.

In consolidation, we implement aggregation of the data so that it can be accumulated and computed in one or more dimensions. Slicing and dicing is where users take out a specific set of data from the cube and view the slices from different viewpoints.


OLAP usually uses a multidimensional data model, so that complex analytical and ad hoc queries can be executed with a rapid response time. The core of any OLAP system is an OLAP cube (also called a 'multidimensional cube' or a hypercube). A cube consists of numeric facts called measures, which are categorized by dimensions. The cube metadata is typically created from a star schema or snowflake schema of tables in a relational database. Measures are derived from the records in the fact table, and dimensions are derived from the dimension tables. Each measure can be thought of as having a set of labels, or metadata, associated with it. A dimension is what describes these labels; it provides information about the measure.

OLAP Slicing:

A slice is a subset of a multi-dimensional array corresponding to a single value for one or more members of the dimensions not in the subset. For example, if the member Actuals is selected from the Scenario dimension, then the sub-cube of all the remaining dimensions is the slice that is specified. The data omitted from this slice would be any data associated with the non-selected members of the Scenario dimension, for example budget, variance, forecast, etc. From an end-user perspective, the term slice most often refers to a two-dimensional page selected from the cube [7].

OLAP Drill-up and drill-down:

Drilling down or up is a specific analytical technique whereby the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down). In example 1, if we have to drill down to a subcategory, the SQL would change to look like this:


SELECT subcellularlocation_id, isoform_id, isoform_name

FROM IsoformDimension INNER JOIN (GeneDimension INNER JOIN

LocationFact ON GeneDimension.gene_id = LocationFact.gene_id)

ON IsoformDimension.ID = LocationFact.ID

WHERE GeneDimension.gene_name='HIBADH' AND IsoformDimension.subcellloc_id = 37

AND IsoformDimension.lineage_id=6

Sample output of the above SQL query would be as shown below:

isoform_id   isoform_name   subcelllocation_id   subcell_location
P53353-1     FSA-Acr.1      37                   Secreted

Figure 4-3: Sample Output of the SQL Query 2


Chapter 5

TESTING

In this chapter, we discuss some test cases implemented on the data mart and the procedure that was followed while testing.

5.1 Test Cases

It is very important to effectively test a project for a successful implementation. For testing, unit testing and black-box testing methodologies are recommended for most projects. The test data set was generated by working with end users. The test files were used to check that the data was populated correctly and that extraction was done exactly as desired. The following information defines the test cases and how the results were documented.

Test #  Test Case

1  Test the continuous internet connection to ensure successful download of the XML file from the Universal Protein website.

2  Use HJSplit to split the 2.26 GB XML document. Check each split for loose ends; add the appropriate code to the beginning and end of each split.

3  Extract data from the first XML split file and subsequently follow with the others.

4  Since 2.26 GB of data is huge, I decided to extract a limited amount of data and truncated the file after getting 2,500 samples of protein sequence.

5  Check the data format of the XML file. This is important because if the data being extracted is not in the required XML format, it could lead to improperly distributed data.

6  Once the data is transformed and loaded into the database, check for consistent data.

7  Check for redundancy, clean the data, and perform database operations to optimize the data to handle the queries and load generated from the schemas.

8  Create new SQL queries to insert, update and select data from the database to generate the star schema and data cube. Specify primary keys and relationships.

9  Run SQL queries to populate the data that is to be used in the star schema and data cube.

Sample runs for each of the above test plans are covered by the star schema examples in Section 3.2 and by the data cube discussion in Section 4.3. The star schema section has two examples of how a query is used to generate a star schema, what data is generated, and what measures are used to identify the relationships. Similarly, the data cube section has an example of how three tables in the database are used to count the number of occurrences in the database, as well as the maximum number of times an occurrence takes place with an organism and gene.


Chapter 6

CONCLUSIONS

6.1 Summary

The main purpose of this project was to understand the working of data warehouses in bioinformatics, especially as related to protein sequences. In this project, we learned that research groups, analysts and lab technicians in organizations will greatly benefit from implementing technologies like the star schema, data cube and OLAP operations on the data in the data warehouse to obtain cohesive, analytical results. The test cases led us to approximations for the missing or biased aggregates of those cells that have missing or low support. The method we implement is adaptive to sudden changes of data distribution, called discontinuities, that inevitably occur in real-life data collected for the purpose of being analyzed. Since most of this data is collected to support ongoing research, it is usually called operational data. The data warehouse is used to collect and organize data for analysis, which can also be referred to as informational data, and to use OLAP. I integrated the protein sequence data with gene and source data because integration plays a vital role in the data warehouse: data gathered from a variety of sources is merged into a coherent whole database. This adds stability to the data stored in the data warehouse and makes it useful for users.

This project was developed to accumulate experimental knowledge of protein function, given the easy availability of protein sequence data. This enabled me to model protein sequences as per research group requirements and to trace the evolution of protein sequence function. The protein sequence data can be used in a protein classification service: proteins can be classified using protein sequences at the family and subfamily levels. The second application is an expression data analysis service, where functional classification information can help find biological patterns in data obtained from genome-wide experiments. The third application of this project is coding SQL queries for a single-nucleotide polymorphism scoring service; in this case, information about proteins is used to assess the likelihood of a deleterious effect from substituting a taxon or lineage at a specific position. The technologies used to implement the data warehouse, like the star schema and data cube, can be very beneficial for the above applications.

The coursework for data warehousing and data mining that I took under the expert guidance of Dr. Meiliu Lu was an enlightening and enriching experience. It helped me understand the goals and techniques used in data warehousing. The course helped me construct a data warehouse and understand design techniques for relational databases, such as star schema design and online analytical processing. I also learned how to implement a 3-dimensional data cube using multidimensional databases by creating and maintaining one. Through this coursework, I was motivated to implement a data mart on annotated protein sequence data and to use design techniques like the star schema, data cube and OLAP operations on the protein sequence data.

A data cube contains cells, each of which is associated with some summary information, or aggregate, on which decisions are to be based. However, in protein sequence databases, due to the nature of their contents, the data distribution tends to be clustered and sparse. The sparsity situation gets worse as the number of cells increases. It is necessary to acquire support for those cells that have support levels below a certain threshold by combining them with adjacent cells; otherwise, incomplete or biased results could be derived due to the lack of sufficient support.

The data often comes from OLTP systems but may also come from spreadsheets, flat files and other sources; in this case, the database came from an XML file. The data is formatted in such a way that it provides fast responses to queries. Star schemas provide fast responses by denormalizing dimension tables and potentially by providing many indexes. If we go through the protein sequence database, we find that 'Db Reference id' was used as an index to accelerate the fetching of data.
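In SQL this index amounts to a single statement (a sketch; the exact table and column names in the project may differ):

-- Accelerate lookups on the reference identifier
CREATE INDEX idx_dbreference_id ON Entry (dbreference_id);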

We implemented OLAP operations on the star schema to get more result-oriented data by implementing the data cube, also known as an OLAP cube. Once we have the query to generate data for the star schema, we can get the factual information stored in the database.

6.2 Strengths and Weaknesses

We briefly discuss the strengths and weaknesses of the star schema, OLAP operations and the data warehouse here. Most importantly, how an organization utilizes the DW and OLAP operations to effectively monitor and analyze its data is organization specific.

6.2.1 Strengths

The simplicity with which users can write and process queries is a very important benefit of using a star schema, since queries are written with simple inner joins between the fact table and a small number of dimensions. Star joins are simpler than the joins possible in the snowflake schema. By using 'WHERE' conditions we can filter on the attributes desired, and aggregation is quite fast.


Additionally, it provides a direct and intuitive mapping between the business entities being analyzed by end users and the schema design. For typical star queries, it provides highly optimized performance.

Furthermore, the star schema is widely supported by a large number of tools that anticipate or require that the data warehouse schema contain dimension tables.

One of the major benefits of the star schema is that low-level transactions may be summarized to the fact-table grain. This greatly speeds up the queries performed as part of the decision support process. However, the aggregation or summarization of the fact table is not always done if cubes are being built [8].

OLAP allows for the minimization of data entry. For each detail record, only the primary key value from the Source table is stored, along with the primary key of the Gene table, and then the subcellular location is added. This greatly reduces the amount of data entry necessary to add a record.

Not only does this approach reduce the data entry required, it also greatly reduces the size of a Source record. The records take up much less space with a normalized table structure, which means the table is smaller, helping speed up inserts, updates, and deletes.

In addition to keeping the table smaller, most of the fields that link to other tables are numeric. Queries generally perform much better against numeric fields than they do against text fields. Therefore, replacing a series of text fields with a numeric field can help speed up queries. Numeric fields also index faster and more efficiently.

With normalization, there are frequently fewer indexes per table. Each transaction requires the maintenance of the affected indexes; with fewer indexes to maintain, inserts, updates, and deletes run faster [9].

6.2.2 Weaknesses

In a star schema there is no relationship between two dimension tables; all dimensions are denormalized, which can degrade query performance. A star schema is also hard to design: it is easier on users but very hard for developers and modelers. Because the dimension tables are in denormalized form, performance can decrease and query response time can increase.

There are some disadvantages to OLAP when trying to analyze queries. Queries must use joins across multiple tables to get all the data, which makes them slow to read. When normalization is implemented, developers have no choice but to query from multiple tables to get the detail necessary for a report.

Having fewer indexes per table, which was listed as an advantage above, is sometimes a disadvantage too. Fewer indexes per table speed up inserts, updates, and deletes, but with fewer indexes in the database, select queries run slower. For data retrieval, a higher number of correct indexes helps speed up retrieval. Since a system that needs to speed up transactions minimizes the number of indexes, such a database trades faster transactions at the cost of slower data retrieval. Last but not least, the data in such a system is not user friendly. If an analyst wants to spend more time performing analysis by looking at the data, the IT group should support their desire for fast, easy queries, so it is important that we solve this problem, since data retrieval may be slow as a trade-off. We can solve it by keeping a second copy of the data in a structure reserved for analysis. This copy is heavily indexed, allowing analysts and customers to perform large queries against the data without impacting modifications on the main data [9].

6.3 Future Work

As part of future work, I would like to develop a tool or application that extracts data directly from the data warehouse to generate a star schema and data cube, providing the desired data according to user requirements. I would like to implement other OLAP operations, like pivot, dicing and slicing, in a more detailed data warehouse where they are implemented across multiple tables. It would be a great learning experience to prepare a general data warehouse that can benefit multiple organizations in bioinformatics, rather than being specific to protein sequences. Time permitting, I would also like to test the star schema more thoroughly to see how well it holds up against various user requirements when they change dynamically.


BIBLIOGRAPHY

[1] Oracle9i Data Warehousing Guide, Release 2 (9.2), Part Number A96520-01 [online] http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/concept.htm

[2] National Library of Medicine [online] http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html

[3] UniProt at the European Bioinformatics Institute [online] http://www.ebi.ac.uk/uniprot/index.html

[4] Howard Hamilton, Ergun Gurak, Leah Findlater, Wayne Olive, and James Ranson, "Knowledge Discovery in Databases" [online] http://www2.cs.uregina.ca/~hamilton/courses/831/notes/dcubes/dcubes.html

[5] Passionned Tools, "ETL Tools Comparison" [online] http://www.etltool.com/what-is-etl.htm

[6] Craig Utley, "Designing the Star Schema Database" [online] http://www.ciobriefings.com/Publications/WhitePapers/DesigningtheStarSchemaDatabase/tabid/101/Default.aspx

[7] OLAP Council, "OLAP and OLAP Server Definitions", January 1995 [online] http://altaplana.com/olap/glossary.html#SLICE

[8] Oracle9i Data Warehousing Guide, Release 2 (9.2), Part Number A96520-01 [online] http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/schemas.htm

[9] Katherine Drewek, "Data Warehousing: Similarities and Differences of Inmon and Kimball" [online] http://www.b-eye-network.com/view/743

[10] "Business Intelligence and Data Warehousing" [online] http://www.sdgcomputing.com/glossary.htm

[10] Business Intelligence and Data Warehousing [Online] http://www.sdgcomputing.com/glossary.htm