Data Warehousing on Uniprot in Annotated Protein
A DATA MART FOR ANNOTATED PROTEIN SEQUENCE EXTRACTED FROM UNIPROT DATABASE

Maulik Vyas
B.E., C.I.T.C, India, 2007

PROJECT

Submitted in partial satisfaction of the requirements for the degree of MASTER OF SCIENCE in COMPUTER SCIENCE at CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL 2011

A Project by Maulik Vyas

Approved by:
Meiliu Lu, Ph.D., Committee Chair
Ying Jin, Ph.D., Second Reader

Student: Maulik Vyas

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the Project.

Nikrouz Faroughi, Ph.D., Graduate Coordinator
Department of Computer Science

Abstract of A DATA MART FOR ANNOTATED PROTEIN SEQUENCE EXTRACTED FROM UNIPROT DATABASE by Maulik Vyas

Data warehouses are used by organizations to organize, understand, and use data, with the help of the provided tools and architectures, to make strategic decisions. A biological data warehouse such as the annotated protein sequence database is a subject-oriented, volatile collection of data related to protein synthesis used in bioinformatics. A data mart contains a subset of the enterprise data in a data warehouse that is of value to a specific group of users. I implemented a data mart, based on data warehouse design principles and techniques, on a protein sequence database using data provided by the Swiss Institute of Bioinformatics. While the data warehouse contains information about many protein sequence areas, the data mart focuses on one or more subject areas.
It brings together experimental results, computed features, and scientific conclusions by implementing a star schema and a data cube that support the data warehouse, making it easier for organizations to distribute data within a unit. This enables them to deploy the data, manipulate it, and develop the protein sequence data any way they see fit. The main goal of this project is to provide consistent, accurate annotated protein sequence data to a group of researchers working on protein sequences. I took a chunk of this data, extracted it from the warehouse, transformed it, and loaded it into a staging area. I used HJSplit to split the XML protein sequence data into equal parts and extracted information using an XML editor. I populated the database tables in Microsoft Access 2010 from the XML file. Once the database was set up, I used MySQL Workbench 5.2 CE to generate queries related to the star schema. Finally, I implemented a star schema, OLAP operations, a data cube, and drill-up and drill-down operations for strategic analysis of the protein sequence database based on SQL queries. This ensured explicit support for dimensions, aggregation, and long-range analysis.

DEDICATION

This project is dedicated to my beloved parents, Kirankumar Vyas and Jayshree Vyas, for their never-ending sacrifice, love, support, and understanding. I would also like to dedicate this to my loving wife, Tanvi Desai, for encouraging me to pursue a Master's in Computer Science and for being a pillar of support throughout.

ACKNOWLEDGMENTS

I would like to extend my gratitude to my project advisor, Dr. Meiliu Lu, Professor, Computer Science, for guiding me throughout this project and helping me complete it successfully. I am also thankful to Dr. Ying Jin, Professor, Computer Science, for reviewing my report. I am grateful to Dr.
Nikrouz Faroughi, Graduate Coordinator, Department of Computer Science, for reviewing my report and providing valuable feedback. In addition, I would like to thank the Department of Computer Science at California State University for extending me the opportunity to pursue this program and guiding me all the way to becoming a successful student. Lastly, I would like to thank my parents, Kirankumar Vyas and Jayshree Vyas, and my loving wife, Tanvi Desai, for providing moral support and encouragement throughout my life.

TABLE OF CONTENTS

Dedication
Acknowledgments
List of Figures
List of Abbreviations
Chapter
1. INTRODUCTION
   1.1 Introduction to Data Warehousing
   1.2 Introduction to Annotated Protein Sequence and UniProt
   1.3 Goal of the Project
2. COLLECTION AND ANALYSIS OF UNIPROT IN PROTEIN SEQUENCE
   2.1 Collecting UniProt in Protein Sequence
   2.2 Extract, Transform, Load (ETL)
3. DESIGNING STAR SCHEMA FOR UNIPROT
   3.1 Introduction to Star Schema
   3.2 Designing a Star Schema
      3.2.1 Mapping Dimensions into Tables
      3.2.2 Dimensional Hierarchy
4. OLAP OPERATIONS IMPLEMENTED ON UNIPROT
   4.1 Introduction to Online Analytical Processing
   4.2 Type of OLAP Operations
   4.3 Data Cube
   4.4 OLAP Operations
5. TESTING
   5.1 Test Cases
6. CONCLUSIONS
   6.1 Summary
   6.2 Strengths and Weaknesses
      6.2.1 Strengths
      6.2.2 Weakness
   6.3 Future Work
Bibliography

LIST OF FIGURES

Figure 1-1 Data Warehouse Architecture
Figure 2-1 Structure of ETL and Data Warehouse
Figure 2-2 Sample XML File During Extraction
Figure 2-3 Implementing Transformation on UniProt
Figure 3-1 Dimension Table of Source
Figure 3-2 Sample Data from Source Table
Figure 3-3 Sample Data from Gene Table
Figure 3-4 Sample Data from Isoform Table
Figure 3-5 Star Schema Example 1
Figure 3-6 Sample Output of the SQL Query 1
Figure 3-7 Sample Data of Entry Table
Figure 3-8 Star Schema Example 2
Figure 4-1 Front View of Data Cube
Figure 4-2 Data Cube
Figure 4-3 Sample Output of the SQL Query 2

LIST OF ABBREVIATIONS

DW: Data Warehouse
OLAP: Online Analytical Processing
ETL: Extract, Transform and Load
XML: Extensible Markup Language
OLTP: Online Transaction Processing

Chapter 1
INTRODUCTION

1.1 Introduction to Data Warehousing

A data warehouse (DW) is a methodological approach for organizing and managing a database used for reporting and analysis, providing the organization with trustworthy, consistent data for its applications. The data stored in the warehouse is uploaded from the operational systems.
The data may pass through an operational data store for additional operations before it is used in the DW for reporting. A data warehouse depicts data and its relationships by drawing a distinction between data and information [1]. The data is cleaned to remove redundancy, transformed into a compatible format, and then made available to managers and professionals handling data mining, online analytical processing (OLAP), market research, and decision support. Essential components of data warehousing include analyzing data, extracting data from the database, transforming data, and managing the data dictionary [2].

Figure 1-1: Data Warehouse Architecture

1.2 Introduction to Annotated Protein Sequence and UniProt

The protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq, and TPA, as well as records from SwissProt, PIR, PRF, and PDB. Protein sequences are the fundamental determinants of biological structure and function. Bioinformatics has revolutionized the biological industry by applying computer technology to the management of biological information. Today's computers are able to gather, store, analyze, and integrate biological information that can be applied to protein-based or gene-based drug discovery [3]. UniProt provides the scientific community with a high-quality, freely available resource of protein sequence and functional information. I will be using the Swiss Institute of Bioinformatics database of manually annotated protein sequences [4].

1.3 Goal of the Project

The goal of this project is to understand the interconnection of data warehousing with UniProt, which is a part of bioinformatics, and to carry out research on the current and potential applications of data warehousing to the available database. This project covers a star schema that can be applied to the database. Research issues that still need to be explored are discussed at the end of the project report.
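As a minimal sketch of the kind of star schema this goal refers to, the following Python snippet uses the standard sqlite3 module to build one fact table joined to two dimension tables and then aggregates over one dimension. The table and column names (dim_gene, dim_source, fact_entry, sequence_length) and all rows are invented for illustration. They are assumptions and do not reproduce the project's actual schema, which was built with Microsoft Access 2010 and MySQL Workbench 5.2 CE.

```python
import sqlite3


def build_star_schema(conn):
    """Create a toy star schema: one fact table and two dimension tables.

    All names and rows here are hypothetical, chosen only to illustrate
    the shape of a star schema.
    """
    conn.executescript("""
        CREATE TABLE dim_gene   (gene_id   INTEGER PRIMARY KEY, gene_name TEXT);
        CREATE TABLE dim_source (source_id INTEGER PRIMARY KEY, organism  TEXT);
        CREATE TABLE fact_entry (
            entry_id        INTEGER PRIMARY KEY,
            gene_id         INTEGER REFERENCES dim_gene(gene_id),
            source_id       INTEGER REFERENCES dim_source(source_id),
            sequence_length INTEGER
        );
    """)
    conn.executemany("INSERT INTO dim_gene VALUES (?, ?)",
                     [(1, "geneA"), (2, "geneB")])
    conn.executemany("INSERT INTO dim_source VALUES (?, ?)",
                     [(1, "organism X"), (2, "organism Y")])
    conn.executemany("INSERT INTO fact_entry VALUES (?, ?, ?, ?)",
                     [(1, 1, 1, 100), (2, 1, 2, 150), (3, 2, 1, 200)])


def entries_per_organism(conn):
    """Aggregate the fact table along the source dimension."""
    return conn.execute("""
        SELECT s.organism,
               COUNT(*)               AS n_entries,
               SUM(f.sequence_length) AS total_length
        FROM fact_entry AS f
        JOIN dim_source AS s ON s.source_id = f.source_id
        GROUP BY s.organism
        ORDER BY s.organism
    """).fetchall()


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
print(entries_per_organism(conn))
# prints [('organism X', 2, 300), ('organism Y', 1, 150)]
```

Grouping by a dimension attribute instead of by individual entries is the simplest form of the aggregation that OLAP operations build on. Joining dim_gene in the same way would give the equivalent summary per gene.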
This report is structured