A Web-Interface to CAP3 - A DNA Sequence Assembly Program That Processes Genetic Sequences

A WEB-INTERFACE TO CAP3- A DNA SEQUENCE ASSEMBLY PROGRAM THAT PROCESSES GENETIC SEQUENCES TO PRODUCE HIGH-QUALITY CONTIG SEQUENCES

By

SUSMITHA PABBARAJU

Bachelor of Technology, Jawaharlal Nehru Technological University, India, 2005

A REPORT

Submitted in partial fulfillment of requirements for the degree

MASTER OF SCIENCE

Department of Computing and Information Sciences

College of Engineering

KANSAS STATE UNIVERSITY

Manhattan, Kansas

2007

Approved by:

Major Professor

Dr. Daniel Andresen, Ph.D.

ABSTRACT

A DNA sequence is a succession of the letters A, C, G and T, representing the four nucleotide subunits. Any succession of more than four nucleotides can be called a sequence. An Expressed Sequence Tag (EST) is a low-quality sequence produced by sequencing cloned cDNA (complementary DNA). It represents a unique stretch of DNA that can be used to identify the full length of a gene. EST sequences sometimes include vector information, redundancy or incomplete transcripts. To remove these inconsistencies, ESTs are processed through a pipeline that consists of steps such as cleaning, clustering the sequences and assembling the clusters.

CAP3, the third generation of the Contig Assembly Program (CAP), is the assembly program used to assemble the DNA clusters in the pipeline. CAP3 reads EST sequences stored in FASTA format. Besides the sequence file, CAP3 accepts two optional files, a forward-reverse constraint file and a quality file, and produces assembly results in CAP format. Running CAP3 requires a sequence of commands. Users, typically bioinformatics researchers, need to remember these commands and execute them in order every time they input a sequence file.

The purpose of this project is to develop a user-friendly web-interface to the CAP3 assembly program so that users no longer need to remember the sequence of commands. The entire process of cleaning, processing and clustering an EST sequence is now done automatically on the web server, independent of the user’s interaction with CAP3. Once a user uploads the necessary files, a program thread on the web server supplies these files to CAP3 and executes, through a batch file, the sequence of commands necessary to invoke CAP3. Users can also check the execution status of the files they have uploaded.

The web-interface is developed using Microsoft ASP.NET and XML Web services, with Internet Information Services (IIS) as the web server. This report describes the implementation, the tools used, the testing performed, the results obtained and the conclusions drawn. It also briefly describes future work in this area.

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGEMENTS

CHAPTER 1 – INTRODUCTION

CHAPTER 2 – IMPLEMENTATION PLATFORM

2.1 Microsoft .NET

CHAPTER 3 – TOOLS AND TECHNOLOGIES USED

3.1 ASP.NET 2.0

3.2 VB.NET

3.3 Internet Information Services

3.4 Microsoft SQL Server 2005

3.5 Microsoft Visual Studio 2005

3.6 XML Web services

3.6.1 Web services components

3.6.2 Web services programming stack

3.6.2.1 Extensible Markup Language (XML)

3.6.2.2 Simple Object Access Protocol (SOAP)

3.6.2.3 Web Service Description Language (WSDL)

3.6.2.4 Universal Description, Discovery and Integration (UDDI)

3.6.3 Web Services Benefits

CHAPTER 4 – IMPLEMENTATION

4.1 System Architecture

4.2 Case Diagram

4.3 Class Diagram

4.4 Database Schema

4.5 Functionality

CHAPTER 5 – WEB-APPLICATION TESTING

5.1 Unit Testing

5.2 Load and Stress Testing

5.2.1 Screenshots of recorded page requests

5.2.2 Analysis of test results

5.2.2.1 Average response time

5.2.2.2 Hits per second

5.2.2.3 Throughput

5.2.2.4 HTTP Error rate

5.2.2.5 Distribution of Page response time

CHAPTER 6 – PROJECT METRICS AND EXPERIENCE

6.1 Project Metrics

6.2 Overall Experience

CHAPTER 7 – CONCLUSION AND FUTURE WORK

7.1 Conclusion

7.2 Future Work

REFERENCES

LIST OF FIGURES

Figure 1: An Overview of Microsoft .NET Architecture [5].

Figure 2: Fundamental operations of Web services architecture [13].

Figure 3: Web services programming stack [13].

Figure 4: System Architecture.

Figure 5: Use Case Diagram.

Figure 6: Class Diagram.

Figure 7: Database schema.

Figure 8: Screenshot of NUnit testcase execution.

Figure 10: Screenshot of file upload page recording.

Figure 11: Screenshot of View uploaded files request recording.

Figure 12: Screenshot of file status request recording.

Figure 13: Screenshot of file deletion request recording.

Figure 14: Average response time (seconds) for all pages.

Figure 15: Average response time (seconds) for all requests.

Figure 16: Number of hits on the server by users.

Figure 17: Throughput – KB of data returned by the server.

Figure 18: Error rate, in errors per sampling interval.

Figure 19: Distribution percentage of page response time.

LIST OF TABLES

Table 1: System Configuration.

Table 2: Project phases and their duration.

Table 3: Project Lines of code.

ACKNOWLEDGEMENTS

I would like to thank my major professor, Dr. Daniel Andresen, for his encouragement and guidance throughout the project. I especially acknowledge his patience in listening to my ideas and providing valuable input for the project.

I would also like to thank Dr. Doina Caragea for providing constant support in understanding the concepts that helped me establish a foundation for the project.

I would also like to thank Dr. Gurdip Singh for graciously agreeing to serve on my committee.

I would also like to thank Dr. Sue Brown, Dept. of Bio-Informatics Head, for supporting my work on this project, and Mr. Sanjay Chellapilla, Bioinformatics Specialist, Dept. of Bio-Informatics, for his help with the tools used in this project.

Finally, I would also like to thank my beloved family and friends for their words of encouragement that kept my spirits high through difficult times.


CHAPTER 1 – INTRODUCTION

Bioinformatics applies techniques from applied mathematics, statistics, computer science, artificial intelligence, chemistry and biochemistry to solve biological problems, usually at the molecular level. A DNA sequence, or genetic sequence, represents the primary structure of a real or hypothetical DNA molecule. It is a succession of letters, each denoting one of the four nucleotides A, G, C and T, with the capacity to carry information. Any succession of more than four nucleotides can be called a sequence. A sequence is derived from biological raw material through a process called DNA sequencing.

Genome assembly is a major area of current bioinformatics research, in which a large number of short DNA sequences are assembled to create a representation of the original chromosomes from which the DNA originated. This process involves aligning short sequences to one another and detecting all places where two of the short sequences overlap. Genome assembly is a very complex computational problem because genomes contain a large number of identical sequences, and a genome assembly algorithm is required to assemble such short sequences [1].

The shotgun sequencing strategy has been used widely in genome sequencing projects. In genetics, shotgun sequencing is a method for sequencing long DNA strands by assembling short sequences into long ones. These short sequences are represented as Expressed Sequence Tags (ESTs). They are mainly used to identify gene transcripts and play a significant role in gene sequence determination. An EST is a low-quality sequence produced by sequencing cloned cDNA (complementary DNA) [2]. It represents a unique stretch of DNA that can be used to identify the full length of a gene. An EST sometimes includes vector information, redundancy or incomplete transcripts. Higher-quality ESTs are produced by processing them through a pipeline. An EST pipeline consists of steps such as cleaning, clustering the sequences and assembling the clusters.
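The overlap detection step described above for genome assembly can be illustrated with a small sketch. The Python code below is only an illustration under simplifying assumptions: it looks for exact suffix-prefix matches between two reads, whereas a real assembler has to tolerate sequencing errors, so it should not be read as the method used by CAP3. The function names and the `min_len` parameter are hypothetical.

```python
from typing import Optional


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink until one matches.
    for length in range(max_len, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def merge_reads(a: str, b: str, min_len: int = 3) -> Optional[str]:
    """Merge two reads into one longer sequence if they overlap, else return None."""
    length = suffix_prefix_overlap(a, b, min_len)
    if length == 0:
        return None
    # Keep all of `a` and append only the part of `b` that extends past the overlap.
    return a + b[length:]
```

For example, the reads ACGTTGCA and TGCAGGTC share the four bases TGCA, so `merge_reads` joins them into a single longer sequence, while two reads with no common end return `None`.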

A number of sequence assembly programs have been developed previously. The Contig Assembly Program (CAP) is one such program; it is implemented in C and supports DNA shotgun sequencing by finding the shortest common superstring of a set of fragments. CAP3, the third generation of CAP, was developed later with improvements and new features. It uses efficient algorithms to identify and compute overlaps between sequences. It assembles DNA sequences and also performs, to an extent, cleaning and clustering of ESTs. CAP3 reads ESTs stored in FASTA format as input. FASTA is a text-based format, typically used in bioinformatics research, for representing nucleic acid or protein sequences, in which nucleotides or amino acids are represented using single-letter codes [3]. Besides the EST sequence file, CAP3 accepts two optional files, a forward-reverse constraint file and a quality file, and writes assembly results in CAP format to standard output, which needs to be redirected to a file.
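To make the FASTA input format concrete, the following Python sketch reads FASTA text into a dictionary of sequences. It assumes the common layout in which each record starts with a header line beginning with ">" followed by one or more lines of single-letter codes. The function name `parse_fasta` and the record identifiers in the usage note are hypothetical, and the sketch is not part of CAP3 or of the web-interface described in this report.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into a mapping from record identifier to sequence."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the identifier.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the current record, normalised to upper case.
            records[current_id] += line.upper()
    return records
```

A file containing the header ">EST001 hypothetical clone" followed by the lines ACGTAC and GTTA would yield one record with identifier EST001 and sequence ACGTACGTTA.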

Executing the CAP3 program on a given input file requires a set of commands executed in a particular order. Bioinformatics researchers, who typically use this program, need to remember these commands and their order of execution to extract the assembly results from CAP3’s run on an EST sequence. Also, the results generated by CAP3 do not contain any statistical information about the sequence that was processed.

The purpose of this project is to develop a user-friendly web-interface to the CAP3 assembly program so that users do not have to remember the set of commands that CAP3 requires to generate its output files. The entire process of cleaning, processing and clustering EST sequences is now done automatically on the web server, independent of the user’s interaction with CAP3. The user only has to upload the input files to the web server. Once the necessary files are uploaded, a program thread on the web server automatically invokes the CAP3 program with the uploaded files as input, executing the necessary commands through a batch file. Depending on the length of the uploaded file, the CAP3 execution may take anywhere from a few seconds to a couple of hours. Users can check the execution status of each file they have uploaded.

The next two chapters discuss the platform on which the project is implemented and the tools and technologies used in the implementation. Chapter 4 explains the implementation, architecture and functionality of the web-interface. Chapter 5 discusses testing and the results obtained for the web-interface. Chapters 6 and 7 present the project metrics and experience, the conclusions and future work.
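The flow described in this chapter, in which a server-side thread runs the assembler on an uploaded file, redirects its standard output to a file and records a status that users can query, can be sketched as follows. This is a minimal illustration in Python, not the ASP.NET and VB.NET code of this project. The `job_status` table and the `run_assembly` and `submit` helpers are hypothetical names, and the status values are assumptions.

```python
import subprocess
import threading
from pathlib import Path
from typing import Dict, List

# Hypothetical in-memory status table, keyed by job identifier.
job_status: Dict[str, str] = {}


def run_assembly(job_id: str, command: List[str], output_path: Path) -> None:
    """Run an assembler command, redirecting its standard output to a file."""
    job_status[job_id] = "running"
    try:
        with open(output_path, "w") as out:
            result = subprocess.run(command, stdout=out, stderr=subprocess.PIPE)
        job_status[job_id] = "completed" if result.returncode == 0 else "failed"
    except OSError:
        # Raised, for example, when the executable cannot be found.
        job_status[job_id] = "failed"


def submit(job_id: str, command: List[str], output_path: Path) -> threading.Thread:
    """Start the assembly on a background thread so the caller is not blocked."""
    job_status[job_id] = "queued"
    worker = threading.Thread(target=run_assembly, args=(job_id, command, output_path))
    worker.start()
    return worker
```

Assuming a `cap3` executable is on the path and takes the reads file as its first argument, a call such as `submit("job-1", ["cap3", "reads.fasta"], Path("reads.cap.out"))` would start the assembly, and `job_status["job-1"]` could then be polled to report progress. Running the work on a separate thread matters here because an assembly may take up to a couple of hours, far longer than a web request should stay open.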

CHAPTER 2 – IMPLEMENTATION PLATFORM

To implement this web-interface we need a programming platform that supports application development over the Internet. This chapter introduces the Microsoft .NET framework that is used for developing the web-interface.

2.1 Microsoft .NET

The .NET framework, introduced by Microsoft, is a new computing platform that simplifies application development in the highly distributed environment of the Internet. It is a common environment for building, deploying and running web-applications and web services. The design goals of the .NET framework are to unify programming models, support multiple programming languages, dramatically simplify the development and deployment of applications, provide a robust execution environment and natively support XML Web services. The main advantage of the .NET framework is that it bases all communication on industry standards, which ensures that applications implemented on the .NET framework can integrate with any other application [4]. The .NET architecture is shown in Figure 1.

The most important component of the .NET Framework lies within the Common Language Infrastructure (CLI). The purpose of the CLI is to provide a unified platform for application development and execution, including exception handling, garbage collection, security and interoperability. Microsoft’s implementation of the CLI is called the Common Language Runtime (CLR). The CLR is the foundation of the .NET Framework. Code management is the fundamental principle of the CLR. It works as an agent that manages code at execution time, providing core services such as memory management, thread management and remoting, while also enforcing strict type safety and other forms of code accuracy that ensure security and robustness. The CLR comprises four primary parts: the Common Type System (CTS), the Common Language Specification (CLS), the Just-In-Time (JIT) compiler and the Virtual Execution System (VES).

Figure 1: An Overview of Microsoft .NET Architecture [5].

The Common Type System defines how types are declared, used and managed at runtime, and is an important part of the runtime’s support for cross-language integration. Its function is to establish a framework that enables cross-language integration, type safety and high-performance code execution. It defines rules that languages must follow, which helps ensure that objects written in different languages can interact with each other [6]. The Common Language Specification (CLS) is a set of basic language features needed by many applications so that they can fully interact with other objects regardless of the language in which those objects were implemented. The CLS rules define a subset of the CTS and help ensure language interoperability [7]. Just-In-Time (JIT) compilation is a technique for improving the runtime performance of a computer program. In a JIT environment, the source code is first translated to an intermediate representation and deployed onto the target machine. When the code is executed, the runtime environment’s compiler translates it into native machine code; that is, the code is compiled just before it is executed. The Virtual Execution System (VES) provides an environment for executing managed code (code that is executed by the CLR), including support for a set of built-in data types, a set of control flow constructs and an exception handling model.

Another important component of the .NET Framework is the .NET Framework Class Library (FCL) which is a library of classes, interfaces and value types that are included in the .NET Framework SDK. This library provides access to system functionality and is designed to be the foundation on which .NET Framework applications, components, and controls are built.

In addition to the above features, the .NET Framework provides a security mechanism, called Code Access Security (CAS), that limits access to protected resources and operations. CAS defines permissions that represent the right to access various system resources, and it enforces restrictions on code at runtime by comparing the granted permissions of every caller on the call stack to the permissions that callers must have [8].
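The stack-walk comparison that CAS performs can be pictured with a small toy model. The sketch below is written in Python purely for illustration and does not use any .NET API. The `Caller` type, the `demand` function and the permission name `FileIO` are hypothetical; the only idea taken from the description above is that a demand succeeds only if every caller on the call stack has been granted the required permission.

```python
from typing import FrozenSet, List, NamedTuple


class Caller(NamedTuple):
    """One frame of a simulated call stack, with the permissions granted to its code."""
    name: str
    granted: FrozenSet[str]


def demand(call_stack: List[Caller], required: str) -> None:
    """Raise PermissionError unless every caller on the stack holds `required`."""
    # Walk from the most recent caller back to the outermost one.
    for caller in reversed(call_stack):
        if required not in caller.granted:
            raise PermissionError(f"{caller.name} lacks permission {required!r}")
```

In this model, a fully trusted library called by a fully trusted application passes a demand for `FileIO`, but the same library called through a less trusted component fails it, because the check considers the whole chain of callers and not only the code that directly touches the resource.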