A WEB-INTERFACE TO CAP3 – A DNA SEQUENCE ASSEMBLY PROGRAM THAT PROCESSES GENETIC SEQUENCES TO PRODUCE HIGH-QUALITY CONTIG SEQUENCES

SUSMITHA PABBARAJU

Bachelor of Technology, Jawaharlal Nehru Technological University, India, 2005

A REPORT

Submitted in partial fulfillment of requirements for the degree

MASTER OF SCIENCE

Department of Computing and Information Sciences

College of Engineering

KANSAS STATE UNIVERSITY

Manhattan, Kansas

Approved by:

Major Professor Dr. Daniel Andresen, Ph. D

ABSTRACT

A DNA sequence is a succession of the letters A, C, G and T, representing the four nucleotide subunits. Any succession of more than four nucleotides can be called a sequence. An Expressed Sequence Tag (EST) is a low-quality sequence produced by sequencing cloned cDNA (complementary DNA). It represents a unique stretch of DNA that can be used to identify the full length of a gene. EST sequences sometimes include vector information, redundancy or incomplete transcripts. To remove these inconsistencies, ESTs are processed through a pipeline that consists of steps such as cleaning, clustering and assembling the clusters.

CAP3, the third generation of the Contig Assembly Program (CAP), is an assembly program used to assemble the DNA clusters in the pipeline. CAP3 reads EST sequences stored in FASTA format. Besides the sequence file, CAP3 also takes two optional files, a forward-reverse constraint file and a quality file, and produces assembly results in CAP format. Running CAP3 requires a sequence of commands. Users, typically bioinformatics researchers, need to remember and execute these commands in order every time they input a sequence file.

The purpose of this project is to develop a user-friendly web-interface to the CAP3 assembly program so that users no longer need to remember this sequence of commands. The entire process of cleaning, processing and clustering an EST sequence is now done automatically on the web server, independent of the user's interaction with CAP3. Once a user uploads the necessary files, a program thread on the web server supplies these files to CAP3 and executes the sequence of commands necessary to invoke CAP3 through a batch file. Users can also check the execution status of the files they have uploaded.

The web-interface is developed using Microsoft ASP.NET and XML Web services, with Internet Information Services (IIS) as the web server. This report describes the implementation, the tools used, the testing performed, the results obtained and the conclusions drawn. It also briefly outlines future work in this area.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS

CHAPTER 1 – INTRODUCTION

CHAPTER 2 – IMPLEMENTATION PLATFORM
2.1 MICROSOFT .NET

CHAPTER 3 – TOOLS AND TECHNOLOGIES USED
3.1 ASP.NET 2.0
3.2 VB.NET
3.3 INTERNET INFORMATION SERVICES
3.4 MICROSOFT SQL SERVER 2005
3.5 MICROSOFT VISUAL STUDIO 2005
3.6 XML WEB SERVICES
3.6.1 Web services components
3.6.2 Web services programming stack
3.6.2.1 Extensible Markup Language (XML)
3.6.2.2 Simple Object Access Protocol (SOAP)
3.6.2.3 Web Service Description Language (WSDL)
3.6.2.4 Universal Description, Discovery and Integration (UDDI)
3.6.3 Web Services Benefits

CHAPTER 4 – IMPLEMENTATION
4.1 SYSTEM ARCHITECTURE
4.2 USE CASE DIAGRAM
4.3 CLASS DIAGRAM
4.4 DATABASE SCHEMA
4.5 FUNCTIONALITY

CHAPTER 5 – WEB-APPLICATION TESTING
5.1 UNIT TESTING
5.2 LOAD AND STRESS TESTING
5.2.1 Screenshots of recorded page requests
5.2.2 Analysis of test results
5.2.2.1 Average response time
5.2.2.2 Hits per second
5.2.2.3 Throughput
5.2.2.4 HTTP Error rate
5.2.2.5 Distribution of Page response time

CHAPTER 6 – PROJECT METRICS AND EXPERIENCE
6.1 PROJECT METRICS
6.2 OVERALL EXPERIENCE

CHAPTER 7 – CONCLUSION AND FUTURE WORK
7.1 CONCLUSION
7.2 FUTURE WORK

REFERENCES

LIST OF FIGURES

Figure 1: An Overview of Microsoft .NET Architecture [5]
Figure 2: Fundamental operations of Web services architecture [13]
Figure 3: Web services programming stack [13]
Figure 4: System Architecture
Figure 5: Use Case Diagram
Figure 6: Class Diagram
Figure 7: Database schema
Figure 8: Screenshot of NUnit testcase execution
Figure 10: Screenshot of file upload page recording
Figure 11: Screenshot of View uploaded files request recording
Figure 12: Screenshot of file status request recording
Figure 13: Screenshot of file deletion request recording
Figure 14: Average response time (seconds) for all pages
Figure 15: Average response time (seconds) for all requests
Figure 16: Number of hits on the server by users
Figure 17: Throughput – KB of data returned by the server
Figure 18: Error rate, in errors per sampling interval
Figure 19: Distribution percentage of page response time

LIST OF TABLES

Table 1: System Configuration
Table 2: Project phases and their duration
Table 3: Project Lines of code

ACKNOWLEDGEMENTS

I would like to thank my major professor, Dr. Daniel Andresen, for his encouragement and guidance throughout the project. I especially acknowledge his patience in listening to my ideas and his valuable input to the project.

I would also like to thank Dr. Doina Caragea for providing constant support in understanding the concepts that helped me establish a foundation for the project.

I would also like to thank Dr. Gurdip Singh for graciously agreeing to serve on my committee.

I would also like to thank Dr. Sue Brown, Dept. of Bio-Informatics Head, for supporting my work on this project, and Mr. Sanjay Chellapilla, Bioinformatics Specialist, Dept. of Bio-Informatics, for his help with the tools used in this project.

Finally, I would like to thank my beloved family and friends for their words of encouragement, which kept my spirits high through difficult times.

CHAPTER 1 – INTRODUCTION

Bioinformatics applies techniques from applied mathematics, statistics, computer science, artificial intelligence, chemistry and biochemistry to solve biological problems, usually at the molecular level. A DNA sequence, or genetic sequence, represents the primary structure of a real or hypothetical DNA molecule. It is a succession of letters, each standing for one of the four nucleotides A, G, C and T, with the capacity to carry information. Any succession of more than four nucleotides can be called a sequence. A sequence is derived from biological raw material through a process called DNA sequencing. In current bioinformatics research, genome assembly is a major area in which a large number of short DNA sequences are assembled to create a representation of the original chromosomes from which the DNA originated. This process involves aligning the short sequences to one another and detecting all places where two of them overlap. Genome assembly is a very complex computational problem because genomes contain large numbers of identical sequences, so a genome assembly algorithm is required to assemble such short sequences [1]. The shotgun sequencing strategy has been used widely in genome sequencing projects. In genetics, shotgun sequencing is a method for sequencing long DNA strands by assembling short sequences into long sequences. These short sequences are represented as ESTs, or Expressed Sequence Tags.

ESTs are mainly used to identify gene transcripts and play a significant role in gene sequence determination. An EST is a low-quality sequence produced by sequencing cloned cDNA (complementary DNA) [2]. It represents a unique stretch of DNA that can be used to identify the full length of a gene. An EST sometimes includes vector information, redundancy or incomplete transcripts. Higher-quality ESTs are produced by processing them through a pipeline. An EST pipeline consists of steps such as cleaning, clustering and assembling the clusters.

A number of sequence assembly programs have been developed previously. CAP, the Contig Assembly Program, is one such program. It is implemented in C and supports DNA shotgun sequencing by finding the shortest common superstring of a set of fragments. CAP3, the third generation of CAP, was developed later with improvements and new features. It uses efficient algorithms to identify and compute overlaps between sequences. It assembles DNA sequences and also performs, to an extent, cleaning and clustering of ESTs. CAP3 reads ESTs stored in FASTA format as input. FASTA is a text-based format, typically used in bioinformatics research, for representing nucleic acid or protein sequences, in which nucleotides or amino acids are represented using single-letter codes [3]. Besides the EST sequence file, CAP3 also takes two optional files, a forward-reverse constraint file and a quality file, and writes assembly results in CAP format to standard output, which needs to be redirected to a file.
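To make the FASTA input format concrete, the following Python sketch parses FASTA text into a dictionary of sequences. It is an illustration only and not part of the project's code: the function name parse_fasta and the sample records are hypothetical, and the sketch assumes the common convention that each record starts with a ">" header line followed by one or more sequence lines.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} mapping."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: use the first word after '>' as the identifier.
            parts = line[1:].split()
            current_id = parts[0] if parts else f"record_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the record that is currently open.
            records[current_id] += line.upper()
    return records


# Hypothetical sample input with two short reads.
example = """>EST001 hypothetical read
ACGTTGCA
GGCTA
>EST002
ttgacc
""".splitlines()

reads = parse_fasta(example)
```

Here reads maps each identifier to its full sequence, with multi-line sequences joined and letters normalized to upper case.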

Running CAP3 on a given input file requires a set of commands executed in a particular order. Bioinformatics researchers, who typically use this program, need to remember these commands and their order of execution to obtain the assembly results for an EST sequence. In addition, the results generated by CAP3 do not contain any statistical information about the assembly process.

The purpose of this project is to develop a user-friendly web-interface to the CAP3 assembly program so that users do not have to remember the set of commands to be executed for CAP3 to generate the output files. The entire process of cleaning, processing and clustering EST sequences is now done automatically on the web server, independent of the user's interaction with CAP3. The user only has to upload the input files to the web server. Once a user uploads the necessary files, a program thread on the web server automatically invokes the CAP3 program with the uploaded files as input. It also executes, through a batch file, the set of commands necessary for the execution. Depending on the size of the uploaded file, CAP3 execution may take anywhere from a few seconds to a couple of hours. Users can check the execution status of each file they have uploaded.
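The server-side invocation described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's batch file: the function name run_assembler is hypothetical, and the executable name in CAP3_COMMAND is assumed and depends on how CAP3 is installed. The sketch only shows the one detail the text confirms, namely that the program's standard output must be redirected to a file.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence

# Assumed executable name; adjust to the local CAP3 installation.
CAP3_COMMAND: List[str] = ["cap3"]


def run_assembler(command: Sequence[str], reads: Path, out_path: Path) -> int:
    """Run `command <reads>` and redirect its standard output to out_path.

    Returns the exit code of the process so the caller can record
    whether the job succeeded.
    """
    with out_path.open("w") as out:
        completed = subprocess.run(
            [*command, str(reads)],
            stdout=out,              # results go to the output file
            stderr=subprocess.PIPE,  # keep error text away from the results
            text=True,
        )
    return completed.returncode
```

A call such as run_assembler(CAP3_COMMAND, Path("reads.fasta"), Path("reads.cap.out")) would then play the role of the command sequence the user previously had to type by hand.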

The next two chapters discuss the platform on which the project is implemented and the tools and technologies used in the implementation. Chapter 4 explains the implementation, architecture and functionality of the web-interface. Chapter 5 discusses testing and the results obtained for the web-interface. The remaining chapters present conclusions and future work.

CHAPTER 2 – IMPLEMENTATION PLATFORM

To implement this web-interface we need a programming platform that supports application development over the Internet. This chapter introduces the Microsoft .NET framework that is used for developing the web-interface.

2.1 Microsoft .NET

The .NET framework, introduced by Microsoft, is a new computing platform that simplifies application development in the highly distributed environment of the Internet. It is a common environment for building, deploying and running web-applications and web services. The design goals of the .NET framework are to unify programming models, support multiple programming languages, dramatically simplify the development and deployment of applications, provide a robust execution environment, and natively support XML Web services. The main advantage of the .NET framework is that it bases all communication on industry standards, so that applications implemented on it can integrate with any other application [4]. The .NET architecture is shown in Figure 1.

The most important component of the .NET Framework is the Common Language Infrastructure (CLI). The purpose of the CLI is to provide a unified programming model for application development and execution, including exception handling, garbage collection, security and interoperability. Microsoft's implementation of the CLI is called the Common Language Runtime (CLR). The CLR is the foundation of the .NET Framework. Code management is its fundamental principle. It works as an agent that manages code at execution time, providing core services such as memory management, thread management and remoting, while also enforcing strict type safety and other forms of code accuracy that ensure security and robustness. The CLR comprises four primary parts: the Common Type System (CTS), the Common Language Specification (CLS), the Just-In-Time (JIT) compiler and the Virtual Execution System (VES).

Figure 1: An Overview of Microsoft .NET Architecture [5].

The Common Type System defines how types are declared, used and managed at runtime, and is an important part of the runtime's support for cross-language integration. The function of the CTS is to establish a framework that enables cross-language integration, type safety and high-performance code execution. It defines rules that languages must follow, which helps ensure that objects written in different languages can interact with each other [6]. The Common Language Specification (CLS) is a set of basic language features needed by many applications to interact fully with other objects regardless of the language in which they were implemented. The CLS rules define a subset of the CTS and help ensure language interoperability [7]. Just-In-Time (JIT) compilation is a technique for improving the runtime performance of a computer program. In a JIT environment, the source code is first translated to an intermediate representation and deployed onto the target machine. When the code is executed, the runtime's compiler translates it into native machine code, i.e. the code is compiled just before it is executed. The Virtual Execution System (VES) provides an environment for executing managed code, that is, code executed by the CLR, including support for a set of built-in data types, a set of control flow constructs and an exception handling model.

Another important component of the .NET Framework is the .NET Framework Class Library (FCL), a library of classes, interfaces and value types included in the .NET Framework SDK. This library provides access to system functionality and is designed to be the foundation on which .NET Framework applications, components and controls are built.

In addition to the above features, the .NET Framework provides a security mechanism that limits access to protected resources and operations. This mechanism is called Code Access Security (CAS). CAS defines permissions that represent the right to access various system resources, and it enforces restrictions on code at runtime by comparing the granted permissions of every caller on the call stack to the permissions that callers must have [8].
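The stack comparison described above can be illustrated with a small conceptual model. The Python sketch below is not the .NET API: the function demand and the permission names are hypothetical, and each caller on the stack is reduced to the set of permissions it has been granted. It only shows the rule that a demand succeeds when every caller holds the required permission.

```python
from typing import FrozenSet, List


def demand(required: str, call_stack: List[FrozenSet[str]]) -> bool:
    """Return True only if every caller on the stack was granted `required`.

    call_stack lists the granted permission set of each caller,
    from the outermost caller to the innermost one.
    """
    return all(required in granted for granted in call_stack)


# Hypothetical permission sets for two kinds of code.
trusted = frozenset({"FileIO", "Network"})
sandboxed = frozenset({"Network"})

allowed = demand("FileIO", [trusted, trusted])    # every caller has FileIO
denied = demand("FileIO", [trusted, sandboxed])   # one caller lacks FileIO
```

The second call fails even though the outermost caller is trusted, which is the property that prevents less trusted code from using trusted code to reach a protected resource.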

CHAPTER 3 – TOOLS AND TECHNOLOGIES USED

The tools and technologies used to build the web-interface are ASP.NET 2.0, VB.NET, IIS 5.0 Server, XML Web Services, Microsoft SQL Server 2005 and the Microsoft Visual Studio 2005 IDE. This chapter discusses these technologies in detail.

3.1 ASP.NET 2.0

ASP.NET 2.0 is the web-application framework developed by Microsoft that programmers can use to build dynamic websites, web-applications and XML web services. It is part of the Microsoft .NET framework and is the successor to Microsoft's Active Server Pages (ASP) technology. ASP.NET is built on the Common Language Runtime (CLR), meaning programmers can write ASP.NET code in any Microsoft .NET language. The services ASP.NET provides for building enterprise-class web-applications are: the page and controls framework, the ASP.NET compiler, security infrastructure, state-management facilities, application configuration, debugging support, the XML web services framework, an extensible hosting environment with application life cycle management, and an extensible designer environment [9]. ASP.NET 2.0 introduces several new server controls that enable powerful declarative support for data access, login security, wizard navigation, menus, tree views, etc. It also provides a new feature called Master Pages, with which programmers can define a common structure and interface elements for a site, such as the page header, footer and navigation menus, in a common location called a “master page” that is shared by many pages in the website. This improves the maintainability of the website and avoids unnecessary duplication of code for shared site structure or behavior. Besides the above features, ASP.NET 2.0 also provides Themes and Skins and Personalization, and has improved caching, performance and scalability.

3.2 VB.NET

The Microsoft Visual Basic .NET programming language is a high-level programming language for the Microsoft .NET Framework. Although it is designed to be an approachable and easy-to-learn language, it is powerful enough to satisfy the needs of experienced programmers. Visual Basic .NET provides strongly typed semantics that perform all type checking at compile time and disallow runtime binding of method calls. This guarantees maximum performance and also ensures that type conversions are correct. This is useful when building production applications in which speed and correctness of execution are important [10].

3.3 Internet Information Services

Internet Information Services (IIS) is a web server included in the Windows operating system that provides World Wide Web publishing services, File Transfer Protocol (FTP) services, Simple Mail Transfer Protocol (SMTP) services and Network News Transfer Protocol (NNTP) services. It provides a highly reliable and manageable infrastructure for web-applications. It is a high-performance, secure and extensible web server provided by Microsoft [11]. IIS 5.0 is built around the features and capabilities needed to deliver web-applications in an increasingly Internet-centric business environment. It is easy to install and maintain, and has features that make it reliable and better performing.

3.4 Microsoft SQL Server 2005

Microsoft SQL Server 2005 is a relational database management and analysis system for e-commerce, line-of-business and data warehousing solutions. It is Microsoft's next-generation data management and analysis software, delivering increased scalability, availability and security to enterprise data and analytical applications while making them easier to create, deploy and manage [12]. Its primary query language is Transact-SQL, an implementation of the ANSI/ISO standard SQL.

3.5 Microsoft Visual Studio 2005

Microsoft Visual Studio 2005 is Microsoft's flagship software development product for computer programmers. It centers on an Integrated Development Environment that lets programmers create standalone applications, web sites, web-applications and web services that run on any platform supported by Microsoft's .NET framework. Visual Studio includes Visual Basic .NET, Visual C++, Visual C#, Visual J# and ASP.NET.

3.6 XML Web services

Successful IT systems increasingly require interoperability across platforms and flexible services that can easily evolve over time. This has led to the prevalence of XML as the universal language for representing and transmitting structured data independently of programming language, software platform and hardware. Built on the broad acceptance of XML, Web services are applications that use standard transports, encodings and protocols to exchange information. With broad support across vendors and businesses, Web services enable computer systems on any platform to communicate over corporate intranets, extranets and the Internet, with support for end-to-end security, reliable messaging, distributed transactions and management. Web services are based on a core set of standards that describe the syntax and semantics of software communication: XML provides the common syntax for representing data; the Simple Object Access Protocol (SOAP) provides the semantics for data exchange; and the Web Services Description Language (WSDL) provides a mechanism to describe the capabilities of a Web service.

3.6.1 Web services components

Several essential activities need to happen in any service-oriented environment: a web service needs to be created, and its interfaces and invocation methods must be defined; a web service needs to be published to one or more intranet or Internet repositories for potential users to locate; a web service needs to be located in order to be invoked by potential users; a web service needs to be invoked to be of any benefit; a web service may need to be unpublished when it is no longer available. Web services architecture therefore requires three fundamental operations: publish, find and bind. Service providers publish services to a service broker. Service requesters find required services using a service broker and bind to them. These operations are shown in Figure 2.
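The publish, find and bind operations can be illustrated with an in-memory stand-in for a service broker. The Python sketch below is conceptual only: the class ServiceBroker and the JobStatus service are hypothetical names, and a real broker would be a network registry rather than a dictionary.

```python
from typing import Callable, Dict


class ServiceBroker:
    """A toy in-memory registry that mimics a service broker."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[..., object]] = {}

    def publish(self, name: str, endpoint: Callable[..., object]) -> None:
        """Provider side: make a service discoverable under a name."""
        self._registry[name] = endpoint

    def unpublish(self, name: str) -> None:
        """Provider side: withdraw a service that is no longer available."""
        self._registry.pop(name, None)

    def find(self, name: str) -> Callable[..., object]:
        """Requester side: look up a service by name."""
        try:
            return self._registry[name]
        except KeyError:
            raise LookupError(f"no service published as {name!r}") from None


broker = ServiceBroker()
broker.publish("JobStatus", lambda job_id: f"job {job_id}: processing")  # publish
service = broker.find("JobStatus")                                       # find
result = service(42)                                                     # bind and invoke
```

After broker.unpublish("JobStatus"), a further find raises LookupError, which corresponds to a requester failing to locate a withdrawn service.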

3.6.2 Web services programming stack

Web services are composed mainly of four components: Extensible Markup Language (XML), Simple Object Access Protocol (SOAP), Web Services Description Language (WSDL) and Universal Description, Discovery and Integration (UDDI). These technologies form the programming stack of the Web services architecture, which is shown in Figure 3.

Figure 2: Fundamental operations of Web services architecture [13].

Figure 3: Web services programming stack [13].

3.6.2.1 Extensible Markup Language (XML)

XML is a simple and flexible text format originally designed to meet the challenges of large-scale electronic publishing. The XML specification describes a class of data objects called XML documents and, in part, the behavior of the computer programs that process them. XML documents are made up of storage units called entities, which contain either parsed or unparsed data; parsed data consists of characters, some of which form markup that describes the document's storage layout and logical structure. XML imposes constraints on this layout and logical structure, and it allows data to be structured and self-describing. This makes it easier to exchange data between different platforms.
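As a small illustration of self-describing data, the Python sketch below parses a short XML document with the standard library. The element names (job, filename, status) are hypothetical and are not taken from the project; they only show how markup labels each piece of data so that any platform can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the markup names each value it carries.
document = """<job id="17">
  <filename>reads.fasta</filename>
  <status>processing</status>
</job>"""

root = ET.fromstring(document)

# Read the attribute and child elements by name.
job = {
    "id": root.get("id"),
    "filename": root.findtext("filename"),
    "status": root.findtext("status"),
}
```

The resulting dictionary holds the three values as strings, recovered purely from the names in the markup.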

3.6.2.2 Simple Object Access Protocol (SOAP)

SOAP is a simple XML-based protocol for exchanging data over HTTP or other computer network protocols. SOAP is platform independent and can be used to send messages over various Internet protocols. SOAP supports several messaging patterns, of which the most popular is the Remote Procedure Call (RPC). These features make Web services well suited to interoperability.
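A SOAP message is an XML document with an Envelope element that contains a Body carrying the call. The Python sketch below builds such a request with the standard library. The envelope namespace is the SOAP 1.1 one; the service namespace, the CheckStatus operation and the jobID parameter are hypothetical and only stand in for whatever a real service's description would define.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/cap3"                 # hypothetical service namespace


def build_request(operation: str, params: dict) -> str:
    """Build an RPC-style SOAP request as an XML string."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The operation element names the remote procedure being called.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_request("CheckStatus", {"jobID": 17})
```

In practice this string would be sent as the body of an HTTP POST to the service endpoint, and the response would be a similar envelope carrying the return value.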

3.6.2.3 Web Service Description Language (WSDL)

WSDL is an XML-based language used to describe network-based Web services and how to access them. It is used in combination with SOAP and XML to provide Web services over the Internet. A WSDL document specifies the location of the service and the operations that the service exposes. The supported operations and messages are described abstractly and then bound to a concrete network protocol and message format. Clients that want to access a Web service can use SOAP to call one of the functions listed in its WSDL description.

3.6.2.4 Universal Description, Discovery and Integration (UDDI)

UDDI is a platform-independent, XML-based registry in which IT systems list their Web services on the Internet. The UDDI specification defines a SOAP-based Web service for locating Web services and programmable resources on a network. UDDI provides a foundation for developers and administrators to readily share information about internal services across the enterprise and public services on the Internet. A Web service provider exposes selected application functionality for others to share. The provider can use the UDDI publishing methods to enable consumers to find the functionality. The consumer can use the UDDI discovery methods with a UDDI server by following these steps: discover Web services with the desired functionality on a UDDI server on an intranet, extranet or the Internet; retrieve a WSDL description using the information found on the UDDI server; create a client application, which uses SOAP messaging, from the WSDL service description; run the client application, which connects to the Web service.

3.6.3 Web Services Benefits

Use of the Web Services architecture provides the following benefits: promotes interoperability by minimizing the requirements for shared understanding; enables just-in-time integration; reduces complexity by encapsulation; enables interoperability of legacy applications.

CHAPTER 4 – IMPLEMENTATION

The purpose of this project is to develop a web-interface to the CAP3 assembly program, a DNA sequence assembly program that produces high-quality ESTs (Expressed Sequence Tags) from inconsistent ESTs (as discussed in Chapter 1).

Using the interface, users, typically bioinformatics researchers, upload multiple files to the web server, where the CAP3 program is invoked automatically using a thread for each file uploaded by each user. Each file is referred to as a job, and multiple jobs are processed at a time, each in a separate thread assigned by the .NET runtime. CAP3 may take from a few seconds to a couple of hours to process a job and produce the output files. When users log in to the web-application, they are shown the status of each job, marked with different colors.

The application is designed using ASP.NET, and the details of the uploaded files and user login information are stored in the SQL Server 2005 database. Web services are implemented to authenticate users, supply the uploaded files to the CAP3 program, invoke individual threads for each job, and check the processing status of a job.
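The job model described above, one background thread per uploaded file with a status that users can query, can be sketched in Python as follows. This is a conceptual illustration rather than the project's VB.NET code: the class JobRunner, the status names (queued, processing, finished, failed) and the pool size are assumptions made for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


class JobRunner:
    """Run each submitted file as a background job and track its status."""

    def __init__(self, process: Callable[[str], None], max_workers: int = 4) -> None:
        self._process = process  # the long-running work, e.g. an assembler call
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._status: Dict[int, str] = {}
        self._lock = threading.Lock()
        self._next_id = 1

    def submit(self, path: str) -> int:
        """Assign a job ID to the file and start processing in the background."""
        with self._lock:
            job_id = self._next_id
            self._next_id += 1
            self._status[job_id] = "queued"
        self._pool.submit(self._run, job_id, path)
        return job_id  # returns immediately; the caller does not wait

    def _run(self, job_id: int, path: str) -> None:
        self._set(job_id, "processing")
        try:
            self._process(path)
            self._set(job_id, "finished")
        except Exception:
            self._set(job_id, "failed")

    def _set(self, job_id: int, status: str) -> None:
        with self._lock:
            self._status[job_id] = status

    def status(self, job_id: int) -> str:
        with self._lock:
            return self._status[job_id]

    def wait(self) -> None:
        """Block until all submitted jobs are done (useful in tests)."""
        self._pool.shutdown(wait=True)
```

Because submit returns as soon as the job is queued, a second user's upload is accepted while the first user's job is still running, which is the isolation the design aims for.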

4.1 System Architecture

The system design for the web-application that interfaces CAP3 is shown in Figure 4.

Users log in to the application through a client GUI, specifically a web browser that supports HTML, and upload the files that are to be processed. Each uploaded file is then stored on the web server with a job ID assigned to it automatically. The file details and the details of the user who uploaded the file are stored in the database. Jobs are then assigned to the CAP3 program, along with some other files, through an active thread for processing. CAP3 generates a set of output files after processing these files, which may take from a few seconds to a couple of hours depending on the size of the uploaded file.

Figure 4: System Architecture. [Diagram components: client GUI (web browser that supports HTML and JavaScript); function calls to code-behind and XML Web service calls; IIS web server hosting the VB.NET business logic and Web services; thread pool invoking the CAP3 assembly program; ADO.NET access to the SQL Server database containing user and job details.]

4.2 Use Case Diagram

The Use Case Diagram for the implementation is shown in Figure 5. The user can register for an account, log in to the web-application, edit a personal profile, upload files, view a summary of uploaded files, search for a particular file, check the status of a file that is being processed, delete files, etc.

Figure 5: Use Case Diagram.

4.3 Class Diagram

A class diagram consists of a group of classes and interfaces and the relationships between them. Classes in a class diagram are related to each other hierarchically, as parent classes and their child classes. A class diagram thus represents the collaborations and relationships among classes and interfaces. The class diagram for this application is shown in Figure 6. All the business logic of the application is divided into classes, each with its own attributes and functions. This business logic includes creating and editing a user profile, editing a password, recovering a forgotten password, uploading, storing and deleting a file, viewing the upload summary, searching for uploaded files and checking the processing status of a file.

UploadFile: attributes username, password, filepath; operations uploadFile(), storeFile()
CreateUpdateProfile: attributes username, password, email, firstname, lastname, middleinitial; operations signupUser(), updateProfile()
DeleteFile: attributes username, password, jobID, filename; operations deleteFile()
Database: operations SearchFiles(), RetrieveFilePath(), UpdateStatus(), InvokeCap3()
CheckStatus: attributes username, password; operations findFiles(), retrieveStatus()
EditPassword: attributes username, password, newpassword; operations editPassword()
ViewUploads: attributes username, password; operations retrieveJobs(), displayStatus()
ForgotPassword: attributes username, email; operations returnPassword()
SearchFiles: attributes username, password, filename; operations findFile(), displayFile()

Figure 6: Class Diagram.

4.4 Database Schema

Details about user accounts, user information, the files uploaded by each user and the job ID of each job are stored in the SQL Server database. These details are stored as tables that are related to one another through primary key-foreign key relationships. The database schema is shown in Figure 7. Relationships between tables are indicated by a solid arrow, with one end pointing towards the primary key in the master table and the other end pointing towards the foreign key in the child table.

UserAccount: userID (PK), username, password
UserInfo: userID (PK), firstname, lastname, middleinitial, email
JobInfo: jobID (PK), jobname, filename, filesize, uploadtime, userID

Figure 7: Database schema.

The entities in the database schema are UserAccount, which contains data about online accounts such as username, password and userID; UserInfo, which contains the user's personal information such as firstname, lastname, middleinitial and email; and JobInfo, which contains details about the job created for each uploaded file, such as jobname, jobID, filename, filesize, uploadtime and userID.
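The three tables can be sketched as SQL definitions. The Python example below uses the standard library's SQLite module as a stand-in for SQL Server so that it runs anywhere. Table and column names come from the schema above; the column types, the constraints and the choice of UserAccount as the master table for both foreign keys are assumptions made for the sketch.

```python
import sqlite3

# Column types and constraints are assumed; only the names come from the schema.
SCHEMA = """
CREATE TABLE UserAccount (
    userID   INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE,
    password TEXT NOT NULL
);
CREATE TABLE UserInfo (
    userID        INTEGER PRIMARY KEY REFERENCES UserAccount(userID),
    firstname     TEXT,
    lastname      TEXT,
    middleinitial TEXT,
    email         TEXT
);
CREATE TABLE JobInfo (
    jobID      INTEGER PRIMARY KEY,
    jobname    TEXT,
    filename   TEXT,
    filesize   INTEGER,
    uploadtime TEXT,
    userID     INTEGER NOT NULL REFERENCES UserAccount(userID)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript(SCHEMA)

# Hypothetical sample rows.
conn.execute(
    "INSERT INTO UserAccount (userID, username, password) VALUES (?, ?, ?)",
    (1, "alice", "placeholder-hash"),
)
conn.execute(
    "INSERT INTO JobInfo (jobID, jobname, filename, filesize, uploadtime, userID) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    (17, "job-17", "reads.fasta", 2048, "2008-01-01T12:00:00", 1),
)

# List the jobs that belong to one user via the foreign key.
jobs = conn.execute(
    "SELECT jobname, filename FROM JobInfo WHERE userID = ?", (1,)
).fetchall()
```

With foreign keys enabled, inserting a JobInfo row whose userID has no matching UserAccount row is rejected, which is the integrity guarantee the primary key-foreign key relationship provides.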

4.5 Functionality

The web-application provides the following functionality to the user:

Set up an Online Account – New users who want to use the functionality provided by the web-interface need to register for an online account so that a separate folder, in which the uploaded files are stored, is created for each user.

Log in to the Account – Registered users can log in to their accounts to access the functionality the interface provides for accessing CAP3.

Edit Profile – Registered users can edit their profile, for example to change their email, username or password.

Upload files – Users can upload the input files to be processed by the CAP3 assembly program, which resides on the web server. Whenever a user uploads a file, a job with a jobID is assigned to that file and the file is stored in a folder named after the jobID. This folder is stored inside a separate folder assigned to the user on the web server. A thread is then invoked automatically and assigns the job, i.e. the input file, to CAP3 for processing. Each job has a separate thread running in the background on the web server. Users can upload multiple files at a time, and each upload starts a thread on the server. Processing a file may take from a few seconds to a couple of hours depending on its size. If jobs were processed one after another, other users would have to wait until CAP3 finished processing the current job before their own files could be uploaded and processed. Running each job in a separate thread isolates the processing of each file and lets other users upload their files without any wait time.

Check processing status – Once users have uploaded the files and CAP3 starts processing them, they can check the processing status any time they want by logging in to the web-application.

View Upload Summary – Users can keep track of the files they have uploaded by viewing the upload history when they log in to the application. With this feature they do not have to remember the names of the files they have uploaded, and they can easily identify which output files were generated for which input files.

Search for uploaded files – Users might want to search for a previously uploaded file, open it, make some changes, and re-upload it to the server for a new CAP3 run. This feature supports that requirement.

Delete files – Users can delete the uploaded files from the web server. Once the user deletes a file, the entire job folder that has the uploaded file is deleted permanently.
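The upload step described above (one job folder per upload inside the user's folder, and one background thread per job) can be sketched as follows. This is a minimal illustration in Python rather than the application's VB.NET code; the names upload, process_job and run_assembler are hypothetical, and run_assembler is only a placeholder for the real CAP3 invocation.

```python
import tempfile
import threading
from pathlib import Path

status = {}                      # jobID -> "processing" or "done"
status_lock = threading.Lock()   # protects the shared status dictionary

def run_assembler(job_dir, filename):
    # Placeholder for the real CAP3 run (hypothetical): writes an empty result file.
    (job_dir / (filename + ".out")).write_text("")

def process_job(job_id, job_dir, filename):
    # Runs in the job's own thread, so a long job does not block other uploads.
    run_assembler(job_dir, filename)
    with status_lock:
        status[job_id] = "done"

def upload(root, user_id, job_id, filename, data):
    # Store the file under <root>/<userID>/<jobID>/ and start one thread for the job.
    job_dir = Path(root) / str(user_id) / str(job_id)
    job_dir.mkdir(parents=True)
    (job_dir / filename).write_bytes(data)
    with status_lock:
        status[job_id] = "processing"
    worker = threading.Thread(target=process_job, args=(job_id, job_dir, filename))
    worker.start()
    return worker

# Usage: three uploads by the same (hypothetical) user are processed concurrently.
root = tempfile.mkdtemp()
workers = [upload(root, 7, job_id, "reads.fasta", b">r1\nACGT\n") for job_id in (1, 2, 3)]
for w in workers:
    w.join()
```

The shared status dictionary is the kind of record the "Check processing status" feature would read; in the application itself this information would come from the server rather than from an in-memory dictionary.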

CHAPTER 5 – WEB-APPLICATION TESTING

This section discusses the various tests performed on the web-application with a set of test scenarios. Results are analyzed and compared against some performance metrics.

Testing is done on a local machine with the following system configuration:

Processor: Intel Pentium (M)
Processor Speed: 1.66 GHz
Physical Memory: 1.00 GB RAM
Operating System: Windows XP Professional

Table 1: System Configuration.

5.1 Unit Testing

Unit testing is the procedure of validating that individual units of source code work properly. It is a common approach in software development for verifying that the business logic correctly implements the client’s requirements. NUnit is a unit-testing framework for all Microsoft .NET languages and is used here to test the functional units of the source code. It is written entirely in C# and has been completely redesigned to take advantage of many Microsoft .NET language features, for example custom attributes and other reflection-related capabilities [14]. The nunit.exe program is a graphical runner. It shows the test cases in an explorer-like browser window and provides a visual indication of the success or failure of the tests. It allows you to selectively run single tests or suites, and it reloads automatically as you modify and re-compile your code.

The following test cases were executed against the web application using NUnit:

• TestLoginWebservice – Tests the functional unit in the application that is responsible for authenticating users who have set up an online account with the application. This functional unit checks for a valid userID and password combination that lets the user log in to the web interface, preventing unauthenticated access to the application.

• TestUserInformation – Tests the functional unit that is responsible for fetching user information such as first name, last name, and email, and displaying it to the user. This test ensures that a user who logs in to the web application is shown his or her own profile information.

• TestCheckUsernameAvail – Tests the functional unit that is responsible for checking against the database whether the username chosen during online account registration is available. New users who set up an online account need to provide a username for the account. This username should be unique for each user, ensuring that there are no duplicate usernames in the database.

• TestUserforJob – Tests the functional unit that is responsible for storing the uploaded file in the corresponding user folder on the web server. When a user uploads a file to the server, a job ID is assigned to it and a folder named after the job ID is created inside the user folder on the web server. The file is then stored inside the job ID folder for that user. Therefore, the function should ensure that the uploaded file goes into the job ID folder created for it, inside the corresponding user folder.

• TestJobsforUser – Tests the functional unit that is responsible for displaying the list of files users have uploaded previously. Since users should not have to remember all the details of an uploaded file, such as file name, file size, file extension, and additional constraint files, the application should consistently maintain these details in the database and supply them to the CAP3 program.

• TestVerifyStatus – Tests the function that is responsible for verifying the processing status of each file uploaded by the user. Since CAP3 takes from a few seconds to a couple of hours to process an input file, the processing status shown to the user must be consistent with the actual state of CAP3 processing.

• TestSearchfile – Tests the functional unit that is responsible for searching the web server for a specific file the user has uploaded. This module searches for the file inside the user folder on the server and, if it is found, displays it so that the user can open, view and edit the contents of the file.
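The shape of two of the test cases above (the username-availability check and the login check) can be illustrated with a small sketch. The report's tests are written with NUnit against the real application; the sketch below instead uses Python's unittest module, and AccountStore is a hypothetical in-memory stand-in for the account table, so none of the names below are taken from the project's code.

```python
import unittest

class AccountStore:
    """Hypothetical in-memory stand-in for the UserAccount table."""

    def __init__(self):
        self._passwords = {}

    def username_available(self, username):
        # A username is available only if no account already uses it.
        return username not in self._passwords

    def register(self, username, password):
        if not self.username_available(username):
            raise ValueError("username already taken")
        self._passwords[username] = password

    def login(self, username, password):
        # Succeeds only for a known username with the matching password.
        return self._passwords.get(username) == password


class TestAccounts(unittest.TestCase):
    def setUp(self):
        self.store = AccountStore()
        self.store.register("alice", "secret")

    def test_check_username_avail(self):
        self.assertFalse(self.store.username_available("alice"))
        self.assertTrue(self.store.username_available("bob"))

    def test_login(self):
        self.assertTrue(self.store.login("alice", "secret"))
        self.assertFalse(self.store.login("alice", "wrong"))
        self.assertFalse(self.store.login("nobody", "secret"))
```

Each test sets up a known account and then asserts both the accepting and the rejecting case, which is the property the descriptions above call for: valid credentials are accepted, and duplicate usernames or wrong passwords are refused.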

A screenshot of the NUnit test case execution is shown in Figure 8.

Figure 8: Screenshot of NUnit test case execution.

5.2 Load and Stress Testing

Load testing generally refers to the practice of modeling the expected usage of a system by simulating multiple users accessing the system’s services concurrently. Load testing a web application is important because it predicts performance and identifies web-site issues in a simulated environment, eliminating problems before they occur in real situations. As such, this testing is most relevant for multi-user systems, often ones built using a client-server model, such as web servers. When the load placed on the system is raised beyond normal usage patterns in order to test the system’s response at unusually high or peak loads, it is known as stress testing. The load is usually so great that error conditions occur, although no clear boundary exists at which an activity ceases to be a load test and becomes a stress test [15]. Stress testing a web application gives greater confidence in its reliability before it goes into production.

This web application is tested against a particular load, defined by the number of concurrent users of the application and the number of requests they make. The tests are performed using the web-application testing tool NeoLoad. NeoLoad is web testing software that runs high-load scenarios against web applications to measure performance prior to deployment. Comprehensive reports pinpoint issues affecting reliability and long-term scalability. NeoLoad supports all web servers and application servers: J2EE, .NET, PHP, ASP, CGI, AJAX, and SOAP. NeoLoad simulates realistic user behavior. A typical load test monitors any number of virtual users as they complete common tasks such as navigating within an application or filling out forms with dynamic values. Different user profiles, each with its own load variation policy (constant, ramp-up or peaks), can be run simultaneously. During load testing, NeoLoad displays real-time reports on infrastructure performance and key statistics: hits per second, average response time, throughput, etc. When load testing is complete, NeoLoad generates performance reports that help identify strategic short-term and long-term issues.

Using NeoLoad, a load of 10 concurrent users is simulated on the web server, with each user making an HTTP request to a page in the web application and repeating each request for three iterations. The load is simulated in a ramp-up manner, where the number of concurrent users is increased from 1 to 10 with a span of 5 seconds for each user. A virtual think time of 5 seconds is also applied for each user between requests, i.e. it simulates a situation where the user reads the downloaded page for 5 seconds and then requests another page. A sequence of page requests in the web application is tested against the simulated load. The sequence starts from the home page of the application and then proceeds to other pages such as the file upload page. Each operation that a user performs on a page is an HTTP request to the web server. For example, uploading a file to the web server for CAP3 processing, viewing uploaded files, deleting an uploaded file, and checking the processing status of an uploaded file are each HTTP requests to the web server. NeoLoad records this browsing sequence and, once recording is done, simulates the specified user load using the recorded sequence, i.e. it executes the sequence for each virtual user as if a real user were browsing all the pages recorded in the sequence.
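The ramp-up policy described above can be made concrete with a small calculation. The sketch below is only an illustration of the arithmetic, not of how NeoLoad works internally: the function names are hypothetical, and the ramp interval is left as a parameter with the 5-second value from the paragraph above as its default.

```python
def active_users(t, max_users=10, step_seconds=5):
    """Number of virtual users active at time t (in seconds) under a linear
    ramp-up that starts with one user and adds one more every step_seconds."""
    if t < 0:
        return 0
    return min(max_users, 1 + int(t // step_seconds))

def requests_per_user(pages_in_sequence, iterations=3):
    """Total page requests one virtual user makes over all iterations."""
    return pages_in_sequence * iterations

# Usage: the user count at each 5-second mark until the full load is reached.
schedule = [(t, active_users(t)) for t in range(0, 50, 5)]
```

With these defaults the full load of 10 users is reached 45 seconds after the start, and a user replaying a five-page sequence for three iterations issues 15 page requests.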

5.2.1 Screenshots of recorded page requests

Using NeoLoad, a sequence of page requests is made to the web server and then replayed under the load scenario. The page requests include the home page, the file upload page, the view-uploaded-files request, the file processing status request, and the delete file request. Screenshots of the recorded pages are shown in Figures 9, 10, 11, 12 and 13, respectively.

Figure 9: Screenshot of home page recording.

Figure 10: Screenshot of file upload page recording.

Figure 11: Screenshot of View uploaded files request recording.

Figure 12: Screenshot of file status request recording.

Figure 13: Screenshot of file deletion request recording.

5.2.2 Analysis of test results

The recorded page sequence is subjected to a load test using the specified load of 10 concurrent users, where the simulation starts with 1 user and ramps up to 10 users, with the user count incremented by 1 every 10 seconds. The simulation is run for 3 minutes and each user run is iterated 3 times. The resulting values of the performance metrics (average response time, hits per second, error rate, throughput and page response time distribution) are plotted as two-dimensional graphs that compare the number of users with the values of the metrics.

5.2.2.1 Average response time

In a web application, response time is the time from when the first byte of a page request is sent to the server until the last byte of the response is received from the server. Average response time is the response time averaged over all users.

Graph Min   Average   Graph Max   Median   Average 90%   Std. Deviation
0.062       0.803     12.4        0.135    0.486         1.95

Figure 14: Average response time (seconds) for all pages.

Figures 14 and 15 show the average response time for pages and for requests, respectively.

Observing the graph, we can say that the response time increased as the user load ramped up and reached its maximum at full user load. Taking the median into consideration, however, the response time was sustainable. More efficient programming techniques may improve the response time.
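The remark about the median can be illustrated with a short calculation: a few very slow responses pull the average up sharply while leaving the median almost unchanged. The sample values below are hypothetical and are not the measured data; they are chosen only to show the effect.

```python
import statistics

# Hypothetical page response times in seconds (not the measured data):
# nine fast responses and one very slow one.
samples = [0.1, 0.1, 0.2, 0.1, 0.2, 0.1, 0.1, 0.2, 0.1, 10.0]

summary = {
    "min": min(samples),
    "average": statistics.mean(samples),
    "max": max(samples),
    "median": statistics.median(samples),
    "stdev": statistics.pstdev(samples),  # population standard deviation
}
```

Here the single 10-second response raises the average to about 1.12 seconds, while the median stays at 0.1 seconds. The same pattern appears in the measured values, where the median is far below the average and the maximum.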

Graph Min   Average   Graph Max   Median   Average 90%   Std. Deviation
0.013       0.212     2.88        0.041    0.136         0.485

Figure 15: Average response time (seconds) for all requests.

A single web page can send a number of HTTP requests to the server; for example, an image click, a button click, or a page click on a grid view can each be considered a request that originates from the same page. Figure 15 shows the response time for these kinds of requests.

5.2.2.2 Hits per second

A hit is a completed HTTP request, i.e. a request that was sent to the server and for which the complete response was received. A hit can be a page request or a request for one of the page’s frames, images, buttons, etc. The graph in Figure 16 compares the number of hits/s with the number of active users. A higher number of hits/s across the entire user load indicates better server performance in terms of the number of HTTP requests served completely.

Graph Min   Average   Graph Max   Median   Average 90%   Std. Deviation
0           5.5       9           6        5.29          2.9

Figure 16: Number of hits on the server by users.

In the graph, the average number of hits/s is 5.5, with a maximum value of 9. As the user load increased, the rate at which the server satisfied hits remained consistent.
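The hits/s metric can be derived from the completion times of individual requests by counting how many requests finish in each one-second interval. The sketch below shows that calculation on hypothetical timestamps; it does not describe how NeoLoad computes the metric internally.

```python
from collections import Counter

def hits_per_second(completion_times):
    """Count completed requests in each one-second interval.

    completion_times are seconds since the start of the test; the result has
    one entry per second, including seconds in which nothing completed.
    """
    counts = Counter(int(t) for t in completion_times)
    if not counts:
        return []
    return [counts.get(second, 0) for second in range(max(counts) + 1)]

# Usage with hypothetical completion times.
times = [0.2, 0.9, 1.1, 1.5, 1.7, 3.4]
series = hits_per_second(times)       # [2, 3, 0, 1]
average = sum(series) / len(series)   # 1.5
```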

5.2.2.3 Throughput

The throughput of a server is the amount of data returned to users by the server over a given period. For a web application to perform well, its throughput should be either consistent or increasing, and should never degrade. A reduction in the throughput delivered by the server may result in a longer wait time for the user to download the data, thereby increasing the frustration coefficient for the application. The throughput delivered by the IIS server during the above load simulation is shown in Figure 17. The graph shows that the throughput of the server increased as the user load increased, except for a steep decrease of short duration. Considering the average value, however, the throughput can be regarded as consistent. Higher throughput may be achieved using a high-performance server.

Figure 17: Throughput – KB of data returned by the server.

5.2.2.4 HTTP Error rate

Figure 18: Error rate, in errors per sampling interval.

For this test, an HTTP error is counted whenever a page request returns an HTTP reply code other than 200 (OK). Errors% is the percentage of requests that returned such a code. In a web application, most errors are bad requests, forbidden errors, timeouts, and internal server errors. Figure 18 plots the error count against the number of users for the load simulation.
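Following the definition above, the error percentage is the share of responses whose status code is not 200. A minimal sketch of that calculation, using a hypothetical list of status codes:

```python
def error_percentage(status_codes):
    """Percentage of responses whose HTTP status code is not 200 (OK)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code != 200)
    return 100.0 * errors / len(status_codes)

# Usage with hypothetical status codes: two errors out of eight responses.
codes = [200, 200, 500, 200, 403, 200, 200, 200]
rate = error_percentage(codes)   # 25.0
```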

5.2.2.5 Distribution of Page response time

Figure 19 displays the percentage of pages that were served within a given time range. This graph helps determine the percentage of pages that meet a performance objective. For example, it can show that 90% of the pages have a response time under ‘n’ seconds.
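The two readings of such a graph, the share of pages under a threshold and the share of pages in each time range, can be computed as sketched below. The response times and range boundaries are hypothetical, chosen only to illustrate the calculation.

```python
def percent_within(response_times, threshold):
    """Percentage of pages whose response time is at most threshold seconds."""
    if not response_times:
        return 0.0
    within = sum(1 for t in response_times if t <= threshold)
    return 100.0 * within / len(response_times)

def distribution(response_times, edges):
    """Percentage of pages in each half-open range [edges[i], edges[i + 1])."""
    total = len(response_times)
    result = []
    for low, high in zip(edges, edges[1:]):
        count = sum(1 for t in response_times if low <= t < high)
        result.append((low, high, 100.0 * count / total if total else 0.0))
    return result

# Usage with hypothetical page response times in seconds.
times = [0.2, 0.4, 0.6, 0.9, 1.2, 1.4, 1.8, 2.1, 2.4, 6.0]
share = percent_within(times, 2.5)            # 90.0
buckets = distribution(times, [0, 1, 2, 3, 10])
```

For this sample, 90% of the pages finish within 2.5 seconds, and the ranges 0–1, 1–2, 2–3 and 3–10 seconds hold 40%, 30%, 20% and 10% of the pages respectively.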

Figure 19: Distribution percentage of page response time.

Observing the graph, we see that about 95% of the page requests from the users have an average response time under 2.5 seconds, which is quite encouraging for a web interface like this. Because CAP3 runs in the background on the server, the rate at which the server serves page requests should not be affected by CAP3’s processing of input files. Otherwise, users may have to wait a long time to upload their files. The reason for the good page response time for most of the page requests, as seen in the above graph, is that each job uploaded by a user is handled separately in an individual thread assigned to that job. This lets the server split the work into separate threads of execution that run simultaneously, rather than processing everything in the main request-handling path.

CHAPTER 6 – PROJECT METRICS AND EXPERIENCE

6.1 Project Metrics

The amount of time spent on each phase of the project and the lines of code written in the project are shown in Tables 2 and 3, respectively.

Learning project background       30 – 40 hrs
Requirements capture & design     15 – 20 hrs
Learning .NET thread concepts     5 – 8 hrs
Implementation                    100 – 110 hrs
Testing                           10 – 20 hrs
Documentation                     40 – 50 hrs

Table 2: Project phases and their duration

ASP.NET & VB.NET server-side code   1160 lines
XML Web services                    1600 lines

Table 3: Project Lines of code

6.2 Overall Experience

The biggest task I faced in this project was learning the project background, i.e. a brief introduction to bioinformatics and a deep understanding of how CAP3 works and what it delivers. A large part of the project was learning how CAP3 should be executed using a sequence of commands at the command prompt. Since the web interface should completely eliminate the user’s direct interaction with CAP3, there had to be some way of executing the commands automatically, in the particular sequence required by CAP3. So I implemented batch files that execute the commands sequentially and store the output files in the user folder. Learning how to create a batch file was a challenging phase of the project for me.
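The idea of generating a batch file from the name of the uploaded file can be sketched as follows. The exact commands used by the project are not listed in this report, so the script contents below are assumptions for illustration: the sketch only assumes that CAP3 is started with the reads file as its argument and that its standard output is redirected to a result file. The function names and the run_job.bat file name are hypothetical.

```python
import tempfile
from pathlib import Path

def build_batch_script(job_dir, input_name):
    """Return the text of a Windows batch file for one job.

    The commands are illustrative assumptions, not the project's actual script.
    """
    lines = [
        "@echo off",
        f'cd /d "{job_dir}"',
        # Run CAP3 on the reads file and redirect what it prints to a result file.
        f'cap3 "{input_name}" > "{input_name}.out"',
    ]
    return "\r\n".join(lines) + "\r\n"

def write_batch_script(job_dir, input_name):
    """Write the batch file into the job folder and return its path."""
    path = Path(job_dir) / "run_job.bat"
    path.write_bytes(build_batch_script(job_dir, input_name).encode("utf-8"))
    return path

# Usage: generate the script for a hypothetical upload named reads.fasta.
job_dir = tempfile.mkdtemp()
script_path = write_batch_script(job_dir, "reads.fasta")
```

Because the script is built from the job folder and the file name, a fresh batch file can be produced for every upload without the user typing any command.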

After successfully implementing the execution of the command sequence using a batch file, I encountered a problem: I was unable to upload another input file to the server while CAP3 was processing a previously uploaded file. Since both the command prompt and CAP3 are stored in the root folder on the web server, another file could not be uploaded once CAP3 had started working on a previous file. This was a problem for users, since they would have to wait until CAP3 finished the current file in the pipeline. Moreover, CAP3 takes from a few seconds to a couple of hours to process a file, depending on the input file size. This problem was solved using threads in .NET, which split the execution of a process into two or more simultaneously running tasks. When users upload their files to the server, a separate thread is invoked for each uploaded file. This thread supplies the input file to CAP3. Since each thread runs independently, multiple files can be processed by CAP3 at the same time on the server, eliminating the need for users to wait until CAP3 finishes a previously uploaded file.

Overall, I have learnt a lot about developing dynamic web applications using Microsoft ASP.NET. During the course of this project, I have learnt new concepts in programming and web development, such as multi-threading, creating batch files, and implementing web services, which will be useful throughout my career.

CHAPTER 7 – CONCLUSION AND FUTURE WORK

7.1 Conclusion

A user-friendly web interface for the CAP3 assembly program has been developed using ASP.NET and XML Web services on IIS. The input files are processed asynchronously with the help of threads to improve performance. The processing of the job files is automated with the help of batch files generated dynamically from the name of the user’s input file. The efficiency of the application is improved by the use of web methods, which help separate the application tier from the presentation tier. The performance of the application is evaluated by testing it under a simulated load scenario, and its correctness is evaluated with the help of unit test cases.

The main advantage of the interface is that it removes the need for the user to remember the sequence of commands required to process a file. It also removes the need for the user to wait for the output files to be generated, as they are stored on the web server and retrieved as and when required. It provides a log that shows users which files they have uploaded and processed previously, and it lets them check the status of a job and view the results of a completed process. An option to search for files based on various parameters helps the user retrieve a previously uploaded file.

7.2 Future Work

• The CAP3 application is presently being run with default options. The application can be easily extended to accept various input parameters from the user.

• The project can be extended to include other steps of the pipeline used in the processing of EST sequences.

• Visualization of the output can be done by streaming the output file to available visualization tools.

• Statistical analysis of the results generated and comparison of two or more result files can be achieved.

REFERENCES

[1] Wikipedia, “Introduction to Genome Assembly”, http://en.wikipedia.org/wiki/Genome_assembly#Genome_assembly.
[2] Wikipedia, “Expressed Sequence Tags”, http://en.wikipedia.org/wiki/Expressed_sequence_tag.
[3] Wikipedia, “FASTA Format”, http://en.wikipedia.org/wiki/Fasta_format.
[4] Microsoft Developer Network, “Introduction to Microsoft .NET”, http://msdn2.microsoft.com/en-us/library/zw4w595w.aspx.
[5] CodeGuru, “The .NET Architecture”, http://www.codeguru.com/csharp/sample_chapter/article.php/c8245.
[6] Microsoft Developer Network, “Introduction to Common Type System”, http://msdn2.microsoft.com/en-us/library/zcx1eb1e(VS.71).aspx.
[7] Microsoft Developer Network, “Common Language Specification”, http://msdn2.microsoft.com/en-us/library/12a7a7h3.aspx.
[8] Microsoft Developer Network, “Introduction to Code Access Security”, http://msdn2.microsoft.com/en-us/library/c5tk9z76.aspx.
[9] Microsoft Developer Network, “Introduction to ASP.NET and its environment”, http://msdn2.microsoft.com/en-us/library/4w3ex9c2.aspx.
[10] Microsoft Developer Network, “Introduction to VB.NET and its specifications”, http://msdn2.microsoft.com/en-us/library/aa711604(VS.71).aspx.
[11] Microsoft Developer Network, “Internet Information Services SDK”, http://msdn2.microsoft.com/en-us/library/ms525568.aspx.
[12] Microsoft Developer Network, “Microsoft SQL Server 2005”, http://msdn2.microsoft.com/en-us/virtuallabs/aa740409.aspx.
[13] IBM Developer Networks, “Web Services Architecture Overview”, http://www.ibm.com/developerworks/library/w-ovr/.
[14] NUnit Testing Framework, “Unit Testing Framework for Microsoft .NET languages – NUnit”, http://nunit.org.