Techniques for Storing and Processing Next-Generation DNA Sequencing Data

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Terry Camerlengo

Graduate Program in Biophysics

The Ohio State University

2014

Master's Examination Committee:

Professor Kun Huang, PhD

Professor Raghu Machiraju, PhD

Professor Carlos Alvarez, PhD

Copyright by

Terry Camerlengo

2014

ABSTRACT

Genomics is undergoing unprecedented transformation due to rapid improvements in genetic sequencing technology, which have lowered the cost of sequencing experiments while increasing the amount of data generated in a typical experiment (McKinsey Global Institute, May 2013, pp. 86-94). The increase in data has shifted the burden from analysis and research toward expertise in IT hardware and network support for distributed and efficient processing. Bioinformaticians, in response to a data-rich environment, are challenged to develop better and faster algorithms to solve problems in genomics and molecular biology research.

This thesis examines the storage and data processing issues inherent in next-generation DNA sequencing (NGS). This work details the design and implementation of a software prototype that exemplifies current approaches to the efficient storage of NGS data. The software library is utilized within the context of a previous software project which accompanies the publication related to the HT-SOSA assay. The software for the HT-SOSA, called NGSPositionCounter, demonstrates a workflow that is common in a molecular biology research lab. In an effort to scale beyond the research institute, the software library's architecture takes into account scalability considerations for data storage and processing demands that are more likely to be encountered in a clinical or commercial enterprise.


DEDICATION

This Master's thesis is dedicated to my beautiful wife Ellen Nixon.


ACKNOWLEDGMENTS

I would like to thank Dr. Joel Saltz and Dr. Tahsin Kurc for giving me the rare opportunity of working at The Ohio State University Comprehensive Cancer Center's Biomedical Informatics Shared Resource. Without them I never would have been exposed to the fascinating and exciting areas of bioinformatics and scientific computing.

I would also like to thank them for fully supporting my decision to pursue graduate work in computational biology while being employed full-time at the shared resource.

I would also like to thank Dr. Kun Huang for his mentorship over the years as both my graduate advisor and my supervisor. I have found Dr. Huang to be not only a brilliant individual whom I was most fortunate to work with, but a kind and caring teacher whose door was always open when it came to navigating the difficulties of achieving balance between career, academic studies, and personal life. Thank you, Dr. Huang, for all you have done.

A special thanks to Dr. Raghu Machiraju and Dr. Carlos Alvarez for their assistance both as committee members and as co-PIs on various grants that I had the opportunity to work on. I was deeply enriched by their depth of knowledge and guidance.

Hopefully our collaborations will continue in the coming years.

I would also like to acknowledge the Department of Defense Congressionally Directed Medical Research Programs grant awards W81XWH-11-2-0224, -0225 and -0226, which were instrumental in the development of many of the ideas in this thesis.


VITA

March 1988 ...... Steubenville High School

1994 ...... B.A. Philosophy, The Ohio State University

1997 ...... B.A. Computer Science, The Ohio State University

1997 to 2004 ...... Software Programmer (various places)

2004 to 2013 ...... Research Specialist, Department of Biomedical Informatics, The Ohio State University and Comprehensive Cancer Center

2013 to present ...... Principal Research Scientist, Battelle Memorial Institute


PUBLICATIONS

• Co-author, SCJD Exam with J2SE 5, 2nd Edition, Apress Books, ISBN 1-59059-516-5, December 2005

• Co-author, The Sun Certified Java Developer Exam with J2SE 1.4, Apress Books, ISBN 1590590309, August 2002

• Terry Camerlengo, C. Johnson, "Make the Java-Oracle9i Connection", JavaWorld Magazine, http://www.javaworld.com/javaworld/jw-06-2003/jw-0613-oracle9i.html, June 2003

• Kurc T, Janies D, Johnson A, Langella S, Oster S, Hastings S, Habib F, Camerlengo Terry, Ervin D, Catalyurek U, Saltz J, "An XML-based System for Synthesis of Data from Disparate Databases", J Am Med Inform Assoc, 2006, in press

• Hatice Gulcin Ozer, Doruk Bozdağ, Terry Camerlengo, Jiejun Wu, Yi-Wen Huang, Tim Hartley, Jeffrey D. Parvin, Tim Huang, Umit V. Catalyurek, Kun Huang, "A Comprehensive Analysis Workflow for Genome-Wide Screening from ChIP-Sequencing Experiments", SpringerLink, http://www.springerlink.com/content/c882314242m17018, April 2009

• Terry Camerlengo, Gulcin Ozer, Guojuan Zhang, Tarek Joobeur, Tea Meulia, Joanne Trgovcich, Kun Huang, "Computational Challenges and Solutions to the Analysis of Micro RNA Profiles in Virally-Infected Cells Derived by Massively Parallel Sequencing", 2009 Ohio Collaborative Conference on Bioinformatics (OCCBIO), pp. 32-36, http://www.computer.org/portal/web/csdl/doi/10.1109/OCCBIO.2009.24, 2009

• Hatice Gulcin Ozer, Terry Camerlengo, Tim Huang, Kun Huang, "A New Method for Mapping Short DNA Sequencing Reads by Using Quality Scores", 2009 Ohio Collaborative Conference on Bioinformatics (OCCBIO), pp. 21-25, http://www.computer.org/portal/web/csdl/doi/10.1109/OCCBIO.2009.35, 2009

• Terry Camerlengo, Hatice Gulcin Ozer, Mingxiang Teng, Francisco Perez, Pearlly Yan, Lang Li, Jeffrey Parvin, Tim Huang, Tahsin Kurc, Yunlong Liu, and Kun Huang, "Enabling Data Analysis on High-throughput Data in Large Data Depository Using Web-based Analysis Platform – A Case Study on Integrating QUEST with GenePattern in Epigenetics Research", 2009 IEEE International Conference on Bioinformatics and Biomedicine, Nov. 2009

• Kumar PS, Brooker MR, Dowd SE, Camerlengo T (2011), "Target Region Selection Is a Critical Determinant of Community Fingerprints Generated by 16S Pyrosequencing", PLoS ONE 6(6): e20956, doi:10.1371/journal.pone.0020956

(Terry Camerlengo H. G.-S., 2012)

(Taggart, et al., 2013)


FIELDS OF STUDY

Major Field: Biophysics


Table of Contents

Abstract

Dedication

Acknowledgments

Vita

Publications

Fields of Study

List of Tables

List of Figures

Chapter 1: The NGS Challenge

Chapter 2: NGS Workflows And Institutional Decision Support

Chapter 3: An Automated Pipeline for Processing Next Generation Sequencing

Chapter 4: Outlines of a Software Library for NGS Storage

Chapter 5: Efficient Storage Of NGS Data

Chapter 6: Conclusion

Bibliography

Appendix A – Comparison of 4-bit and 3-base/Byte Encoding Schemes

Appendix B – QUEST System Overview


LIST OF TABLES

Table 1. 4-bit encoding of DNA bases

Table 2. API for reference genome lookups

Table 3. PersistenceMgr CRUD types

Table 4. Comparison of various storage techniques


LIST OF FIGURES

Figure 1. The workflow for ChIP-seq data processing and analysis (Ozer, et al., 2009)

Figure 2. NGS data processing and automation pipeline

Figure 3. Execution of the Configuration file

Figure 4. Main Page for viewing Studies

Figure 5. GAII run history and comments/analyzing instructions

Figure 6. Flow cell properties header

Figure 7. An example of lane properties panel for a flowcell

Figure 8. Supplemental HT-SOSA Sequences

Figure 9. SAM File Format Column Headers

Figure 10. Cigar String Operators

Figure 11. Base-Pair Storage Representation and Strategy

Figure 12. Prototype Overview Diagram

Figure 13. Reading a SAM file line by line

Figure 14. The NGSContainerMgr

Figure 15. The SeqRead abstract class

Figure 16. SeqReadRepresentation

Figure 17. The finder method

Figure 18. The Persistence Layer

Figure 19. The Service Locator

Figure 20. PersistenceMgr Connection Handling

Figure 21. A sample configuration file


CHAPTER 1: THE NGS CHALLENGE

The biomedical sciences are inundated with data: genetic data, or DNA sequences, generated by specialized next-generation DNA sequencers that output long strings of 'A's, 'C's, 'T's, 'G's and 'N's in seemingly random order. The situation will only intensify as the cost of sequencing decreases and nears the critical $1,000 mark (ER, 2006). The increase in demand for genomic-related applications is inevitable and has already reached Big-Data levels (Schatz & Langmead, 2013).

Additionally, as our understanding of the genome increases, we will become more adept at interpreting and integrating different types of genomic data from distinct biomedical domains like epigenetics, RNA expression, metagenomics and the microbiome, targeted sequencing, and de novo sequencing and DNA reassembly.

Integrating genomic data from each of the aforementioned research areas will be vital to developing a comprehensive and coherent profile for an individual, a community, and even an entire species.

Currently, most DNA sequencing is done in the research lab, where a clear strategy for handling the resulting bottlenecks related to data storage and computational processing is missing. Computational tools and techniques will need to shift from use cases or processing paradigms common in academia to those necessary in the clinic. The main bottleneck related to NGS technologies is not actually the sequencing, but the computing aspect.

To this end, software systems for managing large genomic data sets will play a critical role in transforming the nascent and burgeoning personalized genomic industry.

There is an obvious need for a genomic database system focused on the efficient storage and retrieval of next-generation DNA sequencing (NGS) data where traditional databases and storage techniques are ill-suited.

The ideas in this thesis explore the use of various compression and encoding techniques specifically designed for the storage of NGS data. Additionally, the requirements and features for the transition from the traditional data management and processing environments common in the academic research institute to those found in a "big data" enterprise arena will be considered. The "big data" enterprise refers to organizations that collect "large pools of data that can be captured, communicated, aggregated, stored and analyzed" (McKinsey Global Institute, June 2011). The enterprise requirements that will be touched upon in this thesis are automation, scalability, distributed processing and efficient storage. The resulting program is a software framework for efficiently storing and indexing NGS data so that it is optimized for quick search and simplified integration with existing bioinformatic programs in a variety of domain areas such as pharmacogenomics, electronic medical record system integration, translational medicine, personalized genomics, metagenomics, etc.

The computational challenges are actually two-fold. Much focus has been applied to the processing bottleneck, as many researchers are looking for inspiration to internet search and social media companies that have met similar computational demands. Big-data techniques such as high-performance computing (HPC), Amazon's EC2, Hadoop and MapReduce are currently being explored as alternatives for processing large-scale datasets.

The second bottleneck involves storage: as the amount of sequencing continues to increase, where and how will all of the data be stored? Later in this thesis, in Chapter 5, a software framework for efficient storage is described which separates storage-related concerns from processing concerns.

Chapter 2 takes a look at the sort of NGS workflow problems that are encountered in research institutions, where scalability and enterprise requirements like automation are not design objectives. Chapter 2 will explore a specific research-oriented workflow, its associated computing challenges, and how these challenges were addressed at Ohio State. Chapter 3 moves closer to the "big data" enterprise model by introducing a system design for NGS alignment processing focused on scalability and automation. Chapter 4 tackles the problem of NGS processing using custom and novel bioinformatic programs by describing a global alignment problem used to characterize the mutagenic profile of a series of translesion-synthesis experiments (e.g. the HT-SOSA assay) described in the Taggart/Camerlengo paper. The HT-SOSA experiments provide a context and problem domain for the eventual prototype developed in the penultimate chapter. Chapter 5 describes the principles of efficient NGS storage employed in the prototype and offers a software design intended for use in a "big data" enterprise environment. Chapter 5 also compares the results of applying the software prototype to the data sets and the global alignment algorithm from the Taggart/Camerlengo translesion-synthesis publication.

Chapter 6 is the concluding chapter and considers the next steps toward improving the software prototype and design.


CHAPTER 2: NGS WORKFLOWS AND INSTITUTIONAL DECISION SUPPORT

Currently, most NGS sequencing occurs at research institutions and universities, and the techniques to manage the deluge of data resulting from NGS experiments have been developed at the same institutions. Cutting-edge groups at places like the Broad Institute (IGV, GenePattern, GATK), UCSC (the Genome Browser), the European Bioinformatics Institute (CRAM) and many others have played a leading role in developing the software tools for analyzing, and to a lesser extent storing, NGS data for research institutes. Often the task of managing data sets and performing custom analysis is the responsibility of an individual or small team of researchers, is highly dependent on the nature of their research, and varies somewhat from institution to institution. These techniques have centered on traditional infrastructure solutions common in research and academic institutions. Examples of these types of solutions are network file storage (i.e. SANs) and script-based access using tools like R, Perl, and Python. Some of the more sophisticated techniques have involved cluster computing and task-parallelization techniques such as MPI or threading.

A current standard for storing NGS data is the SAM file format (Li H.*, 2009). BAM files are the binary version of SAM files. The FASTA format is another popular file format. Each file corresponds to a lane on a flow cell of a sample that has been sequenced. In multiplexed experiments (more than one sample per lane), adapter tags are used to identify individual samples. Access to a BAM file is provided using SAM/BAM command-line tools, and there are bindings in many languages. The software described in this thesis uses Picard as the Java language binding for accessing and manipulating SAM files.

Typical workflows for processing NGS data aggregate (or copy) sequence files into a specific directory so that a script can access and process the NGS files located there. Often multiple scripts are run sequentially, in multi-step fashion, in the case of a processing pipeline. Another, slightly more sophisticated, approach is to invoke the scripts in a parallel and/or concurrent fashion on an HPC cluster, partitioning the data sets in order to take advantage of the multi-core architecture, typically by using PBS scripting. Since the raw sequencing repository is of manageable size (i.e. not in the Big-Data range), the files can be processed easily without resorting to advanced storage techniques.
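As a concrete illustration of this directory-based style of processing, the sketch below iterates over indexed BAM files in a directory with Picard's SAMFileReader and counts reads overlapping a locus. It is only a sketch: the directory layout, locus and mapping-quality cutoff are hypothetical examples, not part of the workflow described here.

import java.io.File;
import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMRecordIterator;

public class RegionFilterExample {
    public static void main(String[] args) {
        File dir = new File(args[0]);   // directory holding the aggregated, indexed BAM files
        for (File bam : dir.listFiles()) {
            if (!bam.getName().endsWith(".bam")) continue;
            SAMFileReader reader = new SAMFileReader(bam);   // query() below requires a .bai index
            // Hypothetical locus of interest; any region or filter criteria could be substituted.
            SAMRecordIterator it = reader.query("chr17", 41196312, 41277500, false);
            int passing = 0;
            while (it.hasNext()) {
                SAMRecord rec = it.next();
                if (rec.getMappingQuality() >= 30) {   // example filtering criterion
                    passing++;
                }
            }
            it.close();
            reader.close();
            System.out.println(bam.getName() + ": " + passing + " reads overlap the locus");
        }
    }
}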

Let's consider a common scenario: a researcher wants to perform analysis on a specific locus for a multiplexed sequencing experiment. The researcher or investigator will aggregate a group of samples he or she is interested in studying. After identifying the samples of interest, each corresponding BAM file will then be loaded using SAM tools and filtered based on some criteria of interest to the researcher. Next, all of the results from each sample will be aggregated, and further downstream analysis may be performed on the result set. The next section is an excerpt from the publication "A Comprehensive Analysis Workflow for Genome-Wide Screening Data from ChIP-Sequencing Experiments" (Ozer, et al., 2009).

The following excerpt demonstrates a typical computational workflow encountered in a research organization for NGS data processing. A key feature of this workflow is that it was organized to analyze data for a specific biological assay related to ChIP-sequencing alignment and the calculation of binding densities. What the design does not address are computational issues such as scalability, centralized storage, querying subsets of the data, and a general and flexible programming model, which are primarily enterprise computing challenges.


Figure 1- The workflow for ChIP-seq data processing and analysis (Ozer, et al., 2009)

Mapping Sequences to Reference Genome

The ChIP-seq data generated from sequencing equipment such as the Solexa and SOLiD sequencers are in the form of short DNA segments of no more than 50 bases. These short DNA segments need to be aligned to the reference genome before further processing. While the commercial vendors usually provide sequence alignment software, such as Eland provided by Illumina as part of the Solexa data analysis pipeline, such software is usually not sufficient for all situations.


For instance, Eland is hard-coded to align sequences up to 32 nucleotides in length and allows at most 2 mismatches. In some cases it is necessary to use the full length of the sequence segments and allow more than 2 mismatches, or even gaps, in the alignment. To overcome the limitations of Eland [5], we also integrate several other algorithms, including SeqMap, RMAP, MapReads, and xMAN, into our workflow. The SeqMap algorithm allows up to 5 mismatches and gaps in combination and uses the full length of the sequence segments. In addition, at the end of the Solexa pipeline, base-call quality scores for the sequence segments are reported. Eland uses these quality scores to filter out low-quality sequence reads. These quality scores can be integrated into the alignment process by evaluating only the high-quality bases in the sequence segments during the alignment. This approach is implemented in the RMAP algorithm. Compared to Eland and SeqMap, RMAP is much slower [5]. However, RMAP provides better mapping accuracy [6]. Therefore, in addition to Eland and SeqMap, we also integrate the RMAP algorithm into our workflow.

Output Files and Initial Visualization

The sequence mapping results are stored in files containing millions of chromosomal locations and strand orientation tags. Such files cannot be interpreted directly by biologists. In order to facilitate the interpretation, our workflow automatically splits the sequence mapping results based on the chromosome, i.e. the segments mapped to each chromosome are stored in a separate file. In addition, for each chromosome, the file is converted to the BED and WIG formats. BED and WIG files are the common file formats used by the UCSC Genome Browser. The BED files allow users to visualize the binding locations of sequence tags, and the WIG files allow users to visualize computed binding densities of the sequence tags over the genome.

In addition, the workflow also contains a module to automatically generate the histogram of the density of the sequence tags. Specifically, the workflow generates the histogram for each chromosome using three different bin sizes: 200 bp, 500 bp, and 1000 bp. The choice of the bin sizes is based on the fact that most of the DNA segments generated in the ChIP experiments are between 200 and 500 bp. The histogram files are essential for the next stage of data processing, the analysis and visualization stage.
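The binning step described above amounts to counting mapped positions per fixed-size window. A minimal sketch of that computation (not the workflow's actual module, and using toy positions) might look like this:

import java.util.TreeMap;

public class TagDensityHistogram {
    // Bins mapped read positions for one chromosome into fixed-size windows.
    public static TreeMap<Integer, Integer> bin(int[] positions, int binSize) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (int pos : positions) {
            int binStart = (pos / binSize) * binSize;   // left edge of the window containing pos
            counts.merge(binStart, 1, Integer::sum);    // increment the tag count for this bin
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] mappedStarts = {1200, 1250, 1399, 5000, 5180};   // toy mapped positions
        for (int binSize : new int[] {200, 500, 1000}) {
            System.out.println(binSize + " bp bins: " + bin(mappedStarts, binSize));
        }
    }
}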

Data and Workflow Management

We developed a data management and workflow control system named QUEST. This system allows the user to store, share, query, and retrieve large bioinformatics data sets, including GeneChip and ChIP-sequencing data. In addition, it enables the user to carry out pre-defined data analysis workflows written in many scripting languages, including Matlab, R, Perl, Python and Java. Currently the preprocessing workflow for the mapping results, which generates BED files, WIG files and histogram files, is integrated into the QUEST system.

In the QUEST system, all of the BAM files for a particular sample are stored and processed together in batch. A typical study can have multiple samples and will often use SAM/BAM tools to analyze all the BAM files that correspond to a particular study. Often, by building a workflow comprised of a number of scripts (in R, Perl or Python), a researcher can isolate the specific data they are interested in for a particular study.


The ChIP-Seq workflow described in this chapter has a few noteworthy aspects from a systems perspective. First, the workflow is tightly coupled to an access pattern based on samples/NGS runs. When a researcher wants to study a feature of their data, they must first identify the samples (or experiments) they are interested in. Second, the above approach is potentially parallelizable and amenable to concurrent processing, rather than sequential processing.

The above approaches do not scale well when moving to very large data sets in the petabyte range, as the number of samples would be very large and difficult to identify and isolate. This is especially true when access and processing is not based on samples, but on unique features which span many different genomic regions and sample types.

Below are some use cases that might be encountered in an enterprise computing environment:

1.) A large pharmaceutical/contract research organization has a large collection of sequences of transgenic mice and wants to perform BLAST-style operations to find human-to-mouse orthologs.

2.) A hospital system is sequencing some of its patients and would like to store known biomarkers and well-established genetic patterns in their Electronic Medical Record (EMR) system without storing the raw genomic sequence data directly (which would be problematic). The hospital system would like to periodically update their EMR system by processing their very large NGS data repository with new rules and/or an updated reference genome.

3.) A research cancer center is sequencing, through its sequencing core facility, all of its patients who agree to enroll in a clinical study, and wants to provide its researchers and core facility with a standard and centralized platform for performing computations supporting secondary and tertiary analysis routines that can run on its internal clusters. Thus the same centralized cluster can be used by different parties to scale to the center's analysis demands.

The next chapter discusses an architectural framework that can scale to meet the above demands as the amount of NGS data and the variety of bioinformatics processing increase. Scalability refers to a system's ability to meet increased usage and growth. The conference paper "From Sequencer to Supercomputer" (Terry Camerlengo, 2012) describes a distributed architecture where the data processing and sequencing alignment are performed on an HPC server which is managed by a web-based front-end.


CHAPTER 3: AN AUTOMATED PIPELINE FOR PROCESSING NEXT GENERATION SEQUENCING DATA1

Introduction

During the past three years, Next Generation Sequencing (NGS) technology has been widely adopted in biomedical research and is revolutionizing many research areas, since it enables researchers to directly examine the genome at single-base resolution (1-4). However, the processing and analysis of NGS data present many new computational challenges (5). Specifically, from the computational point of view, NGS is highly resource intensive in the ways described below.

NGS data processing is computationally intensive

NGS requires dedicated high-end computer servers for long-running data processing pipelines to convert raw data formats (i.e. intensity files) into sequence data, map the massive number of sequences to the reference genome, and convert the results to useful output formats such as BAM and SAM for further processing, or WIG and BED files for visualization in tools such as the UCSC Genome Browser. The typical minimum hardware specification for NGS data processing is at least 8 CPU cores and 48 GB of memory on a 64-bit Linux server. These minimum requirements far surpass the computing capabilities of a typical workstation.

1 Originally from the conference proceeding publication "From Sequencer to Supercomputer: An Automated Pipeline for Managing and Processing Next-Generation Sequencing Data"

Additionally, storage requirements for NGS data are quite significant and depend on archiving policy (e.g., which files to keep, and for how long). For instance, for an Illumina Genome Analyzer II (GAII), the results from a single flow cell (i.e., a single run) can range anywhere from 200 GB to 500 GB.

NGS data processing is labor intensive

The human effort involved in maintaining an NGS operation is considerable after taking into account the staff required to manually run data processing pipelines, manage NGS result files, administer processing and FTP servers (for data dissemination), set up sequencing runs, and adjust configuration files and processing parameters. In addition, numerous miscellaneous programming and scripting tasks are needed to maintain and automate mundane work encountered in NGS processing.

The motivation is to mitigate, if not completely eliminate, the above hurdles, as well as accommodate the anticipated advances in NGS high-throughput data generation. To achieve this goal we designed and implemented an automation pipeline based on our previous work on an NGS data processing pipeline and data management system.

Methods

With the fast progress of NGS technology, the problem of managing and processing NGS data has become a major hurdle for many sequencing cores and labs, and it will be an even more widespread issue once third-generation sequencers are widely available to a large number of small labs. Currently several commercial systems are available for dealing with this issue, but the prices for such LIMS systems are usually too high for a small lab or sequencing core. Our system thus provides an example solution for this problem with the described features and architecture. In addition, our solution avoids the use of individual computing servers. Instead we take advantage of the computer cluster at the publicly accessible Supercomputer Center. This approach is highly scalable once we obtain additional sequencers.

Our goal is to release our system as a set of open-source code modules. Users can then pick the relevant modules to fit their own pipeline and processing/dissemination needs. In addition, we are also working on developing a publicly accessible web portal hosted at the Supercomputer Center to allow users to submit their own data for processing and analysis.

System architecture

Our system incorporated the following features:

1. A configuration manager for setting up NGS experiments that tracks metadata related to accounts/labs, users, samples, projects, experiments, genomes, flow cells, and lanes. The details of an NGS experiment are then captured as a structured configuration file which is parsed and executed by our Automation Server.

2. An automation server which executes the instructions in a configuration file, including bundling NGS raw data sets, transferring data to and from a compute cluster (supercomputer), setting up a series of jobs or a pipeline, reporting progress and status updates to the QUEST LIMS (of which an overview is displayed in Figure 2), and copying results to a backup location and FTP server if applicable.

3. A computer cluster with the required software programs, genomes, and pipelines installed to process NGS raw data into a suitable result format, including mapping the short reads to the reference genomes.

4. Integration with QUEST (see figure in Appendix B), a Laboratory Information Management System (LIMS) for cataloguing NGS runs and the resulting output files for easy lookup and retrieval.

5. A notification and logging system for reporting pipeline and data transfer progress and any data processing errors, with automatic emails sent to end-users when pipeline execution and data transfer are completed.

6. An FTP server for external users to download their NGS results.

7. A pluggable architecture to accommodate additional NGS sequencers from different manufacturers, computing clusters, and/or data portals.

8. A loosely coupled automation server and compute cluster to simplify pipeline modifications.

System workflow

The workflow for an NGS run is shown in Figure 2, and we describe it below in detail. The workflow begins when a user enters a sample (on a study) in QUEST. Once a sample is entered, it will be assigned by the sequencing core facility to a lane, corresponding to the physical process of preparing a sample on a lane of an Illumina flow cell. This association is made once the sequencing core manager configures the flow cell via the Configuration Manager in QUEST. Once the sequencing core facility has completed the GAII run, it indicates that the run is completed and notifies the Biomedical Informatics Shared Resource (BISR) that the run's raw data files are ready for processing. The BISR analyst will then look over the auto-generated configuration files and launch the automated pipeline. This is done by selecting the "Execute this configuration" button as shown in Figure 3.

Figure 2 NGS data processing and automation pipeline


Once the NGS automation pipeline has been launched with the given configuration, QUEST makes an XML-RPC call to a Python server listening on a registered port. QUEST sends the Python server information about the run, such as the run id, run folder name, run location, and the configuration files generated with the ConfigurationManager. The configuration server resides on a Linux machine that has access to the storage device to which the GAII instrument writes the raw data. The Python server processes the run in the following basic steps:

1.) Transfer data to the compute cluster at the Ohio Supercomputer Center (OSC).

2.) Generate PBS scripts to make and run the Gerald pipeline. PBS scripts are used for job scheduling on the OSC computer cluster. The scripts automatically specify the number of nodes needed and the expected CPU time of execution.

3.) Update QUEST with status updates about the run (reporting errors if encountered).

4.) When a sequencing run is completed, transfer data to an sftp server and to QUEST (users can either FTP the results or download them from QUEST over HTTP).

5.) Update QUEST that the run is completed, which automatically sends a notification to the end users via e-mail, including information such as the sftp server directory and a temporary password.
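For illustration only, the sketch below shows the style of XML-RPC hand-off described above. It uses the Apache XML-RPC client for Java purely as an example (the system described here used the Cook Computing XML-RPC libraries), and the host, port, method name and parameter values are all hypothetical.

import java.net.URL;
import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

public class RunSubmissionClient {
    public static void main(String[] args) throws Exception {
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        config.setServerURL(new URL("http://automation-server.example.org:8000/RPC2"));
        XmlRpcClient client = new XmlRpcClient();
        client.setConfig(config);

        // Run metadata sent from the LIMS to the automation server (illustrative values).
        Object[] params = new Object[] {
            "RUN_1234",                        // run id
            "110115_GAII_FC42",                // run folder name
            "/sequencer/raw/110115_GAII_FC42", // run location on shared storage
            "/configs/RUN_1234_gerald.xml"     // generated configuration file
        };
        Object status = client.execute("pipeline.processRun", params);
        System.out.println("Automation server replied: " + status);
    }
}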

Key technical challenges and solutions

The system presented here involves many different components, including interfacing with instruments, heterogeneous computing facilities (i.e., the computer on the instrument, a local data processing server, and a remote computer cluster), a database, a graphical user interface, and both proprietary and public software. These heterogeneous hardware and software components pose many challenges for system development. Below is a list of key technical challenges encountered during the implementation and our solutions.

1.) Since NGS result files are very large, sessions often time out during downloads. To keep sessions alive, file downloads are streamed to the client. Currently our rates range from 1 Mbs to 5 Mbs on a typical day. This part of our architecture is currently being upgraded due to limitations in the current hardware and network. We are looking at ways to speed up transfer rates to around 15-20 Mbs.

2.) In order to transfer files to and from the compute cluster, the Python scripts make rsync calls. This is done on the server on a separate thread (i.e. an asynchronous call, since rsync can take a long time) and uses both the '-arv' and '-e ssh' options. These options can slow down an rsync, but since this step is not a bottleneck for us and we generated an ssh key, we are content for now with this approach.

3.) A specific Gerald pipeline run must first be generated using make. Therefore our Python server dynamically generates a PBS file that takes the name of the newly created Gerald folder before Gerald can be run. It does this by parsing the output of the Gerald make command and passing the folder name as a variable to the new PBS script. This is all done using Python on the compute cluster.

4.) The automation pipeline is distributed across four different points: the GAII raw data folder and the Python server, the QUEST LIMS, the compute cluster, and finally the data portal. To integrate this distributed architecture, messages are sent and received using XML-RPC with the Cook Computing libraries.

QUEST – A LIMS system interfacing with the sequencer

The features listed above were added to an existing LIMS system, called QUEST (https://bisr.osumc.edu), which enables users and labs to track results in studies, samples, and experiments. Appendix B shows the two main actor groups, a researcher or lab and the NGS sequencing core, along with the various use cases and an overview of the QUEST system.

For lab researchers, this system enables pre-experiment planning by submitting sample, lab and experimental protocol information. Such information can be deposited in the database even before the samples are generated. This greatly facilitates planning for the sequencing core manager. Once the experiments are finished and data are available, the researchers can download the data, share the data (e.g., with bioinformatics collaborators for advanced analysis), visualize the data (e.g., on the UCSC Genome Browser or the Integrative Genomics Viewer), and publish the data to the public once the paper is accepted.

From the LabCentral page, users can also upload results to the UCSC Genome Browser (via an OSU mirror installation) and share results with other labs through the Collaboration panel. The database panel shown in Figure 6 also enables researchers to integrate high-throughput data besides NGS data, such as microarray results.


Figure 7 shows the new Configuration Management screen for configuring Illumina GAII runs, which associates flow cell and lane information with sample meta-data. The sequencing core manager can list sequencing runs, filtering and displaying them by a variety of options, and can link to a particular run. All sequencing run configurations, results, processing status, and associated meta-data are archived for easy retrieval. The processing status is also visible to the lab researchers so that they can track the progress of their sequencing experiments.


Figure 3 - Execution of the Configuration file.


Figure 4 Main Page for viewing Studies


Figure 5 GAII run history and comments/analyzing instructions.


Before the sequencing experiments, sequencing runs need to be configured by the sequencing core manager or operator. The configurations are composed of two primary panels: one for flow-cell properties (Figure 6) and the other for lane properties (Figure 7). Each panel is configurable and its settings are persisted between sessions. On the GAII there are eight lanes per flow cell, and a sample can be sequenced in either one lane or multiple lanes. This can be easily expanded to accommodate more lanes used in the updated Illumina HiSeq 2000. Logging information, such as the time for the sequencing run and processing, is shown in the bottom window in Figure 6.

Figure 6 - Flow cell properties header.


The Lane Properties panel stores the study, sample, genome, assay, organism, and group information (Figure 7) for each of the eight lanes on a flow cell. By choosing the organism associated with each sample, the operator also specifies the reference genome to which the short reads will be mapped.

Figure 7 - An example of lane properties panel for a flowcell (showing first 4 lanes).


From the information on the two aforementioned panels, QUEST generates two configuration files displayed on a Configuration Manager (Figure 8). One configuration is for running Illumina's Gerald pipeline and the other is for creating a custom configuration file for packaging and distributing results to the client.

The Gerald pipeline is Illumina's proprietary data processing pipeline, which handles tasks including mapping short reads to the reference genomes using the ELAND algorithm. In this configuration, the users can specify many parameters, including the mapping parameters for ELAND (e.g., allowed number of mismatches, expected sequence length, allowed repeated mapping instances), the file paths (e.g., the folder for the mapping results) and the computing parameters (e.g., the number of CPUs to be used) for parallel computing. The second configuration file specifies the automation parameters and gets information from the previous lane and flow cell configuration panels. The automation parameters allow the users to tell the system the type of processing needed and the deposit location for the results files, as well as miscellaneous requests such as whether the results need to be copied to an FTP server for dissemination.

Both configuration files are saved (and can be overwritten) and sent to the compute cluster, along with the raw data files, as one of the initial steps of the automation pipeline. These configuration files are also reusable. The operator can load saved files and edit them as necessary before a new sequencing experiment.

Results


The Configuration Manager has been deployed in a production capacity since December 2009 and had configured and catalogued results from over 75 Illumina GAII flow cells for two GAII sequencers by January 2011. The automation pipeline has been in production since May 2010. Over 700 samples have been processed, and over 5,000 result files and 4 TB of processed data had been stored as of early 2011.

Conclusion and discussion

In this paper we present a scalable automation pipeline for managing and processing the massive amount of sequencing data generated by NGS sequencers. Our system is extremely configurable. The compute cluster and data portal can be changed with a simple modification to a configuration file. To add a new pipeline, all that is needed is a new Python module, or handler, to generate the required PBS file for the compute cluster. We anticipate adding new modules as additional NGS technologies are acquired, such as ABI SOLiD, or as we adopt additional algorithms for analysis. There is actually no change to the architecture when new sequencers are added that already have existing Python handlers (such as the GAII); only an entry in a configuration file denoting the name and location of the new sequencer is needed.

The next chapter introduces work that examined polymerase fidelity under conditions of translesion synthesis on thymine dimers. The original software and data that accompanied that work were used as a starting point for the prototype discussed later in this thesis. Originally, the HT-SOSA (Taggart) assay came with an accompanying software program called NGSPositionCounter. This software counted discrepancies between the reference sequence and the experimental sequence using a global-alignment algorithm called Needleman-Wunsch. The thesis software prototype uses the same experimental data and the same alignment algorithm, but does so using a new library for efficiently storing the experimental data set. The next chapter provides background on what the original software, NGSPositionCounter, added to the HT-SOSA translesion experiments.


CHAPTER 4: OUTLINES OF A SOFTWARE LIBRARY FOR NGS STORAGE

This chapter introduces the HT-SOSA assay and the NGS Position Counter software as described in the following publication excerpt.

Overview of Biological Domain

DNA damage often causes replicative DNA polymerases to stall at lesion sites, arresting DNA synthesis and eventually leading to apoptosis. To rescue stalled replication machinery, cells often switch to a DNA polymerase that is specialized to bypass DNA lesions, a process known as translesion DNA synthesis (TLS). Among the six DNA polymerase families, most of the DNA lesion bypass polymerases phylogenetically belong to the Y-family. The Y-family polymerases lack proof-reading exonuclease activity and catalyze both error-free and error-prone TLS. Notably, 4 of the 16 identified human DNA polymerases, designated as DNA polymerases eta (hPolη), kappa (hPolκ), iota (hPolι) and Rev1 (hRev1), are in the Y-family. Although there is significant overlap in the lesion bypass abilities of these four enzymes, the nucleotide incorporation efficiency and fidelity of each human Y-family polymerase during TLS is likely lesion specific. Therefore, it is possible that each Y-family polymerase has evolved to bypass a specific set of lesions in vivo. However, with the exception of ultraviolet (UV)-induced cyclobutane pyrimidine dimers (CPDs) such as cis–syn thymidine-thymidine (TT) dimers, the question of which Y-family polymerase is responsible for the bypass of a specific lesion type remains unanswered.

CPDs are estimated to account for 80% of UV-induced mutations within mammalian genomes. Among the human Y-family members, hPolη is known to catalyze the mostly error-free bypass of cis–syn TT dimers in vitro (4–6) and in vivo (7–9). Inactivation of hPolη through genetic mutation or deletion leads to the xeroderma pigmentosum variant disease, which predisposes individuals to increased incidence of skin cancer (9–12). In the absence of hPolη, both hPolκ (7) and hPolι (13–15) have been proposed to be responsible for the error-prone TLS of cis–syn TT dimers. Thus, it is biologically important to investigate the mutagenic patterns of TLS of a cis–syn TT dimer catalyzed by these human Y-family enzymes.

To analyze the mutagenic profiles, we developed specialized software called the 'Next-Generation Sequencing Position Counter' to align and quantify millions of DNA sequences. Therefore, in comparison with our first generation of the short oligonucleotide sequencing assay (SOSA) and other methods in the literature, HT-SOSA enables the assessment of the mutagenic consequences of lesion bypass in a cost-effective manner, with exponentially increased sequencing information.

Analysis of DNA sequences

Initially, all of the raw sequence reads that matched the PhiX genome were removed. Sequence reads that contained one or more base calls that were not identified with >99.9% accuracy (Phred quality score of <30) were subsequently removed by using the NGS QC Toolkit (26). The remaining sequence reads were sorted into groups corresponding to the six unique barcode sequences used for analysis (i.e. Iota-control, Iota-damage, Kappa-control, Kappa-damage, Eta-control and Eta-damage) and stored as individual Sequence Alignment/Map (SAM) files. After sorting, erroneous sequences were further removed by eliminating all sequence reads that did not perfectly match the reference sequences within the first six nucleotides, which consist of the four-nucleotide barcode sequence and two adjacent nucleotides. Each SAM file of sequence reads was then analyzed by using a novel computer program called 'The Next-Generation Sequencing Position Counter'. This program, written in Java, aligned each query sequence within the SAM file to a reference sequence by using a Needleman–Wunsch algorithm (Needleman-Wunsch Program) and produced an audit file containing an annotated 'best fit' alignment for each sequence.

The program simultaneously tallied the total number of matches, mismatches (i.e. substitutions), insertions and deletions at each template position of the query sequence relative to the reference sequence by using the 'best fit' alignments. In cases where the Needleman–Wunsch algorithm reported multiple 'best fit' alignments of the query and reference sequences with equivalent alignment scores, one alignment was selected for analysis at random to avoid alignment biases within our data sets. The sequence analysis for each SAM file was calculated and summarized in an output text file. Additionally, the details pertaining to how the program aligned and tabulated the mutations for each individual sequence read were also output to a separate audit file. The Next-Generation Position Base Counter software is available for download at: http://bisr.osumc.edu/temp/NGSPositionCounter.tar.gz and https://chemistry.osu.edu/suo.3/index.html.

The base comparisons at each template position along the sequence read were scored as follows. Each base in the query sequence read was compared with the corresponding base in the reference sequence. If the two bases were identical, the comparison was scored as a match. If the two bases were different, the comparison was scored as a mismatch. Insertions or deletions were indicated with a dash. If the dash was reported in the reference read, the comparison was scored as an insertion in the query sequence. If the dash was reported in the query sequence, then the comparison was scored as a deletion. For insertions, the numbering for the reference read was adjusted from the spot of the insertion.
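As a minimal sketch of the scoring rules just described (not the NGSPositionCounter source), the following counts matches, mismatches, insertions and deletions from a pair of gapped alignment strings, where '-' marks an indel:

public class AlignmentTally {
    public static void main(String[] args) {
        // Toy gapped output of a global alignment; both strings have the same length.
        String reference = "ACGT-ACGTACGT";
        String query     = "ACGTAAC-TACTT";
        int match = 0, mismatch = 0, insertion = 0, deletion = 0;
        for (int i = 0; i < reference.length(); i++) {
            char r = reference.charAt(i);
            char q = query.charAt(i);
            if (r == '-') {
                insertion++;          // dash in the reference: extra base in the query
            } else if (q == '-') {
                deletion++;           // dash in the query: base missing from the query
            } else if (r == q) {
                match++;
            } else {
                mismatch++;           // substitution
            }
        }
        System.out.printf("match=%d mismatch=%d ins=%d del=%d%n",
                match, mismatch, insertion, deletion);
    }
}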


Figure 8 - Supplemental HT-SOSA Sequences

The NGSPositionCounter software ran adequately as a desktop command-line Java program accessing a SAM or FASTA file. But what if we wanted to run the same sort of analysis using the Needleman-Wunsch algorithm on a much larger data set? What if the sequencing corpus were in the terabytes, or even petabytes? Storing the files in a directory and processing them using the Java program NGSPositionCounter, or even simple Python or R scripts, would probably not be adequate. In that case, one needs to resort to a distributed, parallel model of computation and storage. In the next chapter, we take a look at compression and encoding techniques for efficient storage.


CHAPTER 5: EFFICIENT STORAGE OF NGS DATA

Overview

Genomic data is stored as characters in text files; examples are FASTA and SAM files. This chapter examines techniques for efficiently representing DNA base pairs that are normally stored as ASCII characters. The chapter starts off by looking at the SAM file format for storing DNA bases and then examines reference-based compression and various encoding techniques for a more compact representation. Next, the chapter explores a software design that captures the efficient storage techniques previously discussed and makes use of a SAM-like interface for interoperability with other libraries and programs. Finally, the new software design is applied, as a prototype, to the same data and alignment algorithm (i.e. Needleman-Wunsch) used in the NGSPositionCounter software, and the results are discussed.

The SAM Format

SAM files are ASCII-encoded text files with specific tab-delimited formatting. BAM files are a binary representation of SAM files that encode sequences (the seq field) using 4 bits per base. Thus one byte, the minimum size of a data type in most programming languages, can encode two bases. Figure 9 displays the SAM file format.


Figure 9 - SAM File Format Column Headers (SAM Format Specification)

Two ideas from the SAM file format that will be utilized in the thesis prototype, cigar strings and 4-bit encoding, are described below.

Cigar Strings

A cigar string, or Compact Idiosyncratic Gapped Alignment Report (SAM Format Specification), is a concise description of a sequence alignment, with minimal redundancy, contrasted with the reference. In a SAM file, when there is no reference the cigar string is empty. Since the 'M' operator can refer to a match or a mismatch, the researcher needs to look to the '=' and 'X' operators to know which. The idea is to get the most concise representation. Newer versions of SAM/BAM have an 'MD' field which eliminates the need for the rarely used 'X' and '=' operators and instead achieves a SNP/indel approach to denoting an alignment.

Figure 10 - Cigar String Operators (SAM Format Specification)
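To make the compression concrete, here is a small, self-contained illustration (independent of Picard and of the prototype) of how a cigar string such as 8M2I4M1D3M expands into its operations and how much of the read and the reference each operation consumes:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CigarExample {
    private static final Pattern OP = Pattern.compile("(\\d+)([MIDNSHP=X])");

    public static void main(String[] args) {
        String cigar = "8M2I4M1D3M";   // 8 aligned, 2 inserted, 4 aligned, 1 deleted, 3 aligned
        int readBases = 0, refBases = 0;
        Matcher m = OP.matcher(cigar);
        while (m.find()) {
            int len = Integer.parseInt(m.group(1));
            char op = m.group(2).charAt(0);
            // M, =, X consume both read and reference; I/S consume read only; D/N consume reference only.
            if (op == 'M' || op == '=' || op == 'X') { readBases += len; refBases += len; }
            else if (op == 'I' || op == 'S') { readBases += len; }
            else if (op == 'D' || op == 'N') { refBases += len; }
            System.out.println(len + " x " + op);
        }
        System.out.println("read length consumed: " + readBases + ", reference span: " + refBases);
    }
}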

4-bit Encoding

For unaligned sequence reads, one cannot take advantage of reference-based compression. However, a researcher can still exploit the fact that an entire byte is not needed to represent a single base. In fact, with 4 bits we can represent a maximum of 16 different bases. Table 1 assigns 4-bit patterns to genomic bases, with five out of sixteen possibilities occupied. The remaining slots are left open for other possibilities such as degenerate bases or RNA-specific bases. Typically, the four bases and the symbol 'N' are sufficient for most sequencing projects. In the HT-SOSA prototype, only the characters 'N', 'A', 'T', 'C', and 'G' are used. Most programming languages, including Java, use a byte as the minimum encoding length. Two 4-bit base encodings form a byte, or 8 bits. Thus, two DNA bases are represented by a single byte, doubling the storage efficiency.

38

4-bit encoding    Base
0000              N (indeterminate, low-quality read, no read)
0001              A
0010              T
0011              C
0100              G
0101 - 1111       (unassigned)

Table 1 - 4-bit encoding of DNA bases
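The arithmetic behind Table 1 can be sketched as follows. Note this is only an illustration of 4-bit packing using bit shifts; the prototype itself avoids bit-shifting techniques, as discussed later in this chapter.

public class FourBitCodec {
    private static final char[] BASES = {'N', 'A', 'T', 'C', 'G'};   // codes 0..4, per Table 1

    private static int code(char base) {
        for (int i = 0; i < BASES.length; i++) {
            if (BASES[i] == base) return i;
        }
        return 0;   // treat anything unexpected as 'N'
    }

    // Packs a base string into bytes, two bases per byte (high nibble first).
    public static byte[] encode(String seq) {
        byte[] packed = new byte[(seq.length() + 1) / 2];
        for (int i = 0; i < seq.length(); i++) {
            int nibble = code(seq.charAt(i));
            if (i % 2 == 0) {
                packed[i / 2] = (byte) (nibble << 4);   // high nibble
            } else {
                packed[i / 2] |= (byte) nibble;         // low nibble
            }
        }
        return packed;
    }

    // Recovers the original base string; the length tells us whether the last low nibble is used.
    public static String decode(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int b = packed[i / 2] & 0xFF;
            int nibble = (i % 2 == 0) ? (b >> 4) : (b & 0x0F);
            sb.append(BASES[nibble]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String seq = "ACGTNACGT";
        byte[] packed = encode(seq);
        System.out.println(seq.length() + " bases stored in " + packed.length + " bytes");
        System.out.println("round trip: " + decode(packed, seq.length()));
    }
}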

But 4-bit encoding does leave a number of possibilities on the table that a more efficient scheme would utilize. The table in Appendix A, 8-Bit Encoding Scheme Comparison, compares the 2-bases/byte (4-bit) encoding approach with the 3-bases/byte encoding approach.

The table in Appendix A shows that a 4-bit encoding maps 25 possible two-base combinations (i.e. 5^2 = 25 possibilities). However, 8 bits allow for 256 possibilities (e.g. the extended ASCII chart) and can be used to represent more than just the 25 base-pair combinations that one gets using a 4-bit scheme. If extended, the 3-bases-per-byte scheme can encode more information in fewer bits. Unfortunately, with 5 possible genetic bases {A, C, T, G, and N}, the scheme cannot be extended to 4 bases per byte and cover all the possible combinations within an 8-bit configuration.

Table 2 below summarizes the various encoding schemes.

Scheme      # of bases in a byte    # of combinations    Bits per base    Unused configurations (slots)
8-bit       1                       1                    8                255 (256-1)
4-bit       2                       25 (5^2)             4                231
2.67-bit    3                       125 (5^3)            2.67             131
2-bit       4                       625 (5^4)            2                NA

Thoughts on Huffman Compression

Table 2 compares the various encoding representations for both the 4-bit and 3-bases/byte (2.67-bit) encoding schemes. 4-bit encoding uses only 25 of the possible 256 representations, while 3-bases/byte encoding uses 125 of the possible 256 representations. Respectively, 231 and 131 possible configurations are not utilized in the table above. To improve the efficiency of a 4-bit or 3-base/byte encoding scheme, one should try to fill the remaining available slots with DNA base patterns. The most likely approach is to start pulling from 4-base patterns, where the sequence has a length of 4 drawn from the pool {'A', 'C', 'T', 'G', and 'N'}, until a pattern is assigned for each of the 256 possible representations, with the first 125 being 3-base patterns and the remaining 131 coming from a set of 4-base patterns.


Not all 4-base pair patterns can be represented as there are 625 possible configurations of 4-base pair combinations, but an approach similar to Huffman compression can be adopted where the most frequently encountered 4-base pair patterns are used to fill out the rest of the 256-slot table of possibilities.

Huffman compression builds a binary tree of all outcomes ordered by frequency and assigns the smallest informational unit to the most frequently encountered pattern.

Building the tree takes O(n log n) time and requires bit-shifting operations that are not available in all programming languages. In the accompanying software library, Huffman compression and bit-shifting techniques are not used for the following reasons:

1. Compatibility with external libraries written in Java such as Picard, NGS Position Counter, and the MongoDB driver. Java does not provide data types smaller than a byte, so sub-byte packing would require explicit bit manipulation.

2. Building the Huffman/binary frequency tree (i.e. the compression) incurs a cost whenever it must be updated to reflect a new data set, especially a large data set, which is typical in NGS scenarios. Using a dictionary approach (sketched below) is much simpler and faster, and the lookup runtime is O(1).
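A minimal sketch of such a dictionary (assuming the same five-letter alphabet; not the prototype's actual classes) enumerates the 125 three-base patterns once and then encodes and decodes by constant-time lookup:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ThreeBasesPerByteDictionary {
    private static final char[] ALPHABET = {'A', 'C', 'G', 'T', 'N'};
    private static final Map<String, Byte> toCode = new HashMap<>();
    private static final List<String> toPattern = new ArrayList<>();

    static {
        // Enumerate all 5^3 = 125 three-base patterns and assign byte codes 0..124.
        for (char a : ALPHABET)
            for (char b : ALPHABET)
                for (char c : ALPHABET) {
                    String pattern = "" + a + b + c;
                    toCode.put(pattern, (byte) toPattern.size());
                    toPattern.add(pattern);
                }
    }

    public static void main(String[] args) {
        String triplet = "ACN";
        byte code = toCode.get(triplet);              // O(1) encode
        String back = toPattern.get(code & 0xFF);     // O(1) decode
        System.out.println(triplet + " -> " + code + " -> " + back);
        System.out.println("dictionary size: " + toPattern.size() + " of 256 byte values used");
    }
}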

The figure below demonstrates how the software library approaches storage and base-pair representations based on alignment status. A similar figure is shown in the paper (Fritz, Leinonen, [...], & Birney, 2011). The primary difference between the prototype approach and the approach in the Fritz paper is that the prototype makes use of a 4-bit encoding scheme instead of creating a new reference genome based on unaligned short reads. The 4-bit approach could easily be swapped out for a 3-bases/byte approach by changing the dictionary associating each base-pair pattern with the corresponding ASCII code (e.g. the 1-byte/8-bit configuration). Aligned sequences use a version of cigar strings for encoding differences between the reference and the actual short-read genetic sequence. In situations where there is no difference between the reference and the query (e.g. experimental) sequence, the cigar string is empty, thus requiring no space to represent it.


Figure 11 – Base-Pair Storage Representation and Strategy (based on a similar graphic from the Fritz/Birney paper)


Figure 12 - Prototype Overview Diagram


Framework Architecture

There are a few software patterns used in the prototype implementation: the Factory Method, the Service Locator pattern, and the Inversion of Control pattern. The Inversion of Control design pattern (Wikipedia) is sometimes referred to as the "Don't call us, we will call you" principle. This pattern requires that the program's control is inverted and delegated to a skeleton framework and a series of sub-classes. The framework decides when certain operations are invoked and when they are not. It is the key feature used to distinguish between a framework and a library. A framework requires that specific functionality be plugged in to get the desired behavior. A library is simply a set of useful operations that can be invoked to simplify certain tasks. In the prototype implementation, the Inversion of Control pattern is achieved through use of the Factory Method (Gamma, Helm, Johnson, & Vlissides, 1995) and Service Locator (Wikipedia) patterns.

The NGSPositionCounter software uses the Picard library to read BAM files programmatically. Here is a code snippet of that usage:


SAMFileReader inputSam = new SAMFileReader(new File(samFile));
for (final SAMRecord samRecord : inputSam) {
    try {
        this.processSamRecordNeedle(samRecord, referenceRead);
    } catch (Exception exc) {
        // ..
    }
}

Figure 13 Reading a SAM file line by line

The above for loop, in the code listing in Figure 13, makes use of the Iterable interface (java.lang.Iterable). The Iterable interface is used for moving through a collection using a foreach statement. The software prototype also makes use of Iterable while decoupling the underlying storage representation from the interface that resembles the SAM Record collection. The prototype uses both a MongoDB NoSQL database and a FASTA file for the storage type.
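As a hedged illustration of the MongoDB side of that decoupling (this is not the prototype's persistence code, and the database, collection and field names are hypothetical), an encoded read could be written and read back with the 2.x MongoDB Java driver as follows:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class MongoReadStoreExample {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DB db = client.getDB("ngs_prototype");
        DBCollection reads = db.getCollection("seq_reads");

        BasicDBObject doc = new BasicDBObject("readId", "HT_SOSA_000001")
                .append("reference", "example_reference")       // name of the reference sequence
                .append("cigar", "")                            // empty cigar: perfect match to reference
                .append("packedSeq", new byte[] {0x12, 0x34});  // 4-bit packed bases
        reads.insert(doc);

        BasicDBObject query = new BasicDBObject("readId", "HT_SOSA_000001");
        System.out.println(reads.findOne(query));
        client.close();
    }
}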

The entry point into the NGS Storage framework is the NGSContainerMgr class. A few features of the class are as follows:


1.) It has a handle to the PersistenceMgr object through containment or a "uses" relationship.

2.) It makes use of the Iterator and Iterable interfaces, similar to the Picard SamFileReader class.

3.) Again through containment, it has a handle to a collection of SeqRead objects; SeqRead is an abstract class representing the NGS data.

4.) The NGSContainerMgr also stores the reference name.

Figure 14 The NGSContainerMgr
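A skeletal sketch along these lines is shown below. It uses placeholder interfaces standing in for the framework's SeqRead and PersistenceMgr classes (described later in this chapter), and the real NGSContainerMgr's fields and method names may differ.

import java.util.Iterator;
import java.util.List;

// Placeholders standing in for the framework's real classes.
interface PersistenceMgr { }
interface SeqRead { String getSequence(); }

public class NGSContainerMgrSketch implements Iterable<SeqRead> {
    private final PersistenceMgr persistenceMgr;   // "uses" relationship to the persistence layer
    private final List<SeqRead> reads;             // contained collection of SeqRead objects
    private final String referenceName;            // name of the reference sequence/genome

    public NGSContainerMgrSketch(PersistenceMgr persistenceMgr,
                                 List<SeqRead> reads,
                                 String referenceName) {
        this.persistenceMgr = persistenceMgr;
        this.reads = reads;
        this.referenceName = referenceName;
    }

    public String getReferenceName() {
        return referenceName;
    }

    // Iterable support, mirroring the foreach loop over Picard's SAMFileReader in Figure 13.
    @Override
    public Iterator<SeqRead> iterator() {
        return reads.iterator();
    }
}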


Easily extend or modify the attributes stored.

The SAM format (see table) is designed for storing raw data from a variety of sequencer formats and manufacturers. Researchers often want more control over which attributes are stored. For instance, a researcher may want to drop attributes that are not important or not relevant given their specific requirements, such as the quality score, the PL (platform identifier), or the FO (flow order). Conversely, one may want to add attributes that carry specific importance for the analysis at hand, such as a sample ID, a sample preparation date, or even a patient ID or Medical Record Number (MRN) for establishing links to legacy or clinical systems.

The NGS framework can have many subclasses of SeqRead. In order to support multiple subclasses, the Service Locator pattern could be used and implemented in a SeqRead locator: the SeqRead class will use a configuration file for tracking the various SeqRead subclasses via an Inversion of Control approach.

SeqRead represents a short sequence read typical of most shotgun sequencing applications. There is no strict requirement on the length of the sequence read, although there could be upper limits depending on the persistence device used. Some database types limit varchars, or strings, to a maximum number of characters, but current limits are well beyond the capabilities of even long-read sequencers, which can sequence roughly a thousand bases in a single read.

The SeqRead abstract class has a few essential features. First, it stores a unique key that identifies the specific read. Second, it stores the sequence, which has a special representation depending on whether the short-read is aligned. The getSequence method always returns the actual sequence, with all of the encoding and compression elements removed, so it can be used by any bioinformatics algorithm or human reader. The base constructor takes the sequence and an id, similar to Picard's SamRecord.getReadName. The remainder of the attributes are set and managed in the subclass.
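The following is a minimal sketch of this design. The field, class, and method names (ClinicalSeqRead, sampleId, medicalRecordNumber) are illustrative assumptions rather than the prototype's actual code, and the real SeqRead stores an encoded representation rather than a plain string.

    // Minimal sketch of the SeqRead idea described above.
    public abstract class SeqRead {
        private final String id;        // unique key identifying this read
        private final String sequence;  // plain bases; the prototype keeps an encoded form instead

        protected SeqRead(String id, String sequence) {
            this.id = id;
            this.sequence = sequence;
        }

        public String getId() { return id; }

        // Always hands back the human-readable sequence, regardless of how
        // the concrete subclass chooses to persist it.
        public String getSequence() { return sequence; }
    }

    // A subclass decides which extra attributes to keep (hypothetical sample
    // and patient identifiers shown here); whatever it drops from the
    // original SAM record makes the store lossy for that field.
    class ClinicalSeqRead extends SeqRead {
        private final String sampleId;
        private final String medicalRecordNumber;

        ClinicalSeqRead(String id, String sequence, String sampleId, String mrn) {
            super(id, sequence);
            this.sampleId = sampleId;
            this.medicalRecordNumber = mrn;
        }

        String getSampleId() { return sampleId; }
        String getMedicalRecordNumber() { return medicalRecordNumber; }
    }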

Depending on which attributes are selected in the SeqRead subclass, information can be lost. This approach values flexibility over data preservation when necessary. If an implementation drops quality scores because they are not deemed relevant to the domain, then technically the approach is a "lossy" format for NGS data management. It is up to the implementor and designer of the SeqRead subclass to determine which pieces of information are relevant and which are not. If the entire contents of the SAM file are maintained, then there is no loss of information.


Figure 15 The SeqRead abstract class

Abstracting the underlying sequence representation to leverage storage efficiencies

One of the key concepts in this framework is the efficient storage of NGS data using both compression and encoding techniques, while hiding the complexities of the representation from a software client. As the amount of NGS data generated increases, there are only a few ways to meet the demands of future growth (Fritz, Leinonen, [...], & Birney, 2011):

1.) Add Storage


2.) Throw Data Away (i.e. Triage)

3.) Compress Data

4.) Efficient Encoding

Of the above, this framework will explore two strategies for lossless sequence read storage. They are:

1.) Reference-Based Compression

2.) 4-bit Encoding

Reference-Based Compression

Reference-based compression is a technique that compares a sequence read against a reference and stores only the differences. There are a number of reference-based compression variants that differ slightly in the details and in how quality scores and unaligned sequences are handled.
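As a toy illustration of the idea (not the prototype's exact cigar-like encoding), the sketch below records only substitutions relative to a reference and returns an empty string when the read matches the reference exactly.

    // Toy illustration of reference-based compression: store only where an
    // aligned read differs from the reference.
    public class ReferenceDiff {

        // Assumes the read is already aligned to the reference at 'offset' and
        // differs only by substitutions (no insertions or deletions).
        public static String diff(String reference, String read, int offset) {
            StringBuilder sb = new StringBuilder();
            int matchRun = 0;
            boolean anyMismatch = false;
            for (int i = 0; i < read.length(); i++) {
                if (read.charAt(i) == reference.charAt(offset + i)) {
                    matchRun++;
                } else {
                    anyMismatch = true;
                    if (matchRun > 0) { sb.append(matchRun).append('M'); matchRun = 0; }
                    sb.append(read.charAt(i));   // record the substituted base
                }
            }
            if (!anyMismatch) return "";         // identical to reference: store nothing
            if (matchRun > 0) sb.append(matchRun).append('M');
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(diff("ACGTACGTAC", "ACGTTCGT", 0)); // prints 4MT3M
        }
    }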

A data storage question regarding unaligned reads is what, and how much, to triage. Should unaligned short-reads be discarded? Only low-quality unaligned reads? Some approaches (Fritz, Leinonen, [...], & Birney, 2011) create a pool of unaligned reads as a new reference. The CRAM project (http://genome.cshlp.org/content/21/5/734.full) combines quality information with nucleic-acid bases to maximize the number of bases represented per byte.


Figure 16 SeqReadRepresentation2

2 In the prototype software, the SeqReadRepresentation class is renamed to the base class, TransformedSequence.

The SeqReadRepresentation class is based on the Façade pattern, which provides simplified access to a much larger and more complex body of code or class library (The Facade Pattern). The SeqRead subclass only wants to pass and retrieve human-readable, algorithm-compatible sequences without regard to the storage. The Façade pattern guarantees that the complexities of navigating between 4-bit encoding and reference-based compression are hidden from the software client. SeqReadRepresentation then delegates, through a "uses" containment relationship, to the appropriate class: FourBitEncodedSeqRead is responsible for translating 4-bit sequences to regular 1-byte-per-base sequences, and ReferenceCompressionSeqRead is responsible for switching between variant calls and full sequence reads given a reference.
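A minimal sketch of this façade idea is shown below; SequenceFacade and Decoder are illustrative stand-ins for SeqReadRepresentation and its FourBitEncodedSeqRead / ReferenceCompressionSeqRead delegates, not the prototype's actual classes.

    // Illustrative façade: the caller asks for a plain sequence and the
    // representation object decides which decoder handles the request.
    public class SequenceFacade {

        // Stand-in for FourBitEncodedSeqRead / ReferenceCompressionSeqRead.
        public interface Decoder {
            String toPlainSequence();
        }

        private final Decoder delegate;

        public SequenceFacade(Decoder delegate) {
            this.delegate = delegate;
        }

        // Clients never see whether 4-bit encoding or reference-based
        // compression sits behind this call.
        public String getSequence() {
            return delegate.toPlainSequence();
        }
    }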

The ReferenceCompressionSeqRead class can be subclassed to provide the following:

1.) Static access to a reference – as in our prototype. (Taggart)

2.) A gateway to a subsystem for complex storage and lookup of genome references.

A gateway to a reference storage subsystem could be architected to make use of large SMP machines or a large memory cache, depending on the degree of scale that is required. Some potential candidates are Memcached (http://memcached.org/) and Apache Spark (http://spark.incubator.apache.org/). It could also expose web-service or RESTful API access to a number of different reference genomes. Below is what such an API might look like:


API Method Signature | Description
GetSequence(ReferenceName, chromosomeNumber, StartLoci, StopLoci) | Returns a sequence given the reference name/version and a genomic locus

Table 2 - API for reference genome lookups
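A hypothetical Java interface corresponding to Table 2 might look like the sketch below; the name ReferenceService and the parameter types are assumptions, and a local in-memory map, a Memcached-backed cache, or a REST client could all implement it.

    // Hypothetical reference-lookup interface matching the API in Table 2.
    public interface ReferenceService {

        /**
         * Returns the reference bases for the given region.
         *
         * @param referenceName     reference genome name/version
         * @param chromosomeNumber  chromosome identifier
         * @param startLoci         inclusive start position
         * @param stopLoci          exclusive stop position
         */
        String getSequence(String referenceName, String chromosomeNumber,
                           long startLoci, long stopLoci);
    }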

For the purposes of this thesis, a memory cache for storing references is out of scope; the focus is on short, static references stored inline, as given by the earlier work on HT-SOSA and NGSPositionCounter (Taggart et al., 2013).

Persistence Layer – Storage Options

The sequence data and its affiliated metadata need to be stored somewhere. In this prototype we use MongoDB, a NoSQL database. The MongoDB class requires the standard CRUD operations as well as a finder method. Let's take a look at the finder method in the MongoDBPersistenceMgr class.

The finder method in Figure 17 can be called from a client class as follows:

    MongoDBPersistenceMgr mdbMgr = new MongoDBPersistenceMgr(connectionString);
    Collection results = mdbMgr.finder("ID", "4579912");


    public static void finder(String key, String value) {
        DBCursor cursor = null;
        DB catalog = null;
        try {
            MongoClient dbConn = new MongoClient("localhost", 27017);
            catalog = dbConn.getDB("test");
            DBCollection collection = catalog.getCollection("test");
            BasicDBObject jsonQuery = new com.mongodb.BasicDBObject(key, value);
            cursor = collection.find(jsonQuery);
            while (cursor.hasNext()) {
                DBObject dbObj = cursor.next();
                String refName = (String) dbObj.get("referenceName");
                String sequence = (String) dbObj.get("sequence");
            }
        } catch (Exception exc) {
            // handle or log the failure
        }
    }

Figure 17 The finder method3

This works fine for a standalone code base. However, what if the NGS data is not stored in MongoDB but instead in a flat file, a FASTA file, a SAM file, an XML file, or even a relational database? There should be a way to make the above code operate the same way without the caller having to understand the underlying storage type.

3 In the thesis prototype, the finder method has been renamed to find and has a number of overloads.

The persistence handle type should be of the abstract class as shown:

    PersistenceMgr pMgr = PersistenceMgr.GetConnection();
    Collection results = pMgr.finder("ID", "4579912");

In the above, subtypes can easily be swapped without worrying about implementation details. The PersistenceMgr.GetConnection method is a Factory Method.

The Factory Method design pattern is:

“"Define an interface for creating an object, but let the classes that implement the interface decide which class to instantiate. The Factory method lets a class defer instantiation to subclasses” (Gang of Four) (Gamma, Helm, Johnson, & Vlissides, 1995)


Figure 18 The Persistence Layer

How does PersistenceMgr know which subtype to provide?

The Service Locator Pattern

"The basic idea behind a service locator is to have an object that knows how to get hold of all of the services that an application might need. So a service locator for this application would have a method that returns a movie finder when one is needed" (Martin Fowler).


The GetConnection method is shown in Figure 19 below.

    PersistenceMgr pMgr = ServiceLocator.GetConnection();

    class ServiceLocator {
        public static PersistenceMgr GetConnection() {
            return new MongoDBPersistenceMgr(connectionString);
        }
    }

Figure 19 The Service Locator.

Or the class can be more complex if there are multiple run-time possibilities for a given instance.

    public static PersistenceMgr GetConnection(String name) throws Exception {
        String subclass = PersistenceServiceLocator.find("PersistenceClass", name);
        return (PersistenceMgr) Class.forName(subclass).newInstance();
    }

    public static PersistenceMgr GetConnection() throws Exception {
        String subclass = PersistenceServiceLocator.find("PersistenceClass", "Default");
        return (PersistenceMgr) Class.forName(subclass).newInstance();
    }

Figure 20 PersistenceMgr Connection Handling


Figure 21 A sample configuration file

A benefit of this approach is that as the specific implementation changes, none of the existing code needs to change; moving from a flat file to a FASTA file or a MongoDB instance looks exactly the same to the client. This complies with both the Liskov Substitution Principle and the Open-Closed Principle.

The Liskov Substitution Principle states:

"Functions that use pointers or references to base classes must be able to use objects of derived classes without knowing it." (Object Mentor)

In this implementation, since Java does not use pointers as C or C++ do, the principle applies to base classes and interfaces. Any code that makes use of the abstract class PersistenceMgr need not know which specific subclass it is dealing with, since the virtual method will be overridden in the subclass and the proper implementation will be invoked. The Open-Closed Principle states that "software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification" (Meyer, 1988).

As we add new types of subclasses, we should never have to redefine the base class or the IPersistenceMgr interface. We will give our PersistenceMgr the following methods.

The Find method is overloaded as shown in Table 3 below. The first two versions search on metadata, or fields specific to the attributes in the NGS container such as MRN, ID, sex, age, chromosome number, and loci. The third version takes a sequence or character array, performs a sequence-based search, and could be implemented as a global or local alignment such as Needleman-Wunsch or BLAST. In the prototype this is not implemented; instead, a separate global-alignment library is used to perform the Needleman-Wunsch search. It could, however, be implemented in the MongoDB version using MongoDB's map/reduce functionality and the Hadoop connector.


Name | Params | Return Type | Description
Create | NGS Record/Doc | True/False | Creates a new entry
Update | NGS Record/Doc | True/False | Updates the specific record
Delete | NGS Record ID | True/False | Removes the specific record. Not implemented in many instances since our data stores are often read-only
Find | Key/value pair | Iterator/Collection | Metadata search based on the input
Find | Array of key/value pairs | Iterator/Collection | Metadata search based on multiple filters
Find | Sequence (string or char[]) | Iterator/Collection | Not implemented; rely on other processing techniques

Table 3 - PersistenceMgr CRUD types
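As an illustration of the second Find overload in Table 3, the sketch below builds a multi-filter query using the same legacy MongoDB Java driver as the finder method shown earlier. The collection and field names (referenceName, chromosome) are assumptions for the example only.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.MongoClient;

    // Sketch of a multi-filter find: every key/value pair becomes a field in
    // the query document, so all filters must match (an implicit AND).
    public class MultiFilterFinder {

        public static DBCursor find(DBCollection collection, String[] keys, String[] values) {
            BasicDBObject query = new BasicDBObject();
            for (int i = 0; i < keys.length; i++) {
                query.append(keys[i], values[i]);
            }
            return collection.find(query);
        }

        public static void main(String[] args) {
            MongoClient client = new MongoClient("localhost", 27017);
            DB db = client.getDB("test");
            DBCursor cursor = find(db.getCollection("test"),
                    new String[] {"referenceName", "chromosome"},
                    new String[] {"hg19", "7"});
            while (cursor.hasNext()) {
                System.out.println(cursor.next().get("sequence"));
            }
        }
    }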

Results

The S_5* files contain all of the original unmapped sequences for each of the polymerases used in the Taggart (i.e., translesion-synthesis/HT-SOSA) study. The prototype encoded the original data set using the 4-bit encoding technique described earlier. (Only the 4-bit encoding scheme, and not reference-based compression, could be used at this stage because it was prior to alignment via Needleman-Wunsch.)

Filename | Size (kb) | Windows ZIP Compression (kb) | % savings
Iota_Control_Original | 276,911 | 33,598 | NA
Iota_Control_headeronly | 148,915 | 16,465 | NA
Iota_Control_aligned | 246,050 | 32,528 | 24%
Iota_Control_encoded (4-bit) | 207,381 | 29,352 | 54%
S_5_all_original | 1,529,240 | 147,452 | NA
S_5_all_encoded | 1,197,071 | 137,437 | 48%
S_5_all_headeronly | 833,266 | 84,460 | NA

Table 4 - Comparison of various storage techniques

Table 4 above shows roughly 1.5 GB being reduced to 1.2 GB using the 4-bit encoding technique. However, 833 MB of that file is header information, which the prototype did not encode or compress. Excluding the header information, the savings in going from the original file to the encoded version was approximately 48%. Here is the calculation:


% savings = ((S_5_all_original − S_5_all_headeronly) − (S_5_all_encoded − S_5_all_headeronly)) ÷ (S_5_all_original − S_5_all_headeronly)

% savings = ((1,529,240 − 833,266) − (1,197,071 − 833,266)) ÷ (1,529,240 − 833,266)

% savings = (695,974 − 363,805) ÷ 695,974 = 0.477, or ~48%

After 4-bit encoding, the Needleman-Wunsch algorithm is used in the NGSPositionCounter software to perform the global alignment process described in chapter 4 of the Taggart paper (see appendix). The alignment process excluded sequences containing an 'N' as a quality step; in the prototype, those sequences were represented with 4-bit encoding. The alignment results from the Needleman-Wunsch algorithm were translated into a cigar-like string for reference-based compression and stored in the "aligned" FASTA file. The alignment step was performed only on the Iota_Control data set.

For the Iota_Control data set, the 4-bit encoding technique yielded a 54% savings. Interestingly, the aligned files (which combine 4-bit encoding and reference-based compression) did not fare as well, yielding only a 24% savings. This is most likely because the sequence reads were short and contained many differences; in many cases storing the cigar string was less efficient than storing the entire sequence using 4-bit encoding. With longer sequencing reads, the cumulative advantage of storing only the differences would be more pronounced.

The table above also shows the Windows compression size as a proxy for approaches that combine reference-based compression with general compression techniques, such as those described in the Fritz paper. The additional savings from Windows compression come from compressing the cigar string and the header line. Compressing the final result is worthwhile if the primary aim is storage efficiency rather than indexed search and computation.


CHAPTER 6: CONCLUSION

The ideas and their order in this thesis track closely with my own professional growth in approaching NGS-related challenges I encountered at The Ohio State University Comprehensive Cancer Center's Biomedical Informatics Shared Resource.

One of my first experiences processing and managing NGS data involved studies similar to the ChIP-Seq experiments discussed in chapter two. My initial approach involved copying a group of files from one network to another to be analyzed by a script or program on a workstation and, in some cases, a cluster. The results of the analysis and processing were then copied back, often with the original files, to the source directory.

During this period simple efficiencies were introduced, such as writing a script to handle recurring tasks, but little thought was given to the structure of a data repository and how such a repository could be searched, processed, and annotated.

Eventually a simple data management system called QUEST was developed, which cataloged NGS data through annotations like flow cell, run ID, sample ID, and experimental features, similar to the way microarray experiments were annotated using MIAME (FGED Society), only simpler.

Next, some improvements were made to the QUEST system which involved adding enterprise features such as scalability and automation. Chapter 3 discusses an extension to the QUEST system which enabled automation for queuing up NGS alignment jobs at the Ohio Supercomputer Center (OSC). The system dynamically created PBS scripts and packages through a simple web interface when a sequencing run was marked as complete. Data was migrated to OSC, the processing scripts were executed, and the results were sent back to the NGS repository in QUEST. The advantage of this approach was full automation, since minimal manual intervention was required to create a PBS script, copy the files to and from the server, and submit a job to the OSC job queue. A second advantage was that multiple jobs could be processed at once on the OSC infrastructure; each PBS job ran within its own context. The only limit to scalability was the number of nodes available for processing at OSC.

After working with Dr. Taggart on the translesion synthesis project, various techniques were considered for integrating alignment algorithms such as Needleman-Wunsch with more sophisticated storage systems. QUEST stored NGS data only as FASTA, BAM, and Qual files and managed only the file links, which were stored in a relational database. But thoughts turned to storing the actual sequencing data in a database and executing an algorithm on the results of a query. This line of thinking led to the work of the CRAM project and the reference-based compression techniques described in the Fritz paper.

Next steps

1.) Implement find(String sequence) for sequence similarity searches, similar to BLAST. In general, database searches, whether against NoSQL or SQL databases, use exact string matching for queries, but for DNA sequences one often has to rely on sequence similarity. This would be a valuable feature for a DNA database. Obviously BLAST-style searches can be very expensive, so adding this feature would require considerable thought in the design and architecture. Most likely any implementation would have to take advantage of parallelism and distributed technologies such as Hadoop or MapReduce, which leads nicely into step two below.

2.) Integrate a distributed parallel processing technology, such as MapReduce, for data processing. This feature involves extending the current design to simplify the steps a programmer would need to take in order to write a custom script or algorithm that analyzes NGS data based on the results of a query. This would be the equivalent of submitting a PBS script or MapReduce job using the provided interfaces of the NGSContainerMgr class. For instance, consider the following method signature:

    List<SeqRead> process(List<SeqRead>, DelegatorFx)

The above method receives a delegate (similar to a C# delegate) that acts upon the list of sequence reads passed in as the first parameter. A delegate is basically a function that can be passed around like an object type (Microsoft.com). Java does not support delegates natively, but there are workarounds that use the Reflection API's invoke method and careful interface design to come close to C#'s delegate functionality. Ideally, the framework would be smart enough to know whether a delegate should be invoked synchronously or asynchronously, and whether a MapReduce, multithreaded, or single-threaded approach is best suited. Considerations concerning callbacks in the case of asynchronous delegates would need to be taken into account as well.
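A minimal sketch of what such a delegate-style process method could look like in Java is shown below. DelegatorFx is modeled here as a plain interface (a functional interface in Java 8 terms), which is an assumption about its shape rather than the prototype's definition; the framework could later run it single-threaded, multithreaded, or as a MapReduce job without changing the signature.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of a delegate-style process method.
    public class SeqReadProcessor {

        public interface DelegatorFx<T, R> {
            R apply(T input);
        }

        public static <T, R> List<R> process(List<T> reads, DelegatorFx<T, R> fx) {
            List<R> results = new ArrayList<>();
            for (T read : reads) {
                results.add(fx.apply(read));   // invoke the "delegate" for each read
            }
            return results;
        }
    }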

3.) The find() method in the MongoDBPersistenceMgr class returns a List<SeqRead> collection, as required by the abstract parent class PersistenceMgr. In order to match the interface, the cursor must be iterated and the results stored in a SeqRead collection. This causes a couple of issues, such as poor performance (due to iterating through the cursor and creating a new collection) and the potential for OutOfMemory exceptions when very large result sets are returned from the search. In the prototype implementation, I had to add a new function not supported by the interface, findAll(), that avoids these issues by returning the MongoDB cursor. The advantage of returning the cursor is that the method call can take advantage of lazy loading. Lazy loading is "when an object doesn't contain all the data that one needs, but knows how to get it" (Fowler). Ideally the find() method could utilize lazy loading while still maintaining a clean and generic interface; a sketch of the idea follows this item. Since an abstract high-level PersistenceMgr interface that exposes MongoDB classes as parameters violates the Liskov Substitution Principle, returning a DBCursor object is not ideal, but adding the findAll() method was a necessary compromise for this version of the prototype. In the future, an effort should be made to maintain a clean, well-designed interface while taking advantage of the lazy loading of the cursor class.
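A minimal sketch of the lazy-loading idea, assuming a hypothetical LazyResultSet wrapper around the driver's DBCursor, is shown below. In the framework, the iterator would map each document into a SeqRead rather than returning raw DBObjects.

    import java.util.Iterator;
    import com.mongodb.DBCursor;
    import com.mongodb.DBObject;

    // Wraps the MongoDB cursor in an Iterable so callers get lazy loading
    // (documents are fetched as they are iterated) without the method
    // signature exposing DBCursor itself.
    public class LazyResultSet implements Iterable<DBObject> {

        private final DBCursor cursor;

        public LazyResultSet(DBCursor cursor) {
            this.cursor = cursor;
        }

        @Override
        public Iterator<DBObject> iterator() {
            return new Iterator<DBObject>() {
                @Override public boolean hasNext() { return cursor.hasNext(); }
                @Override public DBObject next()   { return cursor.next(); }
                @Override public void remove()     { throw new UnsupportedOperationException(); }
            };
        }
    }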

4.) The service locators were not necessary in the prototype since there is only a single SeqRead subclass (i.e. HT_SOSASeqRead). In an enterprise setting with many persistence managers and SeqRead subclasses, a service locator would be very valuable. In the situation where there are many SeqRead subclasses and invocation of the appropriate SeqRead class is rule-based and dynamic, a dependency injection framework to manage the various SeqRead subclasses should be considered.

5.) Another future improvement would be to utilize a memory cache for reference-based compression that is available within a distributed processing environment. In the prototype, the reference-based compression requirements were very simple, as there were only six sequences corresponding to each of the polymerase barcodes used in the HT-SOSA experiments. The reference sequences were stored as constant variables available to the program. However, in situations where the reference genome is very large (e.g. maize), this approach is not suitable and a lookup based on loci will be required. To do this efficiently, the reference genome should be available for fast lookup by multiple clients, nodes, or threads simultaneously. One approach is a memory cache such as Memcached (see http://memcached.org). Another possibility is a cluster-computing infrastructure that supports in-memory computation, like Apache Spark (http://spark.incubator.apache.org/). These and other similar options should be explored in future work.

6.) Implement a 3-bases/byte (i.e. ~2.67 bits per base) encoding scheme. In the prototype, only 4-bit encoding was used, but a 3-bases/byte version would be more effective. Along the same lines, taking advantage of the remaining unused representations out of the 256 possibilities, by placing the most common four-base words in the dictionary lookup step, would further improve storage efficiency. It might also make sense to look at a Huffman encoding approach for the remaining unused options.
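A minimal sketch of the arithmetic behind a 3-bases/byte scheme is shown below; the alphabet order and class name are illustrative assumptions. Each base is treated as a digit in base 5, so a triplet maps to one of 125 values and fits in a single byte, leaving the remaining codes free for terminators or common longer words.

    public class ThreeBaseCodec {
        private static final String ALPHABET = "ATCGN";

        // Pack a triplet of bases into a single value in the range 0..124.
        public static int encodeTriplet(char a, char b, char c) {
            return 25 * ALPHABET.indexOf(a) + 5 * ALPHABET.indexOf(b) + ALPHABET.indexOf(c);
        }

        public static String decodeTriplet(int code) {
            return "" + ALPHABET.charAt(code / 25)
                      + ALPHABET.charAt((code / 5) % 5)
                      + ALPHABET.charAt(code % 5);
        }

        public static void main(String[] args) {
            int code = encodeTriplet('A', 'C', 'G');  // 0*25 + 2*5 + 3 = 13
            System.out.println(code + " -> " + decodeTriplet(code));
        }
    }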

7.) An intelligent SeqRead manager (named TransformedSequenceMgr in the prototype) that determines which representation scheme to use (3-bases/byte or reference compression) based on which scheme would result in the more compact and efficient storage representation. In the prototype used for this thesis, a version of cigar strings was used for reference-based compression anytime there was an alignment; for unaligned sequences, or those that included an 'N', 4-bit encoding was always used. But in many cases the 4-bit encoding scheme was more compact than a long cigar string. Instead of relying on an alignment flag to determine the scheme, a better approach would be to algorithmically determine the more efficient storage scheme and use that representation, as in the sketch below.
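A minimal sketch of that size-based decision is shown below; the two size estimates are simplified stand-ins for the prototype's encoders, and the class and method names are assumptions.

    // Pick whichever representation would be smaller for a given read.
    public class RepresentationChooser {

        enum Scheme { FOUR_BIT, REFERENCE_COMPRESSION }

        public static Scheme choose(String read, String cigarLikeDiff) {
            int fourBitBytes = (read.length() + 1) / 2;   // two bases per byte
            int diffBytes = cigarLikeDiff.length();       // one byte per diff character
            return (diffBytes < fourBitBytes) ? Scheme.REFERENCE_COMPRESSION : Scheme.FOUR_BIT;
        }
    }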

8.) The prototype was developed on a 32-bit Windows laptop using Java version 1.7 and MongoDB version 2.4.6. The 32-bit MongoDB build has a 2 GB data limitation that does not exist in the 64-bit version. For a prototype this was fine, although it did require some awkward rewrites in a test case that wrote alignment results back into a new MongoDB collection; instead, the alignment results could only be written to a .fasta file. This was sufficient for the purposes of this prototype, but the next version should address this with a 64-bit installation.


Source Code and Results

Please contact [email protected] for source code, results, source data, and any questions.

Source code can be downloaded at: https://github.com/terrycamerlengo/MastersThesis


BIBLIOGRAPHY

C, T., & SL, S. (2009, May). How to map billions of short reads onto genomes. Nature biotechnology , 455-4577.

Camerlengo T, O. H. (2009). Enabling Data Analysis on High-throughput Data in Large

Data Depository Using Web-based Analysis Platform – A Case Study on Integrating

QUEST with GenePattern in Epigenetics Research. International Conference on

Bioinformatics in Biomedicine. Washington, D.C: IEEE.

FGED Society. (n.d.). Recommendations for Microarray Data Standards, Annotations,

Ontologies and Databases. Retrieved from MIAME Working Group: http://www.mged.org/Workgroups/MIAME/

Fowler, M. (n.d.). Lazy Load. Retrieved from MartinFowler.com: http://www.martinfowler.com/eaaCatalog/lazyLoad.html

Fritz, M. H.-Y., Leinonen, R., [...], & Birney, E. (2011). Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Research

.

Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1995). Factory Method Pattern. In

Design Patterns (pp. 107-116). Addison-Wesley.

Group, T. S. (n.d.). Sequence Alignment/Map Format Specification. Retrieved from SAM

Tools: http://samtools.sourceforge.net/SAMv1.pdf

72

McKinsey Global Institute. (June 2011). Big data: The next frontier for innovation, competition,and productivity.

McKinsey Global Institute. (May 2013). Disruptive technologies: Advances that will transform life, business,and the global economy.

Meyer, B. (1988). Object-Oriented Software Construction. Prentice Hall.

Microsoft.com. (n.d.). Delegates (C# Programming Guide). Retrieved from MSDN

Library: http://msdn.microsoft.com/en-us/library/ms173171.aspx

Mortazavi, A., Williams, B., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes. Nature Methods , 621-628.

Needleman-Wunsch Program. (n.d.). Retrieved from Google Code: http://code.google.com/p/gal2009/

Object Mentor. (n.d.). Liskov Substitution Principle. Retrieved from ObjectMentor.com: http://www.objectmentor.com/resources/articles/lsp.pdf

Ozer HG, B. D. (2009). A Comprehensive Analysis Workflow for Genome-Wide

Screening Data from ChIP-Sequencing Experiments. BICoB: 320-330: IEEE Press; 2009.

BICoB (pp. 320-330). IEEE Press.

S, P., B, W., & A., M. (2009, November). Computation for ChIP-seq and RNA-seq studies. Nature methods , 22-32.

SAM Format Specification. (n.d.). Retrieved from Sam Tools: http://samtools.sourceforge.net/SAMv1.pdf

73

Taggart, D. J., Camerlengo, T. L., Harrison, J. K., Sherrer, S. M., Kshetry, A. K., Taylor,

J.-S., et al. (2013). A high-throughput and quantitative method to assess the mutagenic potential of translesion DNA synthesis. Nucleic Acids Research .

Terry Camerlengo, H. G.-S. (2012). From Sequencer to Supercomputer: An Automated

Pipeline for Managing and Processing Next Generation Sequencing Data. AMIA Summits

Transl Sci Proc. (pp. 1-10). Epub 2012 Mar 19.

Terry Camerlengo, H. G.-S. (2012). From Sequencer to Supercomputer: An Automated

Pipeline for Managing and Processing Next Generation Sequencing Data. AMIA Summits

Transl Sci Proc. 2012, (pp. 1–10).

The Facade Pattern. In Design Patterns.

Wikipedia. (n.d.). Inversion Of Control. Retrieved from http://en.wikipedia.org/wiki/Inversion_of_control

Wikipedia. (n.d.). Service Locator Pattern. Retrieved from http://en.wikipedia.org/wiki/Service_locator_pattern

Z, W., M, G., & M., S. (2009). RNA-Seq: a revolutionary tool for transcriptomics.

Nature Review Genetics , 57-63.


APPENDIX A – COMPARISON OF 4-BIT AND 3-BASE/BYTE ENCODING SCHEMES

Count | Bit Encoding | 4-Bit Pattern | 3-base/byte Pattern
1 | 0000 0000 | A A | A A A
2 | 0000 0001 | A T | A A T
3 | 0000 0010 | A C | A A C
4 | 0000 0011 | A G | A A G
5 | 0000 0100 | A N | A A N
6 | 0000 0101 | T A | A T A
7 | 0000 0110 | T T | A T T
8 | 0000 0111 | T C | A T C
9 | 0000 1000 | T G | A T G
10 | 0000 1001 | T N | A T N
11 | 0000 1010 | C A | A C A
12 | 0000 1011 | C T | A C T
13 | 0000 1100 | C C | A C C
14 | 0000 1101 | C G | A C G
15 | 0000 1110 | C N | A C N
16 | 0000 1111 | G A | A G A
17 | 0001 0000 | G T | A G T
18 | 0001 0001 | G C | A G C
19 | 0001 0010 | G G | A G G
20 | 0001 0011 | G N | A G N
21 | 0001 0100 | N A | A N A
22 | 0001 0101 | N T | A N T
23 | 0001 0110 | N C | A N C
24 | 0001 0111 | N G | A N G
25 | 0001 1000 | N N | A N N
26 | 0001 1001 | A- (1 base ends) | T A A
27 | 0001 1010 | T- | T A T
28 | 0001 1011 | C- | T A C
29 | 0001 1100 | G- | T A G
30 | 0001 1101 | N- | T A N
31 | 0001 1110 | | T T A
32 | 0001 1111 | | T T T

[Rows 33-125 continue the 3-base/byte column through the remaining three-base words (T T C through N N N) in the same A, T, C, G, N order; rows 126-130 encode the single-base terminators A--, T--, C--, G--, N-- (1 base ends); rows 131-155 encode the two-base terminators AA- through NN- (2 base ends); codes 156-256 are unused in the 3-base/byte scheme.]

APPENDIX B – QUEST SYSTEM OVERVIEW
