HEMATOPOIETIC CELL POPULATION SEGREGATION THROUGH FULL-LENGTH TRANSCRIPTOME SEQUENCING

A Dissertation submitted to the Faculty of the Graduate School of Arts and Sciences of Georgetown University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Tumor Biology

By

Anne Deslattes Mays, M.Sc., M.Sc.

Washington, DC July 17, 2015

Copyright 2015 by Anne Deslattes Mays

All Rights Reserved

Anne Deslattes Mays, M.Sc., M.Sc.

Thesis Advisor: Anton Wellstein, M.D., Ph.D.

ABSTRACT

“Progress in science results from new technologies, new discoveries and new ideas, probably in that order.” Nobel Laureate Sydney Brenner (1927 - )

Sequencing the genome was a critical first step in laying the groundwork for understanding the molecular programming involved in transforming a cell from a healthy to a cancerous state. Cellular transcriptome complexity has become increasingly apparent as technological advances have exposed its diversity. Full-length RNA sequencing is crucial for an unbiased analysis of transcriptome complexity. This complexity is due to posttranscriptional processing of primary transcripts, which results in a variety of isoforms generated from the same genomic loci. Distinct cell lineages are defined by their transcript isoform expression profiles, and the annotation of cells can be derived from the expression of transcript isoforms that can result in functionally different . Alternate splice site utilization provides cells with a powerful regulatory mechanism of expression that can impact the composition of the product, and influence the rate of translation of transcripts from multi- . The overall goal of this project was to delineate the hematopoietic transcriptome revealed by full-length sequencing and assess the shortcomings of transcriptome reconstruction using fragmented-read sequencing. The aims were to (a) evaluate the complexity of the hematopoietic transcriptome using full-length RNA sequencing, (b) compare the full-length RNA-sequencing transcriptome with the transcriptome reconstructed from fragmented-read sequencing, and (c) evaluate whether hematopoietic cell subpopulations show distinct transcriptome patterns. Reconstructing transcripts from fragmented-read sequencing has advanced our understanding of the transcriptome. Here we show that full-length transcriptome sequencing is necessary to faithfully expose the transcriptome and understand its complexities. Abundance information and pathway analysis support this. In addition, full-length sequencing reveals open reading frames that code for contiguous canonical or fusion proteins that can be validated with peptides. This transcriptome diversity is consistent with distinct phenotypes of cell subpopulations present in tissues. Accurate transcriptome measurement builds a foundation that can be relied upon to ensure higher success rates for therapeutics and lower false discovery rates for biomarkers of disease.

The analysis of transcripts of a set of selected genes, as well as the potential for posttranscriptional processing, predicts a highly complex transcriptome and an abundance of hitherto unknown protein isoforms. Classic approaches have not allowed full testing of this hypothesis due to limitations in sequencing lengths. Taking advantage of full-length sequencing technology provides us with an opportunity to uncover transcripts that cannot be obtained through traditional transcript reconstruction techniques.

DEDICATION

Life and time are often inconvenient partners that challenge us to keep moving forward while circumstances do their best to derail us from our chosen paths. While heading up software development for Craig Venter and his team at Celera (sequencing the human genome) my husband suffered a serious stroke. A long and ongoing recovery period has followed. During this period my father was stricken with and subsequently died of lung cancer. Time passed all too quickly as I raised my daughter through adolescence while at the same time my mother began her slow drift into dementia. It was during this period between my father’s death and the early stages of my mother’s decline that I committed myself to pursuing my PhD. By no means an easy decision, and one that has tested the bounds of time, life and love for me and those around me.

I feel the need to complete the work that Craig Venter had asked us to do at Celera and to show my daughter what can be accomplished despite the adversity and circumstances that life and time present us. Craig challenged us first to sequence the human genome, and then to cure cancer. We are not there yet, and I am committed to helping accomplish that task. I feel strongly that we must use the skills and resources we have, from biological assay, to mathematical algorithm, to complex computer infrastructure, to beat down the details in such a way that we can dissect the signals of disease and understand its origins. We must ask the right questions, use the right tools, and work to de-obfuscate the information we have.

To my daughter Katie, thank you for your love and support throughout these years. You inspire me and challenge me in unexpected and surprising ways. Thank you for revealing what you see with your eyes, hear with your ears and create with your unbelievable imagination. I am so grateful for your understanding and patience throughout this journey. This thesis is dedicated to you.

ACKNOWLEDGEMENTS

I would like to thank my mentor, Dr. Anton Wellstein. He gave me a project that he thought was right up my alley. The journey presented challenges unseen at the beginning, yet ultimately produced results well worth the efforts.

I would also like to thank Dr. Anna Tate Riegel for allowing me to enter the program. Her belief that I could do this and her willingness to help me navigate through the process helped inspire me to try despite the obstacles and challenges.

I am grateful to my Thesis Committee Members - Dr. Michael Johnson, Dr. Anatoly Dritschilo, Dr. Habtom Resom, Dr. Yuri Gusev, and Dr. Christopher Loffredo. Their time, support and advice during the development of my project have been greatly appreciated.

A special “thank you” to Dr. Terence Ryan for all your support throughout these years and for being my external committee member.

I wish to thank Dr. Eric Schadt - a brief encounter at Cold Spring Harbor started me on the path to the completion of my journey. Thank you for getting Dr. Robert Sebra to sequence that first sample for me; I wouldn't be writing these words had you not started that ball rolling for me.

I wish to thank Dr. Mike Hunkapillar and Dr. Elizabeth Tseng and the Pacific Biosciences collaboration that made the completion of this work possible. Thank you, Liz, for your software, your friendship and your dedication. None of the work presented in this thesis would have been possible without it.

I would also like to thank present and past members of the Wellstein-Riegel lab, Drs. Sonia Rosenfeld and Elena Tassi, who shared differing lab corners over the years and who witnessed my trials and tribulations, supporting me with love and kindness without which I would not have made it to this end. To Garrett Graham, thank you for being an unofficial member of the lab; I am appreciative of the late night and weekend discussions regarding GRanges, BioViz and all other things Bioconductor. The spirit in this lab is without match -- the program is a special one of dedication and striving for proper and correct science.

I would like to thank current and past members of KeyGene, especially Dr. Arjen van Tunen, Dr. Leo Zwinkels, Dr. Mark van Haaren, Dr. An Michiels, Dr. Jan van Oeveren, Mike Cariaso and Matthew McCoy, and most recently Dr. Fayaz Khazi -- for your support throughout the years.

To the future Dr. Rutger van Bergem, it was wonderful to have a partner in this final 800 meters of the race. I am too slow a runner to beat you and Dr. Eveline Vietsch in a running race -- but I guess I got to finish just a hair ahead of you in this PhD race! Thank you for your encouragement and for pushing me along.

Finally, I would like to say a special thank you to Dr. Marcel Schmidt, who taught me all I know about working at the bench and has been a staunch supporter and friend throughout these years.

INDEX

CHAPTER 1 - INTRODUCTION

A. Genome, Genomic Loci of Genes, mRNA and mRNA Isoforms

B. Technological Advances Drive Discoveries

C. RNA Sequencing

D. Transcriptional Measurement Limitations

E. Cancer Discoveries and Landmarks

F. Hematopoietic Transcriptome

G. Transcript Expression, Structure and Mutational Landscape

H. Hypothesis, Goal and Specific Aims

CHAPTER 2 - MATERIALS AND METHODS

A. Bone Marrow Analysis Workflow

B. Healthy Bone Marrow Cells

C. RNA Preparation and Fragmented Read Sequencing (Frag-seq)

D. RNA Preparation and Full-Length Sequencing (FL-seq)

E. Peptide Analysis by Nano LC-MS/MS

F. Transcriptome Alignment and Assembly from Fragmented Reads

G. ToFU

H. Conversion of FPKMs to TPMs

I. Conversion of ToFU Abundance to TPMs

J. Open Reading Frame Prediction

K. Multiple Sequence Alignment

L. Sequence Alignment Editing

M. Naming Convention

N. Figures and Abundance

CHAPTER 3 - FULL-LENGTH RNA SEQUENCING REVEALS HUMAN BONE MARROW TRANSCRIPTOME COMPLEXITY

A. Abstract

B. Results

CHAPTER 4 - FULL-LENGTH RNA-SEQUENCING REVEALS SPECIFIC TRANSCRIPT ISOFORMS DISTINGUISH HUMAN BONE MARROW SUBPOPULATIONS

A. Abstract

B. Results

CHAPTER 5 - DISCUSSION AND FUTURE DIRECTIONS

GLOSSARY

BIBLIOGRAPHY

LIST OF FIGURES

Figure 2.1 - Bone marrow transcriptome analysis workflow

Figure 2.2 - Bone marrow processing detail

Figure 3.1a - Approach to transcriptome analysis

Figure 3.1b-k - ANXA1 and EEF1A1 transcript isoform detail

Figure 3.1l-n - Multiple sequence alignment for ANXA1

Figure 3.1o-q - Multiple sequence alignment for EEF1A1

Figure 3.2a-b - Number of transcript isoforms obtained from FL-seq and Frag-seq

Figure 3.2c-d - Representative genes with 3 to 13 canonical

Figure 3.3 - Multiple sequence alignment for ELANE and CFD

Figure 3.4 - Multiple sequence alignment for HLA-A, HLA-B and HLA-C

Figure 4.1 - ENO1 transcript isoforms

Figure 4.2 - ENO1 multiple sequence alignment

Figure 4.3 - PKM transcript isoforms

Figure 4.4 - PKM multiple sequence alignment

LIST OF TABLES

Table 3.1a - Top 5 isoform counts for genes from FL-seq for lineage-negative (N)

Table 3.1b - Top 5 isoform counts for genes from FL-seq for total (T) bone marrow

Table 3.1c - Top 5 isoform counts from combined total (T) and lineage-negative (N) FL-seq

Table 3.1d - Top 5 isoform counts for genes from Frag-seq for lineage-negative (N)

Table 3.1e - Top 5 isoform counts for genes from Frag-seq for total (T) bone marrow

Table 3.1f - Top 5 isoform counts from combined total (T) and lineage-negative (N) Frag-seq

Table 3.2 - Gene names for transcript isoforms

Table 3.3 - Number of exons, transcript isoform numbers (Trans #) for genes in Figure 3.2c

Table 3.4 - Identifiers, gene names and full description in Figure 3.2d

Table 3.5 - Exon counts, transcript isoform frequency distributions from Frag-seq or FL-seq

CHAPTER 1 - INTRODUCTION

A. Genome, Genomic Loci of Genes, mRNA and mRNA Isoforms

The human genome contains regions that are regulatory and regions that code for mRNA transcripts. Transcripts generated by RNA polymerase from a given genomic locus may be spliced and may have alternative start sites or alternative ends, depending upon the individual cellular potential, the cellular environment, the timing, and the actions of upstream positive and negative regulators. The region of the genome marking the transcriptional start site of a gene is called the promoter. The promoter resides at the 5' end of a given gene and often contains a TATA box, which helps position the transcription machinery at the start of the gene. Proteins that bind DNA and control transcription are transcription factors. Transcription factors bind to short DNA sequences called binding sites, which are located upstream of a gene under active regulation. The core promoter that guides the RNA polymerase is positioned in this region; transcription initiates there and continues to the 3' end of the transcribed region. The resulting primary transcript contains both introns and exons. Introns are removed by the spliceosome, resulting in the final, mature mRNA. The mature mRNA is defined by a 5' cap, a 5' UTR, a translational start codon followed by protein-coding sequence, a stop codon, and a 3' UTR with a polyA tail.

While the genome is relatively static, the human transcriptome is dynamic. The dynamism is necessary to enable properly coded responses in tissues of multicellular organisms and to adapt to the environment. Distinct transcript isoforms represent one way a cell may actively participate and respond within this environment. Each transcribed mRNA may be subject to posttranscriptional modifications that result in distinct transcript isoforms. Our understanding of the potential for these alternatives has grown with our ability to detect and quantify differences, both in terms of their level of expression and in terms of the alternative mRNA structures and proteins that can result from distinct transcript isoforms.

B. Technological Advances Drive Discoveries

Scientific progress occurs when technological advances enable novel insight into the processes under study. Rosalind Franklin pioneered the use of X-rays to create images of large biological molecules. Franklin adjusted the equipment to produce fine beams, extracted DNA fibers, and arranged them in parallel bundles. She studied these fibers' reactions under different conditions and discovered crucial keys to their structure (1, 2). Using this information, Watson and Crick were able to solve the structure of DNA (3). In 1975, Sanger reported on the first DNA sequencing method (4). The Sanger method consists of primer elongation and chain termination, and it was later adapted to modern automated sequencing.

Reagents added to the primer and template include the deoxynucleotides (dNTPs) as well as terminating dideoxynucleotides (ddNTPs), which were originally labeled radioactively and in modern sequencing are labeled with fluorophores. When the DNA polymerase incorporates a ddNTP, the elongation reaction stops. In the classic Sanger method, autoradiographs of size-separated mixtures of fragments showed the sequencing ladders, whereas modern sequencing uses imaging systems that read the randomly terminated, elongated DNA products after size separation. The reads are then computationally processed to generate the final sequence reads.

Two years after the publication of the Sanger method, in 1977, Maxam and Gilbert published their method, which required radioactive labeling at one 5' end of the DNA (5).

Leroy Hood invented the automated DNA sequencer in 1986 in collaboration with Applied Biosystems (6). The creation of the ABI 3700 enabled Celera to produce its version of the human genome and enabled the completion of the public effort, the Human Genome Project. This innovation, together with the novel algorithmic approach of whole-genome shotgun sequencing, produced the draft human sequence several years ahead of schedule and at lower than projected costs.

Advancements in gene expression methods accelerated our understanding of the global gene expression environment. Expressed sequence tags (ESTs), obtained using sequencing technology, represented an advance in probing for gene expression. Alternative methods for measuring and assessing gene expression advanced in 1994 when Edwin Southern, together with Uwe Maskos, created the microarray. In this method, which was inspired by the Southern blotting technique invented by Southern (7), probes are synthesized and then attached to a solid surface by covalent bonds in a chemical matrix (8). Southern blotting detects the presence of DNA fragments in samples as a fingerprint.

Microarrays are used to probe for differences through hybridization. The probes are typically designed as small fragments of DNA representing a transcript sequence of interest. Global gene expression analysis permits the profiling of thousands of genes simultaneously. This technological advance enabled analytical techniques that put gene expression within a larger context. Moving from one gene at a time to global gene expression analysis enabled network analysis techniques, such as weighted gene co-expression network analysis, to further dissect the coding programs used by a cell in response to environmental pressures and disease (9). Understanding patterns of global gene expression reveals biomarkers of disease and putative targets for therapeutic discovery. Cost reductions and technical strides accelerated our understanding of genes, their global expression patterns, and the context of these expressed genes within the global cellular environment (10).

C. RNA Sequencing

Next generation sequencing pushed the throughput of sequencers to another level, producing output that now outpaces Moore's law, more than doubling every two years since its introduction in 2007 (11, 12). Since then, next generation sequencing and its derivatives have increased in throughput and decreased in cost, enabling diverse characterizations of the genome, transcriptome, and epigenome. However, the massively parallel sequencing of fragmented cDNA libraries is not without artifacts, and caution was urged as early as 2009 regarding the conclusions to be drawn from its products (13), as discussed below.

The decline in sequencing costs, the decline in computational costs, and the speed with which assembly can be performed have made sequencing ubiquitous (14). These cost reductions and the accompanying technical advances enabled the characterization of the transcriptome, capturing differences in sequence, structure, and abundance (15, 16).

Assembly of fragmented sequencing reads at the transcript level is more complex than at the DNA level, which alone is complex enough (17). The Assemblathon 2 (18), completed in 2013, showed that assembly at the DNA level is in good shape but still not perfect. There are still regions of the genome that cannot be completely resolved, for example the variable regions that represent the immune repertoire within the MHC class I and MHC class II molecules. In addition, the immunoglobulin locus is complex, rearranged, and different between individuals, which makes a single reference infeasible and challenges the traditional ways of classifying the immune repertoire of an individual.

These genome rearrangements represent only part of the picture when considering transcripts (19). At the transcriptome level, assembly is even more complex due to posttranscriptional processing. The number of potential transcripts at any one genomic locus grows approximately exponentially with increasing transcript complexity (> 2 exons).
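To make the scale of this growth concrete, the following Python sketch counts the isoforms that exon skipping alone could produce at a single locus. It is a simplified illustration and not the method used in this thesis: it assumes that the first and last exons are always retained and that every internal exon is independently included or skipped. The function names `count_exon_skipping_isoforms` and `enumerate_isoforms` are hypothetical, and real loci add alternative start sites, end sites, and splice sites on top of this count.

```python
from itertools import combinations


def count_exon_skipping_isoforms(n_exons: int) -> int:
    """Count isoforms under a simplified exon-skipping-only model.

    Assumption (illustrative only): the first and last exons are always
    present, and each internal exon is included or skipped independently.
    """
    if n_exons < 1:
        raise ValueError("a transcript needs at least one exon")
    internal = max(n_exons - 2, 0)
    return 2 ** internal


def enumerate_isoforms(exons):
    """List every exon combination allowed by the same simplified model."""
    if len(exons) < 2:
        return [tuple(exons)]
    first, *internal, last = exons
    isoforms = []
    # Choose every subset of internal exons, preserving genomic order.
    for k in range(len(internal) + 1):
        for kept in combinations(internal, k):
            isoforms.append((first, *kept, last))
    return isoforms


if __name__ == "__main__":
    for n in (2, 3, 5, 10, 13):
        print(f"{n} exons -> {count_exon_skipping_isoforms(n)} possible isoforms")
```

Under these assumptions the count doubles with every additional internal exon, which is why genes with more than two exons quickly exceed what can be enumerated or unambiguously reconstructed.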

In 2010, Pacific Biosciences introduced single-molecule real-time (SMRT) DNA sequencing. The technology is capable of reading very long lengths, so fragmentation of the cDNA library is not required (20). Single-pass reads in SMRT sequencing are more error prone, with a median error rate of 11%. However, because the sequencing errors are stochastic, building a consensus from multiple passes, in each of which a base is accurate roughly 9 out of 10 times, achieves a final sequencing accuracy at the 99.999% level (21, 22). There is no bias with regard to DNA sequence or GC content (23). The technology involves affixing a polymerase at the bottom of each zero-mode waveguide (ZMW) well (24). The cDNA library, either as a size-selected subpopulation based upon the unfragmented length of the transcripts or by direct loading of the entire non-size-selected library, is distributed over the SMRT Cell. The resulting circular consensus reads are the product of a polymerase reading the same transcript multiple times. The resolution at the transcript level is one consensus read per well. The polymerase adds one nucleotide at a time in the synthesis reaction, each carrying a fluorophore that is excited by a laser, and a movie is recorded for a variable period of time.

The pulse wave is then deciphered by algorithms that resolve the transcript sequence. The full length of the transcript is sequenced from 5' to 3'. Within the course of a 4-hour movie, the transcript is typically read > 10 times. The algorithms then identify transcripts at a throughput of 50,000 transcripts per experimental read. This is done without a reference genome and without assembly. A newly developed algorithm pipeline, Transcripts of Full and unassembled length (ToFU) (25), clusters the > 10 reads and thereby achieves transcript alignment to the genome at accuracy levels equivalent to those of read alignment with high-throughput fragmented sequencing technologies, typically > 99%. Each of these technological advances increases our power to observe gene transcripts.
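The benefit of reading the same molecule repeatedly can be illustrated with a small Python sketch. It is a deliberately simplified model and not the consensus algorithm used by the instrument software: it assumes that each pass miscalls a base independently with the single-pass error rate quoted above (11%), and that a simple majority vote decides the consensus, treating every erroneous pass as voting for the same wrong base. The function name `consensus_error` is hypothetical.

```python
from math import comb


def consensus_error(p_single: float, n_passes: int) -> float:
    """Probability that a majority vote over n_passes miscalls a base.

    Simplifying assumptions (illustrative only): errors are independent
    between passes, and all erroneous passes agree on the same wrong
    base, which makes this a pessimistic estimate.
    """
    if n_passes % 2 == 0:
        raise ValueError("use an odd number of passes to avoid ties")
    majority = n_passes // 2 + 1
    # Sum the binomial probabilities of a wrong majority.
    return sum(
        comb(n_passes, k) * p_single**k * (1 - p_single) ** (n_passes - k)
        for k in range(majority, n_passes + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 11, 15, 21):
        print(f"{n:>2} passes -> consensus error {consensus_error(0.11, n):.2e}")
```

Even under these pessimistic assumptions the consensus error falls rapidly as passes accumulate. Actual consensus callers are more sophisticated than a majority vote, so the sketch should be read only as showing the trend, not as reproducing the accuracy figures cited above.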

D. Transcriptional Measurement Limitations

RNA-sequencing has revolutionized the field of transcriptome analysis (26). Given the challenges of genome assembly, however, the complexity of transcriptome assembly is far greater. Transcripts may come from regions of the genome with paralogous genes or where there have been rearrangements, and, generally speaking, alternative splicing affects all transcripts with > 2 exons. Added to this is the discovery of new exons and new exon/intron junctions in genes that are not paralogous. A great deal of innovation has occurred to manage the ambiguous nature of the results of sequencing on shorter transcripts.

Surprisingly, despite years of short-read RNA-sequencing, transcriptome databases have not grown significantly, yet the first long-read transcriptome sequenced added over 9000 new transcripts (27). Full-length sequencing on unfragmented libraries overcomes the limitations and shortcomings of short-read sequencing on fragmented cDNA libraries. When comparing these technologies, it is important to keep in mind that full-length sequencing reads are in fact transcripts. Comparing the short- and long-read sequencing technologies is best done at the transcript level because, in the case of short-read sequencing, a read does not equal a transcript until it has been assembled and reconstructed. Short-read RNA sequencing uses a cDNA library preparation method that depends upon the fragmentation and size selection of the library. The process of fragmentation and size selection can significantly affect output (28, 29). After fragmentation, transcript reconstruction must occur. Transcript assembly is hard and a fundamental obstacle to RNA-sequencing (19). Isoform deconvolution from short-read RNA-Seq data has been described as a significant problem (30).

Transcript abundance estimation is also a multi-step process. Given the collapsed non-redundant transcriptome, reads are mapped back onto the transcriptome, and estimation is based upon probabilities (Sailfish (31), Cuffdiff (32), eXpress (33), and RSEM (34) all use similar algorithms).

Given that there is necessarily a loss of information during cDNA library construction (due to fragmentation) and that there are loss of information and inefficiencies in transcriptome reconstruction (averaging sensitivity and precision, no method has achieved 60% accuracy for transcript reconstruction) (29, 35), there must also be loss of information during quantification (30).

There are two considerations dominating the RNA-sequencing approach at this time. One is that all transcripts can be reconstructed from short reads of fragmented cDNAs given enough sequencing depth and computer processing power. Our results presented here show that there is a limit to which transcripts can be reconstructed from such fragmented reads. The more complex the transcript (> 2 exons), the less likely the full spectrum of possible transcripts will be recovered. The second is that if a transcript has not been seen, then sequencing more reads, i.e. sequencing deeper, will likely make it possible to reconstruct that transcript. Our results show that greater sequencing depth does in fact increase the number of transcripts found, but that the rate of increase is linear, whereas the number of possible transcripts is likely exponential (36). Proteome diversity is accessible through transcriptional measurement given the ability to resolve the open reading frames, whether from assembly of read fragments into full-length transcripts or from their direct full-length measurement.

Alternative splicing in transcripts contributes greatly to proteome diversity, with an estimated 94% of genes in the human genome alternatively spliced (37). Accurate transcript reconstruction or direct full-length RNA sequencing makes it possible to predict the open reading frames for the transcribed gene. Given the proteome, annotation and structural analysis are further enabled.
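As an illustration of what open reading frame prediction from a full-length transcript involves, the Python sketch below scans the three forward reading frames of a cDNA sequence and returns the longest stretch that begins with an ATG start codon and ends at the first in-frame stop codon. It is a minimal sketch under stated assumptions, not the prediction tool used in this work: it considers only the given strand, uses the standard stop codons, and ignores features such as Kozak context. The function name `longest_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(cdna: str) -> str:
    """Return the longest ATG-to-stop open reading frame on the given strand.

    Assumptions (illustrative only): forward frames only, standard stop
    codons, and the first ATG after the previous stop opens the frame.
    """
    seq = cdna.upper().replace("U", "T")  # accept RNA or DNA letters
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAATGGTAACC"))
```

Because a full-length read already spans the transcript from 5' to 3', a scan of this kind can be applied to it directly, whereas a reconstructed transcript is only as reliable as the assembly that produced it.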

E. Cancer Discoveries and Landmarks

In 1914, Theodor Boveri proposed cancer as a genomic disease, a genetic disease of somatic cells (38). Boveri studied sea urchin egg cells and used the fact that an egg could be inseminated by two sperm cells, studying the resulting chromosomal distribution in daughter cells, to arrive at the conclusion that tumors might arise as a consequence of abnormal chromosome segregation (39). In 1960, Nowell and Hungerford identified the chromosomal abnormality in chronic myeloid leukemia (CML) (40). In 1962, Ludwig Gross proposed a viral origin of cancer (41). In 1971, Knudson presented evidence in retinoblastoma that two mutational hits to the same gene on two different alleles were sufficient to cause the cancer. This was the first evidence that specific mutations, whether present in the germline or somatically acquired, could bring about the onset of cancer (42). Seminal papers in 1982 brought the identification of the mutated proto-oncogene HRAS, the identification of the Bcr-ABL oncogenic fusion protein on the Philadelphia chromosome in CML, and the identification of MYC as an amplified oncogene (43). Activating point mutations in BRAF were discovered in 2002, transforming the normally functional BRAF gene into one that is constitutively active (44). In 2004, activating point mutations were identified in PIK3CA (45), and activating point mutations and small insertions and deletions were identified in EGFR (46). In 2005, translocations were identified in solid tumors (47). Also in 2005, NIH Director Elias A. Zerhouni, M.D., announced a plan for a comprehensive effort to explore cancer genomics, launching the pilot project and the beginnings of The Cancer Genome Atlas (TCGA).

The discussion evolved from whether to undertake a sequencing project to how and when, driving coordination among multiple projects both domestically and internationally. With the TCGA coordinating with the International Cancer Genome Consortium (ICGC) and other large-scale cancer genome sequencing projects, over 100,000 somatic mutations had been found by 2009 (48).

The 1000 Genomes Project, a large consortium sequencing project, has as part of its scope the goal to sequence and ascribe meaning to variations associated with disease, using the population scale to ensure statistical viability (49). These efforts have resulted in the identification of a plethora of point mutations and structural alterations in all cancers.

Understanding the driver mutations has long been the goal. The realization that cancer is very heterogeneous is the result of these sequencing efforts. These large-scale efforts to decipher the molecular basis of disease provide the foundation for biomarkers for early cancer detection and for markers of cancer-specific signatures that allow targeted therapy to attack the disease directly and not the individual.

F. Hematopoietic Transcriptome

Bone marrow is an ideal system in which to study the molecular signals that differentiate and characterize a stem cell when compared to its differentiated progeny and to understand the environment of the stem cell (50, 51). Also, bone marrow directly participates in cancer malignancy; that is, cancer uses bone marrow for invasion and metastasis (52, 53). A stem cell has the dual characteristics of self-renewal and differentiation -- there are competing views as to what comprises the stem cell population, the committed progenitor cells, and the differentiated cells. It has already been shown that isoforms exist in a mixture specific to the cell subpopulation within the hematopoietic transcriptome (54). Understanding the hematopoietic transcriptome can provide insights useful for studies of cancer stem cells and their differentiation programs. Isolating the most undifferentiated cells from a healthy bone marrow cell population and measuring transcript structure and abundance helps us to understand health and provides insight into disease. Transcriptome exploration of human bone marrow began with microarray technology and continues with RNA-sequencing. The molecular characterization of the hematopoietic stem cell microenvironments is a fundamental goal in the field of stem cell biology (55).

G. Transcript Expression, Structure and Mutational Landscape

Gene expression is limited to different subsets of genes at different times of development and in different cell subpopulations present in tissues. Microarray analysis makes use of expression analysis to uncover complex programs of transcriptional control through high-throughput whole-transcriptome analysis. When considering abundance information, probes to known gene loci remain an effective means for gene network analysis (56). As disease is a multi-factorial process, it is important to understand not only the gene expression program but also the consequences when alternative products are transcribed from the same locus. Microarray technology enabled full transcriptome measurement within a sample but misses the mutational differences that indicate mutational markers of disease or tissue. DNA sequencing and cDNA sequencing enable the capture of these mutational differences and of the alteration potential within a transcript. Mutations may cause insertions or deletions of sequence as well as single nucleotide polymorphic differences. These differences may then alter the predicted open reading frame, may cause fusion proteins, and may alter subsequent protein binding partners, altering the entire landscape of expression potential. These alterations to transcripts may occur as part of normal development as well as in response to stresses, whether environmental or causal in a disease landscape.

At all times there are many molecular events occurring within the particular sub cellular populations that would alter the phenotypic potential and outcome of the organism. Genome wide association studies at the genome level establish the mutational correlates associated with disease. The ability to measure this information with the throughput, computer processing power and algorithms enables a global view of disease that we have not seen to date.

Full-length RNA-sequencing allows discovery and profiling of the entire transcriptome of any organism without designing probes or primers and without prior knowledge of the transcriptional program. High-throughput next-generation sequencing performed on fragmented cDNA libraries allows for the discovery of novel junctions and, for less complex genes, the discovery of novel transcripts. Novel transcript isoforms may arise through evolution from mutations that are advantageous to the individual or to the disease, and these may be uncovered by profiling transcriptional programs. Alternative splicing can include intron retention, exon skipping, alternative donor and acceptor sites, alternative transcriptional start sites and alternative end sites.

Given these layers of differences (genotype differences at the level of the individual genome, and environmental, stress-induced or disease-induced mutational differences that potentially alter the open reading frame for the coded protein), structural measurement is as important as the abundance information that may be collected on a specimen. Technology drives innovation, innovation drives discoveries, and measurement tools make it possible to capture, with ever greater specificity and throughput, the specific molecular landscape actuated within a sample at the time and place of interest.

H. Hypothesis, Goal and Specific Aims

The analysis of transcripts of a set of selected genes, as well as the potential for posttranscriptional processing, predicts a highly complex transcriptome and an abundance of hitherto unknown protein isoforms. Classic approaches have not allowed full testing of this hypothesis due to limitations in sequencing read lengths. Full-length sequencing technology provides an opportunity to uncover transcripts that cannot be obtained through traditional transcript reconstruction based upon short-read sequencing. The specific aims of this project were to (a) evaluate the complexity of the hematopoietic transcriptome using full-length RNA sequencing, to (b) illustrate the greater size of the transcriptome compared with the transcriptome reconstructed from short-read sequencing and to (c) evaluate whether cell subpopulations show distinct transcriptome patterns.

CHAPTER 2. MATERIALS AND METHODS

A. Bone Marrow Analysis Workflow

The steps in the bone marrow analysis workflow, from harvest to cell population enrichment analysis and open reading frame structure analysis, are depicted in Figure 2.1. Healthy human bone marrow cells are first separated into lin+ and lin- cell subpopulations; full-length sequencing is then performed, differential isoform analysis is carried out, and open reading frame predictions are made. Mass spectrometry and short-read sequencing are used to validate novel isoforms.

[Figure 2.1 flowchart: Bone Marrow Harvest → Separation into Total, Lineage Negative and Lineage Positive Cell Populations → Unfragmented Full-length RNA-Sequencing of all Cell Populations → Transcript Isoform Differential Analysis and Translation (Open Reading Frame, Protein Isoform) → Validation by Fragmented Short-Read RNA-Sequencing and by Mass Spectrometry → Enrichment Analysis and Protein Structure Modeling]

Figure 2.1 - Bone marrow transcriptome analysis workflow. Information flow from healthy bone marrow harvest to enrichment analysis, protein structure modeling, and validation at both the mass spectrometry level and with fragmented short reads.

B. Healthy Bone Marrow Cells

Freshly harvested bone marrow tissues were collected from discarded, de-identified healthy human bone marrow collection filters. Mononuclear cells were isolated by Ficoll gradient centrifugation. To select for lineage negative (lin-) cells, bone marrow mononuclear cells were incubated with an antibody cocktail containing antibodies against CD2, CD3, CD11b, CD11c, CD14, CD16, CD19, CD24, CD5, CD61, CD66b, and Glycophorin A, using a negative selection kit (Stemcell Technologies). Lineage positive (lin+) cells bound to the antibodies were removed by magnetic beads, and lineage negative (lin-) cells were obtained from the flow-through. To increase purity, lin- cells were enriched two times (see Figure 2.2).

[Figure 2.2 diagram: Harvest Healthy Human Bone Marrow → SepMate + Ficoll Centrifugation (eliminates erythrocytes and platelets) → Antibody Cocktail + Magnetic Beads (magnet separates Lin- from Lin+) → harvest Lin- from the Supernatant (1-2%). Antibody cocktail as labeled in the figure: CD11B, CD11C, CD14, CD16A, CD19, CD2, CD24, CD3EAP, CD56, CD61, CD66B. Additional panel labels: Cell Fate and Clonal Expansion; Cell Potential; Cell Count.]

Figure 2.2 - Bone marrow processing detail. Erythrocytes and platelets are removed through centrifugation, followed by a negative selection assay to isolate the lineage negative cell population from differentiated progenitors.

C. RNA Preparation and Fragmented-Read Sequencing (Frag-seq)

Total RNA was submitted to Otogenetics Corporation (Norcross, GA, USA) for RNA-Seq. Briefly, the integrity and purity of total RNA were assessed using an Agilent Bioanalyzer and the OD260/280 ratio. 5 µg of total RNA was subjected to rRNA depletion using the RiboZero Human/Mouse/Rat kit (Epicentre Biotechnologies). cDNA was generated from the depleted RNA using random hexamers or custom primers and Superscript III (Life Technologies). The resulting cDNA was purified and fragmented using a Covaris fragmentation kit (Covaris, Inc.) and profiled using an Agilent Bioanalyzer, and Illumina libraries were prepared using NEBNext reagents (New England Biolabs). The quality, quantity and size distribution of the Illumina libraries were determined using an Agilent Bioanalyzer 2100. The libraries were then submitted for Illumina HiSeq2000 sequencing. Paired-end 90 or 100 nucleotide (nt) reads were generated and checked for data quality using FASTQC (Babraham Institute), and DNAnexus (DNAnexus, Inc.) was used on the platform provided by the Center for Biotechnology and Computational Biology (University of Maryland) as described in Ref (32). 159,043,023 non-strand-specific paired reads were collected for the total bone marrow sample (deep sequencing). 35,126,712 strand-specific paired reads were collected from the lineage negative cell sample (57).

D. RNA Preparation and Full-Length Sequencing (FL-seq)

Total RNA was submitted to Pacific Biosciences (Menlo Park, CA) and the Icahn School of Medicine at Mount Sinai (New York, NY). The integrity and purity of total RNA were assessed using an Agilent Bioanalyzer and OD260/280 prior to submission. Full-length cDNA synthesis was done from polyA RNA using the Clontech SMARTer PCR cDNA synthesis kit. Libraries were prepared based upon size selection of 1-2 kb, 2-3 kb and 3-6 kb fractions using the BluePippin size selection protocol. BluePippin size selection includes two PCR amplification steps, before and after gel sizing. These size fractions were converted to SMRTbell libraries. Full-length cDNA libraries were produced from 1 ng of poly-A RNA. Non-fragmented, full-length RNA sequencing followed. 17 SMRT cells (7 cells 1-2 kb, 5 cells 2-3 kb, 5 cells 3-6 kb) were used to sequence the total bone marrow cell population. 12 SMRT cells (5 cells 1-2 kb, 5 cells 2-3 kb, 2 cells 3-6 kb) were used to sequence the lin- population.

E. Peptide Analysis by Nano LC-MS/MS

Proteins were extracted using 0.1% Rapigest (Waters) in 25 mM ammonium bicarbonate. Extracted proteins were reduced with 5 mM DTT for 60 min at 60°C and alkylated with 15 mM iodoacetamide for 30 min in the dark. Trypsin (Promega) digestion (2.5 ng/µL) was carried out in a Barocycler NEP2320 (PressureBioSciences) for 1 h at 37°C, and the digest was then vacuum dried in a Speed-vac (Labconco). Tryptic peptides were analyzed on a NanoAcquity UPLC (Waters) by RP chromatography on a Symmetry C18 (3µm, 180µm, 20mm) trap column and UPLC capillary column (BEH 300Å, 1.7µm, 150mm x 0.75µm) (Waters) interfaced with a 5600 TripleTOF (AB Sciex). Separation was achieved by a 250 min gradient elution with ACN containing 0.1% formic acid. The chromatographic method was composed of a 5 min trapping step using 2% ACN, 0.1% formic acid at 15 µL/min and chromatographic separation at 0.4 µL/min as follows: starting conditions 2% ACN, 0.1% formic acid; 1−180 min, 2−60% ACN, 0.1% formic acid; 180−200 min, 60−95% ACN, 0.1% formic acid; 200−220 min, 95% ACN, 0.1% formic acid, followed by equilibration in 2% ACN, 0.1% formic acid for an additional 30 min. For all runs, 5 µL of sample was injected directly after enzymatic digestion. Analysis used an Information Dependent Acquisition (IDA) workflow with one full scan (400−1500 m/z) and 50 MS/MS fragmentations of major multiply charged precursor ions with rolling collision energy. Mass spectra were recorded in the MS range of 400−1500 m/z and MS/MS spectra in the range of 100−1800 m/z with a resolution of 30 000 and mass accuracy up to 2 ppm using the following experimental parameters: declustering potential, 80 V; curtain gas, 15; ion spray voltage, 2300 V; ion source gas 1, 20; interface heater, 180°C; entrance potential, 10 V; collision exit potential, 11 V; exclusion time, 5 s; collision energy was set automatically according to the m/z of the precursor (rolling collision energy). Data were processed using ProteinPilot 4.0 software (AB Sciex). For targeted measurement, an inclusion parent mass list was created according to an in-silico tryptic digest of the sequences of interest.

F. Transcriptome Alignment and Assembly from Fragmented Reads

Reads were trimmed using Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic) with the default parameters, and the reads were then aligned and assembled according to the Tuxedo suite protocol as described in Ref (32). The reference genome used was GRCh37 (hg19). The genes.gtf from this reference was used to guide the read alignment during the Tophat 2 step and the assembly during the Cufflinks 2 step (32). Bowtie 2 indices were used for the genome reference. All computation was performed using Amazon Web Services, with StarCluster software (http://star.mit.edu/cluster/) used to manage the cluster nodes. A Sun Grid Engine was employed to run the tasks.

The specific qsub command for the total bone marrow (T) alignment was:

    #!/bin/bash
    #$ -cwd
    #$ -pe orte 32
    #$ -N "th.fr.unstr.tot"
    #$ -V
    export PATH="/mnt/myenv2/bin:/mydata/cufflinks-2.2.1.Linux_x86_64:/mydata/bowtie2-2.2.5:/mydata/samtools-1.2:$PATH"
    /mydata/tophat-2.0.14.Linux_x86_64/tophat2 -p 32 \
        -G /results/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf \
        -r200 \
        --library-type fr-unstranded \
        -o /trimmed_fastq/transcriptome/ill.bm.total/tophat.2.0.14.fr-unstranded.gtf \
        /results/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome \
        /trimmed_fastq/trimmed.fasta.total.bm.non.ss/bm.trimmed.reads.1.fa \
        /trimmed_fastq/trimmed.fasta.total.bm.non.ss/bm.trimmed.reads.2.fa

The Tophat 2 alignment results for total bone marrow (T) were as follows:

    Left reads:
        Input     : 79521512
        Mapped    : 44936546 (56.5% of input)
        of these  :  2288065 ( 5.1%) have multiple alignments (11316 have >20)
    Right reads:
        Input     : 79521512
        Mapped    : 44971783 (56.6% of input)
        of these  :  2298530 ( 5.1%) have multiple alignments (11474 have >20)
    Unpaired reads:
        Input     : 12673662
        Mapped    :  7229036 (57.0% of input)
        of these  :   425163 ( 5.9%) have multiple alignments (442 have >20)
    56.6% overall read mapping rate.

The specific command for cufflinks 2 assembly for total bone marrow (T) cells was:

    #!/bin/bash
    #$ -cwd
    #$ -pe orte 32
    #$ -N "cf2.tot.fr.un.gtf"
    #$ -V
    export PATH="/mnt/myenv2/bin:/mydata/cufflinks-2.2.1.Linux_x86_64:/mydata/bowtie2-2.21:/mydata/samtools-0.1.19:$PATH"
    /mydata/cufflinks-2.2.1.Linux_x86_64/cufflinks \
        -b /results/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa \
        -p 32 \
        -g /results/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf \
        -L tot.fr.unstranded.gtf \
        -o /trimmed_fastq/transcriptome/ill.bm.total/cufflinks2.2.1.tophat.2.0.14.fr.unstranded.gtf.5.19 \
        /trimmed_fastq/transcriptome/ill.bm.total/tophat.2.0.14.fr-unstranded.gtf/accepted_hits.bam

The specific qsub command for the lineage negative (N) cell subpopulation alignment was:

    #!/bin/bash
    #$ -cwd
    #$ -pe orte 32
    #$ -N "th.neg.gtf"
    #$ -V
    export PATH="/mnt/myenv2/bin:/mydata/cufflinks-2.2.1.Linux_x86_64:/mydata/bowtie2-2.2.5:/mydata/samtools-1.2:$PATH"
    /usr/local/bin/tophat -p 32 \
        -G /results/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf \
        --library-type fr-firststrand \
        -r200 \
        -o /trimmed_fastq/transcriptome/ill.lin.neg/tophat.2.0.14.fr.firststrand.no.mm.gtf \
        /results/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome \
        /trimmed_fastq/trimmed.fasta.lin.neg.ss/lin.neg.trimmed.reads.1.fa \
        /trimmed_fastq/trimmed.fasta.lin.neg.ss/lin.neg.trimmed.reads.2.fa

The Tophat 2 alignment results for lineage negative (N) cell subpopulation were as follows:

    Left reads:
        Input     : 17563356
        Mapped    :  9481943 (54.0% of input)
        of these  :   632485 ( 6.7%) have multiple alignments (826 have >20)
    Right reads:
        Input     : 17563356
        Mapped    :  8551291 (48.7% of input)
        of these  :   507275 ( 5.9%) have multiple alignments (764 have >20)
    51.3% overall read mapping rate.

    Aligned pairs:  5783853
        of these  :   194931 ( 3.4%) have multiple alignments
                      140762 ( 2.4%) are discordant alignments
    32.1% concordant pair alignment rate.
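The percentages in the lineage negative (N) summary above can be re-derived from the raw counts. The following is a minimal sketch, not part of the original analysis; it assumes that the overall mapping rate is the number of mapped reads divided by the number of input reads, and that the concordant pair rate is the number of aligned pairs minus the discordant pairs, divided by the number of input pairs. The helper name `rate` is introduced here for illustration only.

```python
def rate(mapped: int, total: int) -> float:
    """Return mapped/total as a percentage rounded to one decimal place."""
    return round(100.0 * mapped / total, 1)


# Counts copied from the lineage negative (N) TopHat 2 summary.
left_input, left_mapped = 17563356, 9481943
right_input, right_mapped = 17563356, 8551291
aligned_pairs, discordant_pairs = 5783853, 140762

left_rate = rate(left_mapped, left_input)
right_rate = rate(right_mapped, right_input)

# Assumption: overall rate = all mapped reads / all input reads.
overall_rate = rate(left_mapped + right_mapped, left_input + right_input)

# Assumption: concordant rate = (aligned - discordant pairs) / input pairs.
concordant_rate = rate(aligned_pairs - discordant_pairs, left_input)

print(left_rate, right_rate, overall_rate, concordant_rate)
```

Under these assumptions the recomputed values agree with the reported 54.0%, 48.7%, 51.3% and 32.1%.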

The specific command for cufflinks 2 assembly for lineage negative (N) cell subpopulation was:

    #!/bin/bash
    #$ -cwd
    #$ -pe orte 32
    #$ -N "cf2.ill.lin.neg.gtf"
    #$ -V
    export PATH="/mnt/myenv2/bin:/mydata/cufflinks-2.2.1.Linux_x86_64:/mydata/bowtie2-2.21:/mydata/samtools-0.1.19:$PATH"
    /mydata/cufflinks-2.2.1.Linux_x86_64/cufflinks \
        -b /results/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa \
        -p 32 \
        -g /results/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf \
        -L ill.lin.neg.nm.th \
        -o /trimmed_fastq/transcriptome/ill.lin.neg/cufflinks2.2.1.no.mm.no.u.gtf \
        /trimmed_fastq/transcriptome/ill.lin.neg/tophat.2.0.14.fr.firststrand.no.mm/accepted_hits.bam

G. ToFU

Reads obtained from the Pacific Biosciences RS II platform were run through the ToFU (Transcript isoforms: Full-length and Unassembled) pipeline to obtain high-quality, non-chimeric full-length reads (25). These were collapsed to the longest transcript and their abundance information was obtained. A master ID was created to permit the comparison of transcript isoforms obtained from one sample population to another. The abundance information was converted to Transcripts per Million (TPM) according to the specifications from Bo Li et al. (58). Custom Python and R scripts were used to perform the analysis.
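The text does not describe how the master ID was constructed. As one possible illustration only, the sketch below assigns a shared identifier to every distinct isoform structure, keyed on chromosome, strand and exon coordinates. The key layout, the function name `assign_master_ids`, the `M1`, `M2` naming and the toy coordinates are all hypothetical and are not taken from the ToFU output or from the scripts used in this work.

```python
from typing import Dict, List, Tuple

# Hypothetical isoform key: chromosome, strand, and ordered exon coordinates.
IsoformKey = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def assign_master_ids(samples: Dict[str, List[IsoformKey]]) -> Dict[IsoformKey, str]:
    """Give every distinct isoform structure one ID shared by all samples."""
    master: Dict[IsoformKey, str] = {}
    for sample_name in sorted(samples):  # fixed order keeps IDs reproducible
        for key in samples[sample_name]:
            if key not in master:
                master[key] = f"M{len(master) + 1}"
    return master


# Toy coordinates, invented for illustration only.
shared = ("chr9", "+", ((100, 200), (300, 400)))
n_only = ("chr9", "+", ((100, 200), (350, 400)))
samples = {"T": [shared], "N": [shared, n_only]}

master_ids = assign_master_ids(samples)
for sample_name, isoforms in samples.items():
    print(sample_name, [master_ids[key] for key in isoforms])
```

An exact coordinate match is the simplest possible criterion; a real comparison might tolerate small differences at transcript ends, which this sketch does not attempt.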

H. Conversion of FPKMs to TPMs

Conversion formula obtained from Colin Dewey (personal communication).

$$\mathrm{TPM}_i = \frac{\mathrm{FPKM}_i}{\sum_j \mathrm{FPKM}_j} \times 10^6$$

where the sum runs over all genes or transcripts $j$.
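The conversion above can be sketched in a few lines of Python. This is an illustrative helper, not the script used for the analysis; the function name `fpkm_to_tpm` and the toy identifiers are assumptions introduced here.

```python
from typing import Dict


def fpkm_to_tpm(fpkm: Dict[str, float]) -> Dict[str, float]:
    """Rescale FPKM values so that they sum to one million (TPM)."""
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("the sum of FPKM values must be positive")
    return {name: value / total * 1e6 for name, value in fpkm.items()}


# Toy values, invented for illustration only.
example_fpkm = {"isoform_a": 1.0, "isoform_b": 3.0}
example_tpm = fpkm_to_tpm(example_fpkm)
print(example_tpm)
```

Because every value is divided by the same total, the relative abundances within a sample are unchanged and the resulting TPM values sum to 10^6.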

I. Conversion of ToFU abundance to TPMs

Conversion formula obtained from the 2010 paper by Bo Li et al. (58).
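The exact form in which the formula was applied to the ToFU abundances is not given here. The sketch below shows the general count-to-TPM scaling, in which each count is optionally divided by a transcript length before normalization. It assumes that, for full-length reads, each read represents one transcript molecule so that the length term can be omitted; that assumption, the function name `counts_to_tpm` and the toy values are introduced for illustration and are not stated in the text.

```python
from typing import Dict, Optional


def counts_to_tpm(counts: Dict[str, float],
                  lengths: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Convert read counts to TPM, with optional per-transcript length scaling."""
    if lengths is None:
        # Assumption: one full-length read corresponds to one transcript molecule.
        rates = dict(counts)
    else:
        # Fragmented reads: longer transcripts yield more fragments per molecule.
        rates = {name: counts[name] / lengths[name] for name in counts}
    total = sum(rates.values())
    if total <= 0:
        raise ValueError("the sum of rates must be positive")
    return {name: rate / total * 1e6 for name, rate in rates.items()}


# Toy values, invented for illustration only.
toy_counts = {"isoform_a": 10, "isoform_b": 30}
toy_lengths = {"isoform_a": 1000, "isoform_b": 3000}

tpm_full_length = counts_to_tpm(toy_counts)
tpm_length_scaled = counts_to_tpm(toy_counts, toy_lengths)
print(tpm_full_length, tpm_length_scaled)
```

In the toy example the two isoforms differ threefold in count but are equal once length is taken into account, which shows why the choice of length handling matters when fragmented and full-length abundances are compared.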

J. Open Reading Frame Prediction

Open reading frames were predicted using ANGEL, a publicly available software package hosted on GitHub (https://github.com/PacificBiosciences/ANGEL), and SerialCloner 2.6.1 (Franck Perez, SerialBasics). The open reading frames accepted were the first open reading frames and not necessarily the longest. In the cases where there was compelling evidence to accept the longest open reading frame, both the first and the longest were included and designated with letters a, b, etc. appended to the end of the name assigned.
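The distinction between the first and the longest open reading frame can be illustrated with a naive forward-strand scan. This sketch is not the ANGEL algorithm and does not reproduce SerialCloner; the function names and the toy sequence are assumptions made for illustration, and "first" is taken here to mean the ORF with the most 5' start codon.

```python
from typing import List, Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str) -> List[Tuple[int, int]]:
    """Return (start, end) of each ATG-to-stop ORF in the three forward frames.

    Coordinates are 0-based, end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def first_and_longest(seq: str) -> Tuple[Optional[Tuple[int, int]], Optional[Tuple[int, int]]]:
    """Return the most 5' ORF and the longest ORF (None, None if there is no ORF)."""
    orfs = find_orfs(seq)
    if not orfs:
        return None, None
    return orfs[0], max(orfs, key=lambda orf: orf[1] - orf[0])


# Toy transcript: a short ORF in frame 0 followed by a longer ORF in frame 1.
toy_transcript = "ATGAAATAG" + "C" + "ATGGCCGCCGCCTAA"
first_orf, longest_orf = first_and_longest(toy_transcript)
print(first_orf, longest_orf)
```

In the toy transcript the first ORF spans positions 0 to 9 and the longest spans positions 10 to 25, the kind of case in which both would be reported with the a, b suffixes.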

K. Multiple Sequence Alignment

Multiple sequence alignment was done using Clustal Omega through the website (http://www.ebi.ac.uk/Tools/msa/clustalo).

L. Sequence Alignment Editing

Sequence alignments were edited using BioEdit version 7.2.5 (copyright Tom Hall). This sequence alignment editor, written for the Windows environment, was run on OS X Yosemite on a MacBook Pro through the use of Wine version 1.6.2, a Windows compatibility layer available for download and installation through Homebrew version 0.9.5.

M. Naming Convention

For figures in the paper, a simplification of the names generated by the cDNA Primer software was used, creating a nomenclature unique to each figure that relates to the consensus deposited gene structures and uses the deposited protein isoforms to unify results.

N. Figures and Abundance

All figures were generated through custom R scripts permitting alignment of abundance with isoforms. The non-fragmented sequence reads were aligned to the hg19 reference genome using GSNAP. The fragmented sequence reads were aligned to the hg19 reference using Tophat 2 according to Ref. (32). Quantitation was reported by Cufflinks as FPKMs and these were transformed to TPMs according to a formula provided and corrected by Colin Dewey (Genome Center of Wisconsin). GRanges data structures from Bioconductor packages were used to unify the results. All figures generated in R were edited within Adobe Illustrator.

CHAPTER 3 - FULL-LENGTH RNA SEQUENCING REVEALS HUMAN BONE MARROW TRANSCRIPTOME COMPLEXITY

A. Abstract

RNA sequencing is a powerful approach to analyze the composition of transcripts generated from distinct genomic loci. Here we compare transcript isoforms obtained from full-length sequencing of cDNA libraries (FL-seq) with reconstructed transcript isoforms from paired-end sequencing of fragmented cDNA libraries (Frag-seq). The FL-seq analysis reveals a 4-fold increase over previously known transcript isoforms for genomic loci with more than two exons and allows transcript mapping of paralogous genes.

The analysis of transcripts of a set of selected genes, as well as the potential for posttranscriptional processing, predicts a highly complex transcriptome and an abundance of hitherto unknown protein isoforms. Classic approaches have not allowed testing of this hypothesis due to limitations in sequencing read lengths. The overall goal of this project was to delineate the hematopoietic transcriptome revealed by full-length sequencing and demonstrate the shortcomings of transcriptome reconstruction using short reads generally. Specifically, the aims were to (a) evaluate the complexity of the hematopoietic transcriptome using full-length RNA sequencing, to (b) evaluate the size of the transcriptome compared with the transcriptome reconstructed from fragmented-read sequencing techniques and to (c) evaluate whether cell subpopulations show distinct transcriptome patterns.

B. Results

RNA sequencing is crucial for an unbiased analysis of transcriptome complexity (26, 59-65). This complexity is due to posttranscriptional processing of primary transcripts that results in a variety of isoforms generated from the same genomic loci (66, 67). Distinct cell lineages are defined by their transcript isoform expression profiles, and the annotation of cells can be derived from the expression of transcript isoforms that can result in functionally different proteins. Alternate splice site utilization provides cells with a powerful regulatory mechanism of gene expression that can impact the composition of the protein product, and influence the rate of translation of transcripts from multi-exon genes (68, 69).

Here we introduce an analysis approach (see Figure 3.1a) that compares transcript isoforms obtained from full-length sequencing of unfragmented cDNA libraries (FL-seq) with reconstructed transcript isoforms obtained from paired-end sequencing of fragmented cDNA libraries at 20 and 100 million read depths (Frag-seq). Freshly harvested total human bone marrow cells (T) and the lineage-negative progenitor cell population (N) were analyzed in parallel. For FL-seq, we used the recently developed analysis platform ToFU (Transcript isoforms: Full-length and Unassembled (25)). For transcript reconstruction after Frag-seq, we employed the Tuxedo suite (32), first aligning the reads to the genome (70) and then assembling the aligned reads.

[Figure 3.1a diagram: total bone marrow and lineage negative samples each undergo poly A selection of mRNA and cDNA synthesis, followed either by FL-Seq (PacBio; enrich, cluster) or by Frag-Seq (Illumina; fragment, align and assemble), yielding transcript isoform quantity and structure.]

Figure 3.1a - Approach to transcriptome analysis. Transcriptome analysis of freshly harvested human bone marrow (BM) cells and examples of results for the ANXA1 and EEF1A1 genes. Analysis scheme: Poly(A)+ RNA was isolated from total (T, red) or lineage-negative BM cell populations (N, blue). cDNA libraries were subjected to full-length sequencing (FL-seq; PacBio) or fragmented for paired-end sequencing (Frag-seq) at 20 million (T) or 100 million (N) read depth (Illumina). FL-seq was processed using ISO-Seq, Quiver and ToFU (25). Frag-seq reads were first aligned (Tophat 2, Bowtie 2) and then assembled (Cufflinks 2) based on the Tuxedo suite protocol.

The discovered isoforms of two genes, ANXA1 and EEF1A1, which are abundantly expressed in hematopoietic cells, illustrate our findings (Figure 3.1b,g). Frag-seq gave canonical results from total bone marrow (T, in red) and the lineage-negative cell population (N, in blue), indicated as ST-C and SN-C in the lighter color shades (Figure 3.1c,h), and added two novel isoforms, one each for ANXA1 (SN2-27) and EEF1A1 (SN2-35). In contrast, FL-seq revealed an additional 38 transcript isoforms for ANXA1 (Figure 3.1d) coding for 26 distinct predicted open reading frames (Figure 3.1l-n). Similarly, at the EEF1A1 genomic locus the FL-seq method revealed 42 additional transcript isoforms coding for 35 predicted open reading frames (Figure 3.1i and Figure 3.1o-p). It is noteworthy that many of these open reading frames predict fusion proteins generated from domains in the canonical protein. For the previously unknown transcript isoform N7-24, the predicted fusion protein was validated by mass spectrometry (Figure 3.1q). These findings suggest that conventional Frag-seq uncovers only a small portion of the existing transcript isoforms detectable by the FL-seq method. The Frag-seq and FL-seq structure, abundance, and multiple sequence alignments for the predicted open reading frames for ANXA1 and EEF1A1 are shown on the following pages.

[Figure 3.1b-k panels: (b) ANXA1, Chr 9, 75.77-75.78 mb and (g) EEF1A1, Chr 6, 74.226-74.233 mb, each with a 1 kb scale bar and 5'/3' orientation; (c,h) Frag-seq transcript isoforms; (d,i) FL-seq transcript isoforms; (e,f,j,k) abundance in TPM on a logarithmic axis from 1 to 1000.]

Figure 3.1b-k - ANXA1 and EEF1A1 transcript isoform detail. (b,g) Chromosomal loci and gene model for ANXA1 and EEF1A1 from the hg19 annotation. Arrows, direction of transcription. (c-f and h-k) Results for lineage-neg (blue) and total BM (red) from the Frag-seq (c,e,h,j) or FL-seq (d,f,i,k). Transcript isoforms assembled from the Frag-seq (c,h) and FL-seq (d,i). Abundance of transcript isoforms per million (TPM) of Frag-seq (e,j) is transformed from fragments per kilobase of transcript per million mapped (FPKMs; Colin Dewey, pers. comm.). For FL-seq (f,k) TPM is transformed from ToFU reported abundance. S, short reads; C, canonical transcript information; ID#s of the isoforms are based on the identifiers from the sequencing method. C*, open reading frame (ORF) identical with the respective canonical peptide. ORFs are shown in Figure 3.1l-n and Figure 3.1o-q.

To assess the potential biologic impact of complex transcript isoforms detected in the different bone marrow cell populations, we aligned the open reading frames (ORFs) predicted from the FL-seq data. For ANXA1 (see Figure 3.1d and Figure 3.1l-n), the canonical ORF is coded for by 2 different transcript isoforms, and a C-terminal fragment, ORF 2, by 7 different transcript isoforms that are detected in both cell populations. ORFs 12, 13 and 14 are each coded for by one transcript isoform that was found in both cell populations. These shared predicted peptides provide independent corroboration for the conserved coding sequences uncovered by FL-seq and include the canonical protein. In addition to these shared coding sequences, there are a number of ORFs that are distinct between the cell populations (Figure 3.1l-n). For EEF1A1 (Figure 3.1i and Figure 3.1o-q), the canonical protein is coded for by 8 different transcript isoforms, and there is a series of ORFs that are distinct between the bone marrow cell populations. Collectively, our results from these two selected genes show that the transcript isoforms identified by FL-seq code for peptides that show overlap of the ORFs but also distinguish between the cell populations. The multiple sequence alignments for ANXA1 and EEF1A1 for the open reading frames predicted from transcript isoforms from FL-seq and Frag-seq are shown on the following pages.

[Figure 3.1l: multiple sequence alignment of the predicted ANXA1 open reading frames against the canonical protein (P04083), alignment columns 1-370. Row labels in this part of the alignment: T8-1, N17-2, N15-2, N10-2, N3a-2, T12-2, T2-2, T17-2, T14-3, T7-3, T20-4, T15-5, T19-6, N2b-7, N9-8, T9-9, T13-10, P04083, T16-12, N7-12, T11-13, N11-13, T6-14, N16-14, N14-15, P04083-C, T1-C.]
~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKAAYLQETGKPLDETLKKALTGHLEEVVLALLKTPAQFDADELRAAMKGLGTDEDTLIEILASRTNKEIRDINRVYREELKRDLAKDITSDTSGDFRNALLSLAKGDRSEDFGVNEDLADSDARALYEAGERRKGTDVNVFNTILTTRSYPQLRRVFQKYTKYSKHDMNKVLDLELKGDIEKCLTAIVKCATSKPAFFAEKLHQAMKGVGTRHKALIRIMVSRSEIDMNDIKAFYQKMYGISLCQAILDETKGDYEKILVALCGGN*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ T10-C ~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKAAYLQETGKPLDETLKKALTGHLEEVVLALLKTPAQFDADELRAAMKGLGTDEDTLIEILASRTNKEIRDINRVYREELKRDLAKDITSDTSGDFRNALLSLAKGDRSEDFGVNEDLADSDARALYEAGERRKGTDVNVFNTILTTRSYPQLRRVFQKYTKYSKHDMNKVLDLELKGDIEKCLTAIVKCATSKPAFFAEKLHQAMKGVGTRHKALIRIMVSRSEIDMNDIKAFYQKMYGISLCQAILDETKGDYEKILVALCGGN*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N1-C ~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKAAYLQETGKPLDETLKKALTGHLEEVVLALLKTPAQFDADELRAAMKGLGTDEDTLIEILASRTNKEIRDINRVYREELKRDLAKDITSDTSGDFRNALLSLAKGDRSEDFGVNEDLADSDARALYEAGERRKGTDVNVFNTILTTRSYPQLRRVFQKYTKYSKHDMNKVLDLELKGDIEKCLTAIVKCATSKPAFFAEKLHQAMKGVGTRHKALIRIMVSRSEIDMNDIKAFYQKMYGISLCQAILDETKGDYEKILVALCGGN*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N12-16 MNLILRYTFSKMAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKAAYLQETGKPLDETLKKALTGHLEEVVLALLKTPAQFDADELRAAMKGLGTDEDTLIEILASRTNKEIRDINRVYREELKRDLAKDITSDTSGDFRNALLSLAKGDRSEDFGVNEDLADSDARALYEAGERRKGTDVNVFNTILTTRSYPQLRRVFQKYTKYSKHDMNKVLDLELKGDIEKCLTAIVKCATSKPAFFAEKLHQAMKGVGTRHKALIRIMVSRSEIDMNDIKAFYQKMYGISLCQAILDETKGDYEKILVALCGGN*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N13-17 ~~~~~~~~~~ MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAPAS------TDEDTLIEILASRTNKEIRDINRVYREELKRDLAKDITQTHLEIFGTLCFLLLRVTDLRTLV*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ T4-18 
~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKAAYLQETGKPLDETLKKALTGHLEEVVLALLKTPAQFDADELRAAMKGLGTDEDTLIEILASRTKQSRRKEKGDRRKRVQYHPYHQKLSTTSQSVSEIHQVQ*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ T18-19 ~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKAAYLQETGKPLDETLKKALTGHLEEVVLALLKTPAQFDADELRAAMKGLV*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ T3-20 ~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKAAYLQETGKPLDETLKKALTGHLEEVVLALLKTKKSETLTGSTERN*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ T22-21 ~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKAAYLQETGKPLDETLKKALTGHLEEKGERGQT*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ T5-22 ~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKAAYLQETGKPLDETLKKALQVTLRRLF*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N4-23 ~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKAAYLQETGKPWMKH*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ N8-23 ~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKAAYLQETGKPWMKH*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N5-24 ~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKGERGQT*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N6-25 ~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIEHSN*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~ N2a-26 ~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPILPSIHPRMSLPCIRP*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N3b-26 ~~~~~~~~~~~MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPILPSIHPRMSLPCIRP*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SN2-27 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~ ~~~~~~~~~~MFLMRVFSEYETLTLIYFGQGVGTRHKALIRIMVSRSEIDMNDIKAFYQKMYGISLCQAILDETKGDYEKILVALCGGN*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

m

C MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKAAYLQETGKPLDETLKKALTGHLEEVVLALLKTPAQFDADELRAAMKGLGTDEDTLIEILASRTNKEIRDINRVYREELKRDLAKDITSDTSGDFRNALLSLAKGDRSEDFGVNEDLADSDARALYEAGERRKGTDVNVFNTILTTRSYPQLRRVFQKYTKYSKHDMNKVLDLELKGDIEKCLTAIVKCATSKPAFFAEKLHQAMKGVGTRHKALIRIMVSRSEIDMNDIKAFYQKMYGISLCQAILDETKGDYEKILVALCGGN

T13-10 MAMVSEFLKQAWFIENEEQEYVQTVKSSKGGPGSAVSPYPTFNPSSDVAALHKAIMVKGVDEATIIDILTKRNNAQRQQIKAAYLQETGKPLD EDTLIEILASRTNKEIRDINRVYREELKRDLAKDITSDTSGDFRN...GGN

n

Transcript isoform identifiers (isoform-ORF): T8-1, N17-2, N15-2, N10-2, N3a-2, T12-2, T2-2, T17-2, T14-3, T7-3, T20-4, T15-5, T19-6, N2b-7, N9-8, T9-9, T13-10, T16-12, N7-12, T11-13, N11-13, T6-14, N16-14, N14-15, P04083-C, T1-C, T10-C, N1-C, N12-16, N13-17, T4-18, T18-19, T3-20, T22-21, T5-22, N4-23, N8-23, N5-24, N6-25, N2a-26, N3b-26, SN2-27

Highlighted ORF labels: 2, 3, 12, 13, 14, 23, 26, C

Figure 3.1l-n - Multiple sequence alignment for ANXA1. The multiple sequence alignment for the transcript isoforms of ANXA1 shown in Figure 3.1c,d. The identifiers of the transcript isoforms are included. Canonical protein sequences are highlighted in yellow. ORFs coding for the same protein are shown in matching colors. Note: Solid lines connecting protein fragments indicate contiguous sequences predicted from the respective transcript isoform. (l) Overview of predicted amino acid sequence alignment. (m) Example to indicate generation of a predicted fusion protein T13-10 from the canonical (C) protein. (n) Common predicted proteins for groups of transcript isoforms, ORFs 2, 3, 12, 13, 14, 23, 26 and the canonical ORF (C) are listed and shown with the respective color from panel (a).


o

[Panel o: overview alignment of the predicted EEF1A1 amino acid sequences (transcript isoforms and canonical protein P68104), alignment positions 1 to 460]

p

Transcript isoform identifiers (isoform-ORF): N1-2, T10a-3, N16-4, N15-5, N2-6, N23-7, T14-8, N13-9, T1-10, T9-11, T13-12, N24-13, T11-14, P68104, ST2-C, T2-C, N4-C*, N8-C*, N9-C*, N10-C*, N18-C*, N23-C*, N27-C*, N28-15, T12-16, N26-17, N11-18, N22-19, N20-20, N12-21, T8-22, N3-23, T5-22, N4-23, N7-24, T5-25, T7-26, N17-27, T3-28, N5-29, N14a-30, T10b-31, T4-32, N19-33, T6-34, SN2-35, N21-35, N14b-36

Highlighted ORF labels: C, 24, 35

q

YYVIILNHPGQISAGYAPVLDCHTAHIACK

Figure 3.1o-q - Multiple sequence alignment for EEF1A1. The multiple sequence alignment for the transcript isoforms of EEF1A1 shown in Figure 3.1h,i. The identifiers of the transcript isoforms are included. Canonical protein sequences are highlighted in yellow. ORFs coding for the same protein are shown in matching colors. (o) Overview of predicted amino acid sequence alignment. (p) Higher magnification of ten transcript isoform identifiers that code for the canonical protein (yellow). C, Refseq derived transcripts. C*, previously unknown transcript isoforms. Transcript isoforms SN2-35 (from Frag-seq) and N21-35 (from FL-seq) predict the same protein (purple). (q) The fusion protein predicted from the N7-24 transcript isoform (blue) was validated by mass spectrometry. The respective peptide identified after tryptic digest is shown.


The above data for two genes with 8 and 13 canonical exons show that the FL-seq method uncovers a multitude of previously unknown transcript isoforms. To test whether this applies to other genes with lower and higher numbers of canonical exons, we binned genes by their number of canonical exons and identified the loci with the top five transcript isoform counts in each bin (Tables 3.1a-f and 3.2). Surprisingly, the number of unique isoforms identified from Frag-seq plateaued at a maximum average of four isoforms irrespective of the number of exons in a given gene (Figure 3.2a). An increase in sequencing depth from 20 to 100 million reads did not significantly change this maximum (p > 0.05; Table 3.1a-f), an observation consistent with predictions by other investigators (57). In contrast, the number of transcript isoforms detected by FL-seq increased with the number of exons, i.e. with the complexity of gene organization (p = 0.0014; Chi-square test for trend, FL-seq vs. Frag-seq).

To assess the biologic relevance of the detected transcript isoforms and their predicted open reading frames, we identified detectable proteins in the bone marrow cell populations by mass spectrometry analysis. Detected proteins were then compared to the Frag-seq or FL-seq transcripts with the highest number of isoforms at genomic loci with 2 to 16 exons (Table 3.2 and Figure 3.2a). Surprisingly, this unbiased search yielded a significantly higher number of detectable proteins for the 149 FL-seq transcripts than for the 149 Frag-seq transcripts: 52 versus 18 detectable proteins, respectively (p=0.0003, Chi-square; Table 3.2). This proteomic evidence supports the relevance of the ORFs identified by FL-seq.
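The comparison of detected proteins can be illustrated with a small worked example. The sketch below arranges the reported counts (52 of 149 for FL-seq, 18 of 149 for Frag-seq) as a 2x2 contingency table and computes a Pearson chi-square statistic without continuity correction. The table layout, the function name `chi_square_2x2`, and the choice of an uncorrected test are assumptions made for illustration; the exact test configuration used for the reported p-value is not stated in the text, so the value computed here need not match it.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Hypothetical helper for illustration; returns (statistic, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [
        row1 * col1 / n,
        row1 * col2 / n,
        row2 * col1 / n,
        row2 * col2 / n,
    ]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Counts taken from the text; the detected / not-detected split is an assumed layout.
fl_detected, fl_total = 52, 149
frag_detected, frag_total = 18, 149

stat, p_value = chi_square_2x2(
    fl_detected, fl_total - fl_detected,
    frag_detected, frag_total - frag_detected,
)
print(f"chi-square = {stat:.2f}, p = {p_value:.2g}")
```

A continuity-corrected test or a different grouping of the transcripts would give a somewhat different statistic, which is one reason the sketch should be read as an illustration of the comparison rather than a reproduction of the reported analysis.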


The figure and tables on the following pages illustrate these differences in transcript isoform number between FL-seq and Frag-seq as a function of transcript length (as measured by number of canonical exons).


[Figure 3.2a-b: plots comparing FL-seq and Frag-seq. Axis labels: Transcript Isoform (#), Transcript Abundance (#), Exon (#), Transcript isoform #. Annotated p-values: p=0.0014 and p<0.0001.]

Figure 3.2a-b - Number of transcript isoforms obtained from FL-seq and Frag-seq. The number of canonical exons was obtained from the hg19 gene annotation. (a) Comparison of the mean of the top five transcript isoform counts for genes with 2 to 16 exons from the combined analysis of bone marrow cell populations (p=0.0014; Chi-square for trend, FL-seq vs Frag-seq). The numbers of transcript isoforms for each of the subgroups and the gene names are in Tables 3.1a-f and 3.2, respectively. (b) Transcript isoform numbers mapped to distinct genomic loci relative to the canonical number of exons at the respective loci. Comparison of Frag-seq vs FL-seq, p<0.0001; Chi-square for trend.


Table 3.1a - Top 5 isoform counts for genes with 2 to 16 exons shown as combined values in Figure 3.2a, from FL-seq for lineage-negative (N) cell population. Names of the individual genes are shown in Table 3.2.

Full-length RNA sequencing (FL-seq), Lin- (N): top 5 transcript isoform counts

exons   1    2    3    4    5    median   mean
2       4    3    3    3    3    3        3.2
3       11   6    5    4    4    5        6
4       9    9    8    7    6    8        7.8
5       9    9    8    7    7    8        8
6       20   13   13   10   9    13       13
7       28   17   13   10   9    13       15.4
8       12   12   9    9    8    9        10
9       28   13   11   10   8    11       14
10      22   13   10   10   8    10       12.6
11      31   18   15   12   9    15       17
12      12   12   11   9    8    11       10.4
13      71   11   10   9    8    10       21.8
14      18   12   11   9    9    11       11.8
15      11   7    7    7    6    7        7.6
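The median and mean columns of Table 3.1a summarize the five largest isoform counts in each exon class. As a small worked check, the sketch below recomputes both summaries for three rows copied from the table; the variable and function names are illustrative only.

```python
from statistics import mean, median

# Top 5 isoform counts for three exon classes, copied from Table 3.1a (FL-seq, Lin-).
top5_by_exons = {
    2: [4, 3, 3, 3, 3],
    7: [28, 17, 13, 10, 9],
    13: [71, 11, 10, 9, 8],
}


def summarize(counts):
    """Return (median, mean rounded to one decimal) of the top-5 counts."""
    return median(counts), round(mean(counts), 1)


summary = {exons: summarize(counts) for exons, counts in top5_by_exons.items()}
for exons, (med, avg) in summary.items():
    print(f"{exons} exons: median={med}, mean={avg}")
```

The 13-exon row shows why both columns are reported: a single outlier (71 isoforms) pulls the mean well above the median.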


Table 3.1b - Top 5 isoform counts for genes with 2 to 16 exons shown as combined values in Figure 3.2a, from FL-seq for total (T) bone marrow cell population. Names of the individual genes are shown in Table 3.2.

Full-length RNA sequencing, Total (T): top 5 transcript isoform counts

exons   1    2    3    4    5    median   mean
2       4    4    4    3    3    4        3.6
3       10   6    6    6    6    6        6.8
4       26   16   14   14   12   14       16.4
5       25   24   24   17   13   24       20.6
6       25   23   20   16   15   20       19.8
7       19   13   12   11   11   12       13.2
8       69   31   24   22   21   24       33.4
9       41   23   21   20   18   21       24.6
10      21   21   18   12   12   18       16.8
11      58   16   14   12   12   14       22.4
12      329  16   14   13   13   14       77
13      56   13   10   9    8    10       19.2
14      26   22   19   13   13   19       18.6
15      31   18   14   10   9    14       16.4


Table 3.1c - Top 5 isoform counts for genes with 2 to 16 exons shown as combined values here and in Figure 3.2a, from FL-seq for total (T) bone marrow cell population. Names of the individual genes are shown in Table 3.2.

FL-seq, COMBINED (top 5 in combined population): top 5 transcript isoform counts

exons   1    2    3    4    5    median   mean
2       4    4    4    3    3    4        3.6
3       10   6    6    6    6    6        6.8
4       26   16   14   14   12   14       16.4
5       25   24   24   17   13   24       20.6
6       25   23   20   16   15   20       19.8
7       19   13   12   11   11   12       13.2
8       69   31   24   22   21   24       33.4
9       41   23   21   20   18   21       24.6
10      21   21   18   12   12   18       16.8
11      58   16   14   12   12   14       22.4
12      329  16   14   13   13   14       77
13      56   13   10   9    8    10       19.2
14      26   22   19   13   13   19       18.6
15      31   18   14   10   9    14       16.4


Table 3.1d - Top 5 isoform counts for genes with 2 to 16 exons shown in Figure 3.2a, from Frag-seq for lineage-negative (N) cell population. Names of the individual genes are shown in Table 3.2.

Fragmented RNA sequencing, Lin- (N): top 5 transcript isoform counts

exons   1    2    3    4    5    median   mean
2       4    4    4    3    3    4        3.6
3       10   6    6    6    6    6        6.8
4       26   16   14   14   12   14       16.4
5       25   24   24   17   13   24       20.6
6       25   23   20   16   15   20       19.8
7       19   13   12   11   11   12       13.2
8       69   31   24   22   21   24       33.4
9       41   23   21   20   18   21       24.6
10      21   21   18   12   12   18       16.8
11      58   16   14   12   12   14       22.4
12      329  16   14   13   13   14       77
13      56   13   10   9    8    10       19.2
14      26   22   19   13   13   19       18.6
15      31   18   14   10   9    14       16.4


Table 3.1e - Top 5 isoform counts for genes with 2 to 16 exons shown in Figure 3.2a, from Frag-seq for total (T) bone marrow cell population. Names of the individual genes are shown in Table 3.2.

Fragmented RNA sequencing (Frag-seq), Total (T): top 5 transcript isoform counts

exons   1    2    3    4    5    median   mean
2       5    4    3    3    3    3        3.6
3       5    5    5    5    4    5        4.8
4       5    5    4    4    4    4        4.4
5       7    6    5    5    5    5        5.6
6       9    6    5    4    4    5        5.6
7       5    4    4    4    4    4        4.2
8       7    6    5    5    5    5        5.6
9       5    5    5    4    4    5        4.6
10      7    5    5    4    4    5        5
11      5    4    4    4    4    4        4.2
12      6    6    5    4    4    5        5
13      5    4    4    4    4    4        4.2
14      5    4    4    4    4    4        4.2
15      5    4    4    4    4    4        4.2


Table 3.1f - Top 5 isoform counts for genes with 2 to 16 exons shown as combined values here and in Figure 3.2a, from Frag-seq for total (T) bone marrow cell population and for lineage negative (N) cell population. Names of the individual genes are shown in Table 3.2.

Fragmented RNA sequencing, COMBINED (top 5 in combined population): Frag-seq top 5 transcript isoform counts

exons   1    2    3    4    5    median   mean
2       5    4    3    3    3    3        3.6
3       5    5    5    5    4    5        4.8
4       5    5    5    4    4    5        4.6
5       7    6    5    5    5    5        5.6
6       9    6    5    4    4    5        5.6
7       5    4    4    4    4    4        4.2
8       7    6    5    5    5    5        5.6
9       5    5    5    4    4    5        4.6
10      7    5    5    4    4    5        5
11      5    4    4    4    4    4        4.2
12      6    6    5    4    4    5        5
13      5    4    4    4    4    4        4.2
14      5    4    4    4    4    4        4.2


Table 3.2 - Gene names for transcript isoforms arranged by the number of canonical exons from the hg19 gene annotation (exon #). The transcript isoform count (trans #), source material, identifiers generated by the sequencing method (ID) and acronym for the respective gene are listed. The presence of proteins predicted from the RNA sequencing was assessed by mass spectrometry of the bone marrow cells. The detection of peptides matching with the respective protein is indicated (X). (^ multiple ZNF matches; ^^ no gene found at this location, no evidence in FL; **** HYAL1 and NAT6 are co-located in the same region on the genome, these overlapping annotations will likely be merged in future genome releases; ** hg19 annotation has ANXA1 with 13 exons, our data shows 14)

Columns: Exon = exon #; Trans = trans #; Method = sequencing method; Cells = bone marrow cell population; MS = proteins confirmed by mass spec.

Exon  Trans   Method    Cells  ID                              Gene          MS
6     25      FL-seq    Total  PB.4886                         ab parts
11    31      FL-seq    Lin-   PB.2080                         ab parts
12    329***  FL-seq    Total  PB.2666                         ab parts
12    13      FL-seq    Total  PB.6200                         ABTB1
7     17      FL-seq    Lin-   PB.5874                         ACTB          X
8     69      FL-seq    Total  PB.7487                         ACTB          X
7     10      FL-seq    Lin-   PB.3033                         ACTG1         X
8     31      FL-seq    Total  PB.3907                         ACTG1         X
15    4       Frag-seq  Total  tot.fr.unstranded.gtf.6428      ADD3
4     5       Frag-seq  Lin-   ill.lin.neg.nm.th.1627          AFLBP
13    4       Frag-seq  Total  tot.fr.unstranded.gtf.18766     AKAP10
9     2       Frag-seq  Lin-   ALDOA                           ALDOA         X
10    12      FL-seq    Total  PB.3146                         ALDOA         X
4     4       Frag-seq  Total  tot.fr.unstranded.gtf.12281     ALG11
10    18      FL-seq    Total  PB.1689                         AMICA1
13    2       Frag-seq  Lin-   ill.lin.neg.nm.th.7767          ANXA1         X
14    22      FL-seq    Total  PB.8299                         ANXA1**       X
14    18      FL-seq    Lin-   PB.6509                         ANXA1**       X
14    2       Frag-seq  Lin-   ill.lin.neg.nm.th.2641          ANXA2         X
15    7       FL-seq    Lin-   PB.2183                         ANXA2         X
15    4       Frag-seq  Total  tot.fr.unstranded.gtf.4771      APBB1P
5     7       FL-seq    Lin-   PB.1872                         APEX1


5     3       Frag-seq  Lin-   APEX1                           APEX1
7     13      FL-seq    Total  PB.1824                         ARHGDIB       X
7     9       FL-seq    Lin-   PB.1436                         ARHGDIB       X
2     3       FL-seq    Lin-   PB.5885                         ARL4A
2     4       Frag-seq  Lin-   ill.lin.neg.nm.th.6959          ARL4A
6     13      FL-seq    Lin-   PB.1729                         ARL6IP4
12    16      FL-seq    Total  PB.7694                         ARPC1B        X
12    11      FL-seq    Lin-   PB.6039                         ARPC1B        X
11    2       Frag-seq  Lin-   ARPC2                           ARPC2         X
15    6       FL-seq    Lin-   PB.2675                         ARRB2
14    2       Frag-seq  Lin-   ill.lin.neg.nm.th.7401          ASAH1         X
13    2       Frag-seq  Lin-   ATP5A1                          ATP5A1        X
5     7       FL-seq    Lin-   PB.3160                         AZU1          X
3     6       FL-seq    Total  PB.6638                         BASP1         X
15    7       FL-seq    Lin-   PB.4193                         BPI           X
2     3       FL-seq    Total  PB.2016                         BTG1
3     6       FL-seq    Total  PB.787                          BTG2
8     21      FL-seq    Total  PB.1064                         C10orf54
6     5       Frag-seq  Total  tot.fr.unstranded.gtf.16229     C16orf13
9     10      FL-seq    Lin-   PB.3060                         C17orf62
6     9       FL-seq    Lin-   PB.2740                         C17orf76-AS1
6     3       Frag-seq  Lin-   ill.lin.neg.nm.th.384           C1orf162
2     4       FL-seq    Total  PB.6013                         C5AR2
14    19      FL-seq    Total  PB.285                          CAP1          X
14    11      FL-seq    Lin-   PB.197                          CAP1          X
10    12      FL-seq    Total  PB.4868                         CAPG          X
10    7       Frag-seq  Total  tot.fr.unstranded.gtf.12837     CASP5
9     5       Frag-seq  Total  tot.fr.unstranded.gtf.8430      CCDC90B
16    2       Frag-seq  Lin-   ill.lin.neg.nm.th.1854          CCT2

15    1       Frag-seq  Lin-   CCT8                            CCT8
6     4       Frag-seq  Total  tot.fr.unstranded.gtf.2946      CD1E
15    2       Frag-seq  Lin-   ill.lin.neg.nm.th.7132          CD36          X
10    5       Frag-seq  Total  tot.fr.unstranded.gtf.9693      CD4
12    4       Frag-seq  Total  tot.fr.unstranded.gtf.4122      CD46
9     13      FL-seq    Lin-   PB.5427                         CD74
10    21      FL-seq    Total  PB.6902                         CD74
16    7       FL-seq    Total  PB.4257                         CD97
13    1       Frag-seq  Lin-   CDC123                          CDC123
5     5       Frag-seq  Total  tot.fr.unstranded.gtf.44946     CHCHD7
12    4       Frag-seq  Total  tot.fr.unstranded.gtf.3659      CHIT1
4     5       Frag-seq  Total  tot.fr.unstranded.gtf.24612     chr2:43,355,151-43,359,900^^
5     5       Frag-seq  Total  tot.fr.unstranded.gtf.43348     chr7:142,493,939-142,495,384^^
14    4       Frag-seq  Total  tot.fr.unstranded.gtf.14195     CHURC1
6     3       Frag-seq  Lin-   ill.lin.neg.nm.th.1647          CLEC4A
5     3       Frag-seq  Lin-   CNBP                            CNBP
8     8       FL-seq    Lin-   PB.1393                         COPS7A
11    14      FL-seq    Total  PB.3151                         CORO1A        X
5     24      FL-seq    Total  PB.272                          CSF3R
14    5       Frag-seq  Total  tot.fr.unstranded.gtf.34222     CSNK2A1
2     3       Frag-seq  Total  tot.fr.unstranded.gtf.5250      CSTF2T
12    12      FL-seq    Lin-   PB.6200                         CTSB
4     16      FL-seq    Total  PB.5139                         CXFL2
13    10      FL-seq    Total  PB.5748                         CYTH4
13    8       FL-seq    Lin-   PB.5949                         DBNL          X
3     3       Frag-seq  Lin-   ill.lin.neg.nm.th.2806          DDX11L10
13    2       Frag-seq  Lin-   ill.lin.neg.nm.th.3516          DDX5
9     4       Frag-seq  Total  tot.fr.unstranded.gtf.733       DHDDS


14    4       Frag-seq  Total  tot.fr.unstranded.gtf.34222     DHX15
11    12      FL-seq    Total  PB.3262                         DPEP2
8     5       Frag-seq  Total  tot.fr.unstranded.gtf.2298      DPH5
4     12      FL-seq    Total  PB.6963                         DUSP1
9     28      FL-seq    Lin-   PB.5717                         EEF1A1        X
10    8       FL-seq    Lin-   PB.1118                         EEF1G
12    13      FL-seq    Total  PB.3480                         EIF4A1
11    3       Frag-seq  Lin-   ill.lin.neg.nm.th.5796          EIF4A2
12    8       FL-seq    Lin-   PB.33                           ENO1          X
12    2       Frag-seq  Lin-   ill.lin.neg.nm.th.88            ENO1          X
13    8       FL-seq    Total  PB.52                           ENO1          X
14    26      FL-seq    Total  PB.3744                         EPX           X
12    3       Frag-seq  Lin-   ill.lin.neg.nm.th.2730          ETFA
3     4       Frag-seq  Total  tot.fr.unstranded.gtf.3691      FAIM3
4     4       Frag-seq  Lin-   ill.lin.neg.nm.th.8108          FAM104B
13    4       Frag-seq  Total  tot.unstranded.gtf.39720        FBXO9
8     6       Frag-seq  Total  tot.fr.unstranded.gtf.5263      FCGR2A
8     22      FL-seq    Total  PB.4539                         FCGRT
9     8       FL-seq    Lin-   PB.6682                         FCN1
10    21      FL-seq    Total  PB.8517                         FCN1
10    2       Frag-seq  Lin-   FDPS                            FDPS
3     6       FL-seq    Total  PB.4389                         FFAR2
7     4       Frag-seq  Total  tot.fr.unstranded.gtf.9564      FGFR10P2
10    5       Frag-seq  Total  tot.fr.unstranded.gtf.39802     FLISP2
14    13      FL-seq    Total  PB.7134                         FLOT1         X
6     23      FL-seq    Total  PB.2567                         FOS
4     14      FL-seq    Total  PB.4567                         FPR1
5     24      FL-seq    Total  PB.1457                         FTH1
5     9       FL-seq    Lin-   PB.1113                         FTH1

12    5       Frag-seq  Total  tot.fr.unstranded.gtf.40449     FYN
9     2       Frag-seq  Lin-   GAPDH                           GAPDH         X
10    22      FL-seq    Lin-   PB.1386                         GAPDH         X
11    58      FL-seq    Total  PB.1766                         GAPDH         X
12    2       Frag-seq  Lin-   ill.lin.neg.nm.th.592           GAS5
11    2       Frag-seq  Lin-   ill.lin.neg.nm.th.787           GDI2          X
7     3       Frag-seq  Lin-   ill.lin.neg.nm.th.629           GLUL
8     24      FL-seq    Total  PB.727                          GLUL
14    9       FL-seq    Lin-   PB.4248                         GNAS
10    4       Frag-seq  Total  tot.fr.unstranded.gtf.13777     GPATCH2L
13    13      FL-seq    Total  PB.3228                         GPR97
14    9       FL-seq    Lin-   PB.3055                         GPS1
8     3       Frag-seq  Lin-   ill.lin.neg.nm.th.394           GSTM1
10    10      FL-seq    Lin-   PB.5374                         H2AFY         X
16    18      FL-seq    Total  PB.6174                         HCLS1
4     4       Frag-seq  Total  tot.fr.unstranded.gtf.1653      HHLA3
8     9       FL-seq    Lin-   PB.5586                         HLA-C         X
9     21      FL-seq    Total  PB.7140                         HLA-C         X
6     20      FL-seq    Total  PB.7174                         HLA-DRA
9     41      FL-seq    Total  PB.7123                         HLA-E
6     10      FL-seq    Lin-   PB.5636                         HMGA1
8     9       FL-seq    Lin-   PB.4320                         HMGN1
6     4       Frag-seq  Lin-   ill.lin.neg.nm.th.6752          HMGN3
13    11      FL-seq    Lin-   PB.5904                         HNRNPA2B1
10    10      FL-seq    Lin-   PB.1881                         HNRNPC        X
10    2       Frag-seq  Lin-   HNRNPH3                         HNRNPH3
12    9       FL-seq    Lin-   PB.5689                         HSP90AB1      X
9     11      FL-seq    Lin-   PB.1343                         HSPA8         X
4     9       FL-seq    Lin-   PB.4687                         HYAL3,NAT6****

2     3       FL-seq    Total  PB.1270                         IFITM1
6     15      FL-seq    Total  PB.5666                         IGLL5
7     28      FL-seq    Lin-   PB.4388                         IGLL5
2     4       FL-seq    Lin-   PB.672                          IRF2BP2
16    11      FL-seq    Total  PB.741                          IVNS1ABP
5     13      FL-seq    Total  PB.924                          KLF6
4     9       FL-seq    Lin-   PB.712                          KLF6
8     5       Frag-seq  Total  tot.fr.unstranded.gtf.9343      KLRD1
11    9       FL-seq    Lin-   PB.3581                         LAIR1
9     18      FL-seq    Total  PB.221                          LAPTM5
15    4       Frag-seq  Total  tot.fr.unstranded.gtf.9967      LARP4
9     4       Frag-seq  Total  tot.fr.unstranded.gtf.10047     LETMD1
13    9       FL-seq    Total  PB.4600                         LILRB3
2     3       Frag-seq  Total  tot.fr.unstranded.gtf.508       LINC00339
2     3       Frag-seq  Total  tot.fr.unstranded.gtf.1414      LINC00853
7     4       Frag-seq  Total  tot.fr.unstranded.gtf.16042     LINS
4     5       Frag-seq  Total  tot.fr.unstranded.gtf.25680     LOC654342
5     5       Frag-seq  Total  tot.fr.unstranded.gtf.39780     LST1
5     4       Frag-seq  Lin-   ill.lin.neg.nm.th.6623          LST1
5     25      FL-seq    Total  PB.1986                         LYZ           X
5     9       FL-seq    Lin-   PB.1581                         LYZ           X
9     2       Frag-seq  Lin-   MDH1                            MDH1          X
11    4       Frag-seq  Total  tot.fr.unstranded.gtf.3812      MDM4
13    4       Frag-seq  Total  tot.fr.unstranded.gtf.25143     MEIS1
4     4       Frag-seq  Total  tot.fr.unstranded.gtf.9614      METTL20
9     5       Frag-seq  Total  tot.fr.unstranded.gtf.27519     MFF
13    10      FL-seq    Lin-   PB.4982                         MFSD10
7     11      FL-seq    Total  PB.634                          MNDA          X
16    5       Frag-seq  Total  tot.fr.unstranded.gtf.48073     MOSPD2

13    71      FL-seq    Lin-   PB.2913                         MPO           X
13    56      FL-seq    Total  PB.3748                         MPO           X
3     4       Frag-seq  Total  tot.fr.unstranded.gtf.6905      MRPL17
5     5       Frag-seq  Total  tot.fr.unstranded.gtf.4040      MRPL55
3     2       Frag-seq  Lin-   MRPS21                          MRPS21
7     4       Frag-seq  Lin-   ill.lin.neg.nm.th.1342          MS4A7
10    3       Frag-seq  Lin-   ill.lin.neg.nm.th.4401          NAGK
16    5       Frag-seq  Total  tot.fr.unstranded.gtf.10766     NAP1L1        X
16    3       Frag-seq  Lin-   ill.lin.neg.nm.th.1876          NAP1L1        X
14    1       Frag-seq  Lin-   NASP                            NASP
2     3       Frag-seq  Total  tot.fr.unstranded.gtf.10141     NBPF8
12    14      FL-seq    Total  PB.7651                         NCF1C
15    10      FL-seq    Total  PB.733                          NCF2
4     14      FL-seq    Total  PB.1907                         NFE2
7     13      FL-seq    Lin-   PB.1936                         NFKBIA
11    4       Frag-seq  Total  tot.fr.unstranded.gtf.6749      NLRP3
15    7       FL-seq    Lin-   PB.3988                         NOP58
15    2       Frag-seq  Lin-   ill.lin.neg.nm.th.4708          NOP58
14    2       Frag-seq  Lin-   ill.lin.neg.nm.th.1235          NUCB2
16    30      FL-seq    Lin-   PB.1016                         NUCB2
14    4       Frag-seq  Total  tot.fr.unstranded.gtf.10441     OS9
11    2       Frag-seq  Lin-   ill.lin.neg.nm.th.2217          OSGEP
16    9       FL-seq    Lin-   PB.6359                         PABPC1
16    4       Frag-seq  Total  tot.fr.unstranded.gtf.1141      PABPC4
16    17      FL-seq    Total  PB.111                          PADI4         X
16    4       Frag-seq  Total  tot.fr.unstranded.gtf.36672     PAPD4
15    11      FL-seq    Lin-   PB.1507                         PCBP2
8     3       Frag-seq  Lin-   ill.lin.neg.nm.th.6871          PCMT1
15    4       Frag-seq  Lin-   ill.lin.neg.nm.th.1812          PCPB2

13    1       Frag-seq  Lin-   PGD                             PGD           X
14    13      FL-seq    Total  PB.63                           PGD           X
2     4       Frag-seq  Total  tot.fr.unstranded.gtf.42433     PKK4^^^
11    18      FL-seq    Lin-   PB.2222                         PKM           X
11    16      FL-seq    Total  PB.2834                         PKM           X
7     4       Frag-seq  Lin-   ill.lin.neg.nm.th.4088          PLAUR
9     23      FL-seq    Total  PB.4479                         PLAUR
14    12      FL-seq    Lin-   PB.3449                         PLD3
11    9       FL-seq    Lin-   PB.2077                         PLD4
5     3       Frag-seq  Lin-   ill.lin.neg.nm.th.2226          PNP           X
12    4       Frag-seq  Total  tot.fr.unstranded.gtf.8107      POLD3
3     6       FL-seq    Lin-   PB.206                          PPCS
3     3       Frag-seq  Lin-   ill.lin.neg.nm.th.255           PPCS
11    4       Frag-seq  Total  tot.fr.unstranded.gtf.10000     PPHLN1
10    2       Frag-seq  Lin-   ill.lin.neg.nm.th.239           PPIE
16    6       FL-seq    Lin-   PB.1592                         PPP1R12A
7     11      FL-seq    Total  PB.1414                         PRG3          X
6     4       Frag-seq  Total  tot.fr.unstranded.gtf.2131      PRPF38B
5     17      FL-seq    Total  PB.1900                         PRR13
5     8       FL-seq    Lin-   PB.1506                         PRR13
15    5       Frag-seq  Total  tot.fr.unstranded.gtf.10451     PRR13
15    4       Frag-seq  Total  tot.fr.unstranded.gtf.10451     PRR13
6     13      FL-seq    Lin-   PB.3161                         PRTN3         X
15    18      FL-seq    Total  PB.1065                         PSAP          X
15    2       Frag-seq  Lin-   ill.lin.neg.nm.th.960           PSAP          X
9     2       Frag-seq  Lin-   PSMA4                           PSMA4
14    2       Frag-seq  Lin-   PSMD11                          PSMD11
3     4       FL-seq    Lin-   PB.137                          PTAFR
12    6       Frag-seq  Total  tot.fr.unstranded.gtf.13821     PTGR2

6     20      FL-seq    Lin-   PB.4049                         PTMA
16    10      FL-seq    Total  PB.1781                         PTPN6         X
16    9       FL-seq    Lin-   PB.1399                         PTPN6         X
3     4       FL-seq    Lin-   PB.1375                         RHNO1
2     3       FL-seq    Lin-   PB.978                          RHOG
7     4       Frag-seq  Total  tot.fr.unstranded.gtf.11587     RNF34
11    12      FL-seq    Lin-   PB.954                          RNH1
11    12      FL-seq    Total  PB.1271                         RNH1
10    13      FL-seq    Lin-   PB.4477                         RPL3
12    12      FL-seq    Lin-   PB.2213                         RPL4
11    2       Frag-seq  Lin-   ill.lin.neg.nm.th.1018          RPP30
16    6       FL-seq    Lin-   PB.3529                         RUVBL2
12    2       Frag-seq  Lin-   SARNP                           SARNP
7     19      FL-seq    Total  PB.8616                         SAT1
3     6       FL-seq    Total  PB.5853                         SCO2
9     2       Frag-seq  Lin-   SELL                            SELL
3     10      FL-seq    Total  PB.2074                         SELPLG
9     20      FL-seq    Total  PB.2617                         SERPINA1
8     12      FL-seq    Lin-   PB.5509                         SERPINB1
4     3       Frag-seq  Lin-   ill.lin.neg.nm.th.181           SFPQ^^
7     5       Frag-seq  Total  tot.fr.unstranded.gtf.24476     SIGLEC9
15    31      FL-seq    Total  PB.5151                         SLC11A1
6     3       Frag-seq  Lin-   ill.lin.neg.nm.th.514           SLC50A1
16    1       Frag-seq  Lin-   SLU7                            SLU7
6     3       Frag-seq  Lin-   ill.lin.neg.nm.th.2154          SNORA31
16    4       Frag-seq  Total  tot.fr.unstranded.gtf.37419     SNX2
7     12      FL-seq    Total  PB.7437                         SOD2
3     5       Frag-seq  Total  tot.fr.unstranded.gtf.48393     SPIN2B
4     6       FL-seq    Lin-   PB.795                          SRGN

11    4       Frag-seq  Total  tot.fr.unstranded.gtf.9063      SRPR
13    9       FL-seq    Lin-   PB.291                          SRSF11
8     3       Frag-seq  Lin-   ill.lin.neg.nm.th.2357          SRSF5
2     2       Frag-seq  Lin-   SRSF8                           SRSF8
8     3       Frag-seq  Lin-   ill.lin.neg.nm.th.7095          STGA3L2
3     5       Frag-seq  Total  tot.fr.unstranded.gtf.48610     TCEAL1
2     2       Frag-seq  Lin-   TIMM8A                          TIMM8A
15    14      FL-seq    Total  PB.6089                         TKT           X
11    4       Frag-seq  Total  tot.fr.unstranded.gtf.11191     TMEM116
7     3       Frag-seq  Lin-   ill.lin.neg.nm.th.2967          TMEM219
12    3       Frag-seq  Lin-   ill.lin.neg.nm.th.5350          TMEM40
2     2       Frag-seq  Lin-   TNFAIP8                         TNFAIP8
7     3       Frag-seq  Lin-   ill.lin.neg.nm.th.477           TNFAIP8L2
8     7       Frag-seq  Total  tot.fr.unstranded.gtf.2789      TNFAIP8L2
6     16      FL-seq    Total  PB.7951                         TNFRSF10C
8     12      FL-seq    Lin-   PB.447                          TPM3          X
8     3       Frag-seq  Lin-   ill.lin.neg.nm.th.588           TPM3          X
13    4       Frag-seq  Total  tot.fr.unstranded.gtf.36474     TRAPC13
4     3       Frag-seq  Lin-   TREM1                           TREM1
2     3       FL-seq    Lin-   PB.4652                         TREX1
3     5       Frag-seq  Total  tot.fr.unstranded.gtf.6741      TRIM21,OR52K2,OR52K1
3     3       Frag-seq  Lin-   TRIM52-AS1                      TRIM52-AS1
8     5       Frag-seq  Total  tot.fr.unstranded.gtf.34978     TRMT10A
4     8       FL-seq    Lin-   PB.6869                         TSC22D3
4     3       Frag-seq  Lin-   TSTD1                           TSTD1
4     7       FL-seq    Lin-   PB.4446                         TXN2
2     4       FL-seq    Total  PB.3524                         UBB           X
2     4       FL-seq    Total  PB.4510                         UBB           X
2     3       FL-seq    Lin-   PB.2739                         UBB           X

2     2       Frag-seq  Lin-   UBB                             UBB           X
3     11      FL-seq    Lin-   PB.1743                         UBC
4     26      FL-seq    Total  PB.2190                         UBC
15    9       FL-seq    Total  PB.181                          UBXN11
10    2       Frag-seq  Lin-   ill.lin.neg.nm.th.238           UROD
14    4       Frag-seq  Total  tot.fr.unstranded.gtf.40051     VCAN
13    5       Frag-seq  Total  tot.fr.unstranded.gtf.10770     VEZT
10    4       Frag-seq  Total  tot.fr.unstranded.gtf.9339      YBX3
9     5       Frag-seq  Total  tot.fr.unstranded.gtf.22877     YIF1B
3     5       FL-seq    Lin-   PB.3699                         ZFP36L2
5     6       Frag-seq  Total  tot.fr.unstranded.gtf.42104     ZNF138
7     4       Frag-seq  Total  tot.fr.unstranded.gtf.6739      ZNF195
3     3       Frag-seq  Lin-   ill.lin.neg.nm.th.3705          ZNF271
6     4       Frag-seq  Total  tot.fr.unstranded.gtf.gtf.5545  ZNF33A
6     9^      Frag-seq  Total  tot.fr.unstranded.gtf.25145     ZNF816-ZNF321P
5     3       Frag-seq  Lin-   ZNRD1                           ZNRD1


A comparison of the transcript isoform frequency distributions from Frag-seq versus FL-seq shows that 19.1% (9,041) versus 71.5% (20,799) of transcripts, respectively, map to genes with >4 exons, and 8.5% (4,026) versus 43.6% (12,679) of transcripts map to genes with >8 exons (Figure 3.2b). Thus, FL-seq provides an approximately 4-fold gain in information for transcripts of greater complexity. Given that FL-seq shows that almost half of the transcripts are from genes with >8 exons, the ability to span 2 exons with short reads may be inadequate to resolve a full-length transcript successfully without the addition of longer reads (70). Also, mapping of the Frag-seq short reads to the wrong location on the genome could explain the abnormally high number of transcripts mapped to 1- to 2-exon genes (Figure 3.2b and Table 3.5). On the other hand, transcript isoform counts for individual genes with different numbers of canonical exons exceed the theoretical number of possible splice variants (36), as shown in Figure 3.2c, suggesting a higher complexity of splicing than appreciated by Frag-seq or other conventional approaches.
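The "approximately 4-fold" figure can be read as the ratio of the FL-seq share to the Frag-seq share of transcripts in each exon class. The following minimal sketch recomputes that ratio from the percentages quoted in the text; the dictionary layout and the name `fold_gain` are illustrative assumptions, not part of the original analysis.

```python
# Percent of transcripts mapped to genes above each exon threshold,
# as quoted in the text (Figure 3.2b).
percent_of_transcripts = {
    ">4 exons": {"Frag-seq": 19.1, "FL-seq": 71.5},
    ">8 exons": {"Frag-seq": 8.5, "FL-seq": 43.6},
}


def fold_gain(shares):
    """Ratio of the FL-seq share to the Frag-seq share for one exon class."""
    return shares["FL-seq"] / shares["Frag-seq"]


fold = {label: fold_gain(shares) for label, shares in percent_of_transcripts.items()}
for label, value in fold.items():
    print(f"{label}: {value:.2f}-fold")
```

The two classes give ratios on either side of 4, which is consistent with the rounded figure stated above.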

Strikingly, FL-seq identified transcript isoforms from a number of genes that were not detected by Frag-seq (Figure 3.2d). The inability to reconstruct transcript isoforms for these loci from short-read sequencing can be explained by the paralogous nature of the genes involved. CFD is located on chr 19 with AZU1, PRTN3 and ELANE. These four genes rank second in the list of the top ten regions of homozygosity coldspots on human autosomes (71). Genes located in this run of homozygosity (ROH) region are under evolutionary pressure to remain homozygous. Additionally, these genes are highly paralogous: ELANE and CFD are 78% similar (Figure 3.3 and Table 3.3). Fragment sequencing uncovered transcripts from three of these four genes but missed CFD. That is not surprising, because blastn analysis matched CFD fragment sequences to ELANE. FL-seq, however, was able to identify unique isoforms for each of the difficult regions.

Similarly, HLA-A, HLA-B, and HLA-C are paralogous with >80% identity and thus were not detected by Frag-seq (Figures 3.2d and 3.4 and Table 3.4). These data suggest that the presence of transcripts from paralogs adds to the complexity of alignments and obfuscates transcript assembly when using Frag-seq. Also, the paralogous nature of certain genomic loci affects the ability of Frag-seq to match transcripts with loci and will affect the abundance reported. The figure and the tables on the following pages illustrate these differences regarding transcript isoforms uncovered by FL-seq as compared with those uncovered by Frag-seq.
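Similarity figures such as those quoted for the paralogs are typically derived from a pairwise alignment. As a hedged illustration of the idea, the sketch below computes percent identity over the gap-free columns of two already-aligned sequences. The function `percent_identity` and the toy sequences are hypothetical; the values reported in this chapter came from blastn, whose scoring and reporting differ from this simplified definition.

```python
def percent_identity(aligned_a, aligned_b, gap_chars="-~"):
    """Percent of identical residues over columns where neither sequence has a gap.

    Both inputs must already be aligned (same length, gaps marked with '-' or '~').
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x in gap_chars or y in gap_chars:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned sequences (not taken from the data): one gap column, one mismatch.
toy_a = "ACGT-ACGT"
toy_b = "ACGTTACCT"
identity = percent_identity(toy_a, toy_b)
print(f"{identity:.1f}% identity")
```

When two loci share this much sequence, a short read drawn from one of them often aligns equally well to the other, which is the ambiguity described above.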


[Figure 3.2c-d: two panels, c and d, each plotting Transcript Isoform (#) against Exon (#) for FL-seq and Frag-seq. Genes labeled in panel c: UBC, KLF6, LYZ, SAT1, EEF1A1, GLUL, HLA-E, CD74, PKM, ANXA1. Genes labeled in panel d (Frag-seq: not detected): HLA-A, HLA-B, HLA-C, CFD, GATA2.]

Figure 3.2c-d - Representative genes with 3 to 13 canonical exons. (c) Representative genes with 3 to 13 canonical exons and a high number of transcript isoforms from FL-seq. Gene names and isoform numbers are in Table 3.3. (d) Transcript isoforms identified by FL-seq only. No transcripts were identified by Frag-seq. Gene names and numerical values in Table 3.4.


Table 3.3 - Number of exons and transcript isoform numbers (Trans #) for genes shown in Figure 3.2c. The identifiers generated by the sequencing method (ID), acronym of the gene name and description are shown.

Columns: Exon = exon #; Trans = trans #; Method = sequencing method; Cells = bone marrow cell population; Description = description (GeneCards).

Exon  Trans   Method    Cells  ID                              Gene    Description
2     1       Frag-seq  Total  NM_021009                       UBC     Ubiquitin-C
2     2       Frag-seq  Lin-   ill.lin.neg.nm.th.2072          UBC     Ubiquitin-C
2     26      FL-seq    Total  PB.2190                         UBC     Ubiquitin-C
2     11      FL-seq    Lin-   PB.1743                         UBC     Ubiquitin-C
3     2       Frag-seq  Total  tot.fr.unstranded.gtf.5047      KLF6    Kruppel-like factor 6
3     3       Frag-seq  Lin-   ill.lin.neg.nm.th.801           KLF6    Kruppel-like factor 6
3     13      FL-seq    Total  PB.924                          KLF6    Kruppel-like factor 6
3     9       FL-seq    Lin-   PB.712                          KLF6    Kruppel-like factor 6
4     0       Frag-seq  Total  N/A                             LYZ     Lysozyme C
4     2       Frag-seq  Lin-   ill.lin.neg.nm.th.1886          LYZ     Lysozyme C
4     25      FL-seq    Total  PB.1986                         LYZ     Lysozyme C
4     9       FL-seq    Lin-   PB.1581                         LYZ     Lysozyme C
5     3       Frag-seq  Total  tot.fr.unstranded.gtf.48686     SAT1    spermidine/spermine N1-acetyltransferase 1
5     3       Frag-seq  Lin-   ill.lin.neg.nm.th.8049          SAT1    spermidine/spermine N1-acetyltransferase 1
5     19      FL-seq    Total  PB.8616                         SAT1    spermidine/spermine N1-acetyltransferase 1
5     5       FL-seq    Lin-   PB.6757                         SAT1    spermidine/spermine N1-acetyltransferase 1
6     2       Frag-seq  Total  tot.fr.unstranded.gtf.41466     EEF1A1  eukaryotic translation elongation factor 1 alpha 1
6     2       Frag-seq  Lin-   ill.lin.neg.nm.th.6777          EEF1A1  eukaryotic translation elongation factor 1 alpha 1
6     14      FL-seq    Total  PB.7304                         EEF1A1  eukaryotic translation elongation factor 1 alpha 1
6     28      FL-seq    Lin-   PB.5717                         EEF1A1  eukaryotic translation elongation factor 1 alpha 1
7     2       Frag-seq  Total  tot.fr.unstranded.gtf.4034      GLUL    glutamate-ammonia ligase
7     3       Frag-seq  Lin-   ill.lin.neg.nm.th.629           GLUL    glutamate-ammonia ligase
7     24      FL-seq    Total  PB.727                          GLUL    glutamate-ammonia ligase
7     8       FL-seq    Lin-   PB.557                          GLUL    glutamate-ammonia ligase


8     2       Frag-seq  Total  tot.fr.unstranded.gtf.39077     HLA-E   major histocompatibility complex, class I, E
8     0       Frag-seq  Lin-   N/A                             HLA-E   major histocompatibility complex, class I, E
8     41      FL-seq    Total  PB.7123                         HLA-E   major histocompatibility complex, class I, E
8     3       FL-seq    Lin-   PB.5573                         HLA-E   major histocompatibility complex, class I, E
9     1       Frag-seq  Total  tot.fr.unstranded.gtf.39243     CD74    CD74 molecule, major histocompatibility complex, class II invariant chain
9     2       Frag-seq  Lin-   ill.lin.neg.nm.th.6425          CD74    CD74 molecule, major histocompatibility complex, class II invariant chain
9     21      FL-seq    Total  PB.6902                         CD74    CD74 molecule, major histocompatibility complex, class II invariant chain
9     13      FL-seq    Lin-   PB.5427                         CD74    CD74 molecule, major histocompatibility complex, class II invariant chain
10    1       Frag-seq  Total  tot.fr.unstranded.gtf.15771     PKM     pyruvate kinase, muscle
10    2       Frag-seq  Lin-   ill.lin.neg.nm.th.2729          PKM     pyruvate kinase, muscle
10    16      FL-seq    Total  PB.2834                         PKM     pyruvate kinase, muscle
10    18      FL-seq    Lin-   PB.2222                         PKM     pyruvate kinase, muscle
12    1       Frag-seq  Total  tot.fr.unstranded.gtf.47658     ANXA1   annexin A1
12    2       Frag-seq  Lin-   ill.lin.neg.nm.th.7767          ANXA1   annexin A1
12    22      FL-seq    Total  PB.8299                         ANXA1   annexin A1
12    18      FL-seq    Lin-   PB.6509                         ANXA1   annexin A1
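To make the per-gene contrast in Table 3.3 easier to scan, the rows can be collapsed to the largest isoform count seen for each gene and sequencing method. The sketch below does this for three genes whose counts are copied from the table; the tuple layout and the helper `max_isoforms_by_method` are illustrative assumptions rather than part of the original pipeline.

```python
# (gene, sequencing method, cell population, transcript isoform count),
# a subset of rows copied from Table 3.3.
rows = [
    ("UBC", "Frag-seq", "Total", 1), ("UBC", "Frag-seq", "Lin-", 2),
    ("UBC", "FL-seq", "Total", 26), ("UBC", "FL-seq", "Lin-", 11),
    ("HLA-E", "Frag-seq", "Total", 2), ("HLA-E", "Frag-seq", "Lin-", 0),
    ("HLA-E", "FL-seq", "Total", 41), ("HLA-E", "FL-seq", "Lin-", 3),
    ("ANXA1", "Frag-seq", "Total", 1), ("ANXA1", "Frag-seq", "Lin-", 2),
    ("ANXA1", "FL-seq", "Total", 22), ("ANXA1", "FL-seq", "Lin-", 18),
]


def max_isoforms_by_method(table_rows):
    """Map gene -> {method: largest isoform count across cell populations}."""
    best = {}
    for gene, method, _population, count in table_rows:
        per_gene = best.setdefault(gene, {})
        per_gene[method] = max(per_gene.get(method, 0), count)
    return best


best = max_isoforms_by_method(rows)
for gene, by_method in best.items():
    print(f"{gene}: FL-seq {by_method['FL-seq']} vs Frag-seq {by_method['Frag-seq']}")
```

Taking the maximum across populations is only one possible summary; per-population comparisons remain available in the table itself.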


10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 24 ....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|... . |....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....| CFD-P00746 RPYMASVQLNGAHLCGGVLVAEQWVLSAAHCLEDAADGKVQVLLGAHSLSQPEPSKRLYDVLRAVPHPDSQPDTIDHDLLLLQLSEKATLGPAVRPLPWQRVDRDVAPGT CFD-T1 RPYMASVQLNGAHLCGGVLVAEQWVLSAAHCLEDAADGKVQVLLGAHSLSQPEPSKRLYDVLRAVPHPDSQPDTIDHDLLLLQLSEKATLGPAVRPLPWQRVDRDVAPGT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ELANE-T3a ~~~MVSLQLRGGHFCGATLIAPNFVMSAAHCVANVADGKVQVLLGAHSLSQPEPSKRLYDVLRAVPHPDSQPDTIDHDLLLLQLSEKATLGPAVRPLPWQRVDRDVAPGT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ELANE-T6 ~~~MVSLQLRGGHFCGATLIAPNFVMSAAHCVANVNVRAVRVVLGAHNLSRREPTRQVFAVQRIFENGYDPVNLLNDIVILQLNGSATINANVQVAQLPAQGRRLGNGVQ ELANE-T7 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~PTCRCPAAGSGTPPGQRVQ ELANE-P08246WPFMVSLQLRGGHFCGATLIAPNFVMSAAHCVANVNVRAVRVVLGAHNLSRREPTRQVFAVQRIFENGYDPVNLLNDIVILQLNGSATINANVQVAQLPAQGRRLGNGVQ ELANE-T1 WPFMVSLQLRGGHFCGATLIAPNFVMSAAHCVANVNVRAVRVVLGAHNLSRREPTRQVFAVQRIFENGYDPVNLLNDIVILQLNGSATINANVQVAQLPAQGRRLGNGVQ ELANE-T2 WPFMVSLQLRGGHFCGATLIAPNFVMSAAHCVANVNVRAVRVVLGAHNLSRREPTRQVFAVQRIFENGYDPVNLLNDIVILQLNGSATINANVQVAQLPAQGRRLGNGVQ ------ELANE-T4b ------~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ELANE-T4a ~ ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~ ELANE-T5 WPFMVSLQLRGGHFCGATLIAPNFVMSAAHCVANVNVRAVRVVLGAHNLSRREPTRQVFAVQRIFENGYDPVNLLNDIVILQ------GRREGAGSP ELANE-T3b ALHGVPAAARRPLLRRHPDCAQLRHVGRALRGEC------GRREGAGSP

Figure 3.3 - Multiple sequence alignment of ELANE and CFD. A portion of the multiple sequence alignment for the amino acid sequences predicted from transcript isoforms for ELANE and CFD. The identifiers of the transcript isoforms are included.

10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 ....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|.... |....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|.... |....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|.... |....|....|....|....|....|.. HLA-B-16a MRVMAPRTLLLLLSGALALTETWACSHSMRYFYTAVSRPSRGEPHFIAVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQTDRVNLRKLRGYYNQSEAGSHTLQRMYGCDLGPDGRLLRLRRQGLHRPERGPALLDRRGHSGSDHPAQVGGGP* HLA-B-6 MRVMAPRALLLLLSGGLALTETWACSHSMRYFDTAVSRPGRGEPRFISVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQADRVSLRNLRGYYNQSEDGSHTLQRMSGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKLEAARAAEQLRAYLEGTCVEWLRRYLENGKETLQRAEPPKTHVTHHPLSDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHMQHEGLQEPLTLSWEPSSQPTIPDHLSCSREVGLDVSISVSNSWCTELQLLTSLMKLRT* HLA-B-9a MRVMAPRTLLLLLSGALALTETWACSHSMRYFYTAVSRPSRGEPHFIAVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQTDRVNLRKLRGYYNQSEAGSHTLQRMYGCDLGPDGRLLRGYDQSAYDGKDYIAPNEDLRSWTAADTAAQITQRKWEAAREAEQWRAYLEGECVEWLRRYLENGKETLQRAEHPKTHVTHHPVSDHEATLRCWALGFYPTEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPEPLTLRWKSLRQLPVWD* HLA-B-15 MRVMAPRALLLLLSGGLALTETWACSHSMRYFDTAVSRPGRGEPRFISVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQADRVSLRNLRGYYNQSEDGSHTLQRMSGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKLEAARAAEQLRAYLEGTCVEWLRRYLENGKETLQRAGHPEVLGPGLLPCGDHTDLAAGWGGPDPGHRACGDQASRRWNLPEVGSCGGAFWTRAEIHVPYAARGAARAPHPELGAIFPAHHPHHGHRCWPGCPGCPSCPWSCGHRYDV* HLA-C-15 
MRVMAPRALLLLLSGGLALTETWACSHSMRYFDTAVSRPGRGEPRFISVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQADRVSLRNLRGYYNQSEDGSHTLQRMSGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKLEAARAAEQLRAYLEGTCVEWLRRYLENGKETLQRAGHPEVLGPGLLPCGDHTDLAAGWGGPDPGHRACGDQASRRWNLPEVGSCGGAFWTRAEIHVPYAARGAARAPHPELGAIFPAHHPHHGHRCWPGCPGCPSCPWSCGHRYDV* HLA-B-8 MRVMAPRALLLLLSGGLALTETWACSHSMRYFDTAVSRPGRGEPRFISVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQADRVSLRNLRGYYNQSEDGSHTLQRMSGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKLEAARAAEQLRAYLEGTCVEWLRRYLENGKETLQRAEPPKIHVPYAARGAARAPHPELGAIFPAHHPHHGHRCWPGCPGCPSCPWSCGHRYDVVTATVPRALMSLSSLVKPETAACVGLRCRISSHLSFVTSRASGISFCKGT* HLA-B-21 MRVMAPRTLLLLLSGALALTETWACSHSMRYFYTAVSRPSRGEPHFIAVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQTDRVNLRKLRGYYNQSEAGSHTLQRMYGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQWRAYLEGECVEWLRRYLENGKETLQRAEHPKTHVTHHPVSDHEATLRCWALGFYPTEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPEPLTLRWEPSSQPTIPIVGIVAGLAVLAVLAVLGAVVAVVMCRRKSPPPCPP* HLA-B-1 MRVMAPRTLLLLLSGALALTETWACSHSMRYFYTAVSRPSRGEPHFIAVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQTDRVNLRKLRGYYNQSEAGSHTLQRMYGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQWRAYLEGECVEWLRRYLENGKETLQRAEHPKTHVTHHPVSDHEATLRCWALGFYPTEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPEPLTLRWEPSSQPTIPIVGIVAGLAVLAVLAVLGAVVAVVMCRRKSSGGKGGELLSVWD* HLA-B-4 MRVMAPRTLLLLLSGALALTETWACSHSMRYFYTAVSRPSRGEPHFIAVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQTDRVNLRKLRGYYNQSEAGSHTLQRMYGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQWRAYLEGECVEWLRRYLENGKETLQRAEHPKTHVTHHPVSDHEATLRCWALGFYPTEITLTWQRDGEDQTQDTELWWLL* P30510 MRVMAPRTLILLLSGALALTETWACSHSMRYFSTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQTDRVSLRNLRGYYNQSEAGSHTLQWMFGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQRRAYLEGTCVEWLRRYLENGKETLQRAEHPKTHVTHHPVSDHEATLRCWALGFYPAEITLTWQWDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPEPLTLRWEPSSQPTIPIVGIVAGLAVLAVLAVLGAVVAVVMCRRKSSGGKGGSCSQAASSNSAQGSDESLIACKA 
HLA-B-2 MRVMAPRTLLLLLSGALALTETWACSHSMRYFYTAVSRPSRGEPHFIAVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQTDRVNLRKLRGYYNQSEAGSHTLQRMYGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQWRAYLEGECVEWLRRYLENGKETLQRAEHPKTHVTHHPVSDHEATLRCWALGFYPTEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPEPLTLRWEPSSQPTIPIVGIVAGLAVLAVLAVLGAVVAVVMCRRKSSGGKGGSCSQAASSNSAQGSDESLIACKA* HLA-B-10a MRVMAPRTLLLLLSGALALTETWACSHSMRYFYTAVSRPSRGEPHFIAVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQTDRVNLRKLRGYYNQSEAGSHTLQRMYGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQWRAYLEGECVEWLRRYLENGKETLQRAEHPKTHVTHHPVSDHEATLRCWALGFYPTEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPEPLTLRWGGKGGSCSQAASSNSAQGSDESLIACKA* P18464 MRVTAPRTVLLLLWGAVALTETWAGSHSMRYFYTAMSRPGRGEPRFIAVGYVDDTQFVRFDSDAASPRTEPRAPWIEQEGPEYWDRNTQIFKTNTQTYRENLRIALRYYNQSEAGSHTWQTMYGCDVGPDGRLLRGHNQYAYDGKDYIALNEDLSSWTAADTAAQITQRKWEAAREAEQLRAYLEGLCVEWLRRHLENGKETLQRADPPKTHVTHHPVSDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDRTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEPSSQSTIPIVGIVAGLAVLAVVVIGAVVATVMCRRKSSGGKGGSYSQAASSDSAQGSDVSLTA HLA-C-4 MRVTAPRTLLLLLWGAVALTETWAGSHSMRYFHTSVSRPGRGEPRFITVGYVDDTLFVRFDSDAASPREEPRAPWIEQEGPEYWDRETQICKAKAQTDREDLRTLLRYYNQSEAGSHTLQNMYGCDVGPDGRLLRGYHQDAYDGKDYIALNEDLSSWTAADTAAQITQRKWEAARVAEQLRAYLEGECVEWLRRYLENGKETLQRADPPKTHVTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDRTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEPSSQSTVPIVGIVAGLAVLAVVVIGAVVAAVMCRRKSSGGKGGSYSQAACSDSAQGSDVSLTA* HLA-C-7 MLVMAPRTVLLLLSAALALTETWAGSHSMRYFYTSVSRPGRGEPRFISVGYVDDTQFVRFDSDAASPREEPRAPWIEQEGPEYWDRNTQIYKAQAQTDRESLRNLRGYYNQSEAGSHTLQSMYGCDVGPDGRLLRGHDQYAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQRRAYLEGECVEWLRRYLENDPPKTHVTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDRTFQKWAAVVMPSGEEQRYTCHVQHEGLPKPLTLRWEPSSQSTVPIVGIVAGLAVLAVVVIGAVVAAVMCRRKSSGGKGGSYSQAACSDSAQGSDVSLTA* HLA-C-11 
MLVMAPRTVLLLLSAALALTETWAGSHSMRYFYTSVSRPGRGEPRFISVGYVDDTQFVRFDSDAASPREEPRAPWIEQEGPEYWDRNTQIYKAQAQTDRESLRNLRGYYNQSEAGSHTLQSMYGCDVGPDGRLLRGHDQYAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQRRAYLEGECVEWLRRYLENGKDKLERADPPKTHVTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDRTFQKWAAVVVPLAVQAVVVIGAVVAAVMCRRKSSGGKGGSYSQAACSDSAQGSDVSLTA* HLA-B-17 MRVMAPRTLLLLLSGALALTETWACSHSMRYFYTAVSRPSRGEPHFIAVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQTDRVNLRKLRGYYNQSEAGSHTLQRMYGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQWRAYLEGECVEWLRRYLENGKETLQRAEHPKTHVTHHPVSDHEATLRCWALGFYPTEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPEPLTLRWEPSSQPTIPIVGIVAGLAVLAVLAVLGAVVAVVMCRRKSSGGKGGSCSQAASSNSAQGSDESLIACKA* HLA-A-6 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQIEKEGATLRLQAVAVPRALMCLSQLVKCETAALCGTERQELFLPFPL* HLA-A-5 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDRSRDTPAMCSMRVCPSPSP* HLA-A-4a MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPKPHRGHHCWPGSLWSCDHWSCGRCCDVEEEELR* HLA-A-13 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVITGAVVAAVMWRRKSSDRKGGSYSQAASSLYSVRQLPCVGLRGKSCSCPSLCDLKNPDFVSAKAPACVCVRVGIM* HLA-A-2 
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVITGAVVAAVMWRRKSSDR* P16188 MAVMAPRTLLLLLSGALALTHTWAGSHSMRYFSTSVSRPGSGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQERPEYWDQETRNVKAQSQTDRVDLGTLRGYYNQSEAGSHTIQIMYGCDVGSDGRFLRGYEQHAYDGKDYIALNEDLRSWTAADMAAQITQRKWEAARWAEQLRAYLEGTCVEWLRRYLENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWELSSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSLTACKV HLA-A-9 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVITGAVVAAVMWRRKSSPIIFPVPERWG* HLA-B-11 MRVMAPRTLLLLLSGALALTETWACSHSMRYFYTAVSRPSRGEPHFIAVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQTDRVNLRKLRGYYNQSEAGSHTLQRMYGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQWRAYLEGECWALGFYPTEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPEPLTLRWEPSSQPTIPIVGIVAGLAVLAVLAVLGAVVAVVMCRRKSSGGKGGSCSQAASSNSAQGSDESLIACKA* HLA-B-5 MLVMAPRTVLLLLSAALALTETWAGSHSMRYFYTSVSRPGRGEPRFISVGYVDDTQFVRFDSDAASPREEPRAPWIEQEGPEYWDRNTQIYKAQAQTDRESLRNLRGYYNQSEADPQRHT* HLA-B-19 MRVMAPRTL------ACSHSMRYFYTAVSRPSRGEPHFIAVGYVDDTQFVRFDSDAASPRGEPRAPWVEQEGPEYWDRETQKYKRQAQTDRVNLRKLRGYYNQSEAGSHTLQRMYGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQWRAYLEGECVEWLRRYLENGKETLQRAEHPKTHVTHHPVSDHEATLRCWALGFYPTEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPEPLTLRWEPSSQPTIPIVGIVAGLAVLAVLAVLGAVVAVVMCRRKSSGGKGGSCSQAASSNSAQGSDESLIACKA* HLA-C-12 
MLVMAPRTVLLLLSAALALTETWAGSHSMRYFYTSVSRPGRGEPRFISVGYVDDTQFVRFDSDAASPREEPRAPWIEQEGPEYWDRNTQIYKAQAQTDRESLRNLRGYYNQSEAGSHTLQSMYGCDVGPDGRLLRGHDQYAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQRRAYLEGECVEWLRRYLENGKDKLERADPPKTHVTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDRTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEPSSQSTVPSWALLLAWLS* HLA-A-8 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVITGAVVAAVMWRRKSSDRKGGSYSQAASSDSAQGSDVSLTACKV* HLA-C-1 MLVMAPRTVLLLLSAALALTETWAGSHSMRYFYTSVSRPGRGEPRFISVGYVDDTQFVRFDSDAASPREEPRAPWIEQEGPEYWDRNTQIYKAQAQTDRESLRNLRGYYNQSEAGSHTLQSMYGCDVGPDGRLLRGHDQYAYDGKDYIALNEDLRSWTARTRRLRSPSASGRRPVRRSSGEPTWRASAWSGSADTWRTGRTSWSALTPQRHT* HLA-A-14 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAASHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVITGAVVAAVMWRRKSSDRKGGKSCSCPSLCDLKNPDFVSAKAPACVCVRVGIM* HLA-A-10b ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVASLLAWFSLEL* HLA-A-3 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELR* HLA-A-1 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVITGAVVAAVMWRRKSSDRKGGSYSQAASSDSAQGSDVSLTACKV* HLA-C-16a ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~MYGCDVGPDGRLLRGYHQDAYDGKDYIALNEDLSSWTAADTAAQITQRKWEAARVAEQLRAYLEASAWSGSADTWRTGRRRCSARTPKDTRDHHPISDHEATLRCWAWASTLRRSH* HLA-C-14 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MYGCDVGPDGRLLRGHDQYAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQRRAYLEGECVEWLRRYLENGKDKLSCEGLRCRISSRLPFVTSRASGISFCKGT* HLA-C-3 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~MYGCDVGPDGRLLRGYHQDAYDGKDYIALNEDLSSWTAATRRLRSPSASGRRPVWRSS* HLA-B-12 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MYGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAARRYLENGKETLQRAEHPKTHVTHHPVSDHEATLRCWALGFYPTEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPEPLTLRWEPSSQPTIPIVASLLAWLSWLS* HLA-B-7 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~MYGCDLGPDGRLLRGYDQSAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQWRAYLEGECVEWLRRYLENGKETLQRAEHPKTHVTHHPVSDHEATLRCWALGFYPTEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPEPLTLRWEPSSQPTIPIVGIVAGLAVLAVLAVLGAVVAVVMCRRKSSGGKGFLHTSPL* HLA-C-13 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
~MYGCDVGPDGRLLRGHDQYAYDGKDYIALNEDLRSWTAADTAAQITQRKWEAAREAEQRRAYLEGECVEWLRRYLENGKDKLERADPPKTHVTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDRTFQKWAAVVVPSGEEQRWEPSSSPPSPSWALLLAWLS* HLA-A-7a ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDGETRKVKAHSQTHRVDLGTLRGYYNQSEAGSHTVQRMYGCDVGSDWRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQTTKHKWEAAHVAEQLRAYLEGTCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSHSL* HLA-C-8 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~MRYFYTSVSRPGRGEPRFISVGYVDDTQFVRFDSDAASPREEPRAPWIEQEGPEYWDRNTQITLTWQRDGEVLGPGFLPCGDHTELVETRPAGDRTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPSP*

Figure 3.4 - Multiple sequence alignment for HLA-A, HLA-B and HLA-C. Multiple sequence alignment for the amino acid sequences predicted from transcript isoforms for HLA-A, HLA-B and HLA-C. The identifiers of the transcript isoforms are included.


Table 3.4 - Identifiers (ID) generated by the FL-seq, acronym of the gene name and full description are shown for genes in Figure 3.2d. Note that no transcripts in this group of genes were found by Frag-seq.

Exon #*  Trans #  Sequencing Method  Cell Population  ID       Gene   Description (Gene Cards)
5        1        FL-seq             Total            PB.4061  CFD    complement factor D (adipsin)
5        1        FL-seq             Lin-             PB.3163  CFD    complement factor D (adipsin)
6        1        FL-seq             Total            PB.6205  GATA2  GATA binding protein 2
6        2        FL-seq             Lin-             PB.4811  GATA2  GATA binding protein 2
9        14       FL-seq             Total            PB.7117  HLA-A  major histocompatibility complex, class I, A
8        5        FL-seq             Lin-             PB.5569  HLA-A  major histocompatibility complex, class I, A
9        16       FL-seq             Total            PB.7141  HLA-B  major histocompatibility complex, class I, B
8        7        FL-seq             Lin-             PB.5587  HLA-B  major histocompatibility complex, class I, B
9        21       FL-seq             Total            PB.7140  HLA-C  major histocompatibility complex, class I, C
8        9        FL-seq             Lin-             PB.5586  HLA-C  major histocompatibility complex, class I, C


Table 3.5 - Exon counts, transcript isoform frequency distributions from Frag-seq or FL-seq.

Li and Mason (36) quantitatively analyzed splicing complexity by gene size, showing that the combinatorial potential for exon splicing can increase exponentially. They speculated that functional and evolutionary constraints are the reason why the number of transcripts seen thus far is less than the theoretical maximum. Our data indicate, however, that the number of transcripts reported to date is limited by the Frag-seq approach. FL-seq is able to reveal unique transcript isoform structures beyond those discovered by traditional short-read Frag-seq (23, 72, 73) and thus allows for the unambiguous attribution of reads to defined genomic loci and to cell types. By producing a more complete map of the transcript isoform structure for given loci, FL-seq also permits analysis of isoform expression and thus a better understanding of the mechanisms that regulate transcripts from multi-exon genes.
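The gap between combinatorial potential and observed isoform counts can be illustrated with a deliberately simplified model. The sketch below assumes, purely for illustration, that the first and last exons of a gene are always retained and that every internal exon is independently included or skipped, which gives 2^(n-2) possible exon combinations for a gene with n exons. This is a simplifying assumption and not the model used by Li and Mason; the function names are hypothetical, and the exon and transcript counts are those listed for the total cell population in Table 3.4.

```python
# Simplified exon-skipping model (an illustrative assumption, not the
# published model): first and last exons are fixed, and each internal
# exon is either included or skipped.

def max_exon_combinations(n_exons: int) -> int:
    """Upper bound on exon combinations under the simplified model."""
    if n_exons < 2:
        return 1
    return 2 ** (n_exons - 2)

# (gene, exon count, FL-seq transcript count) from Table 3.4, total population
observed = [
    ("CFD", 5, 1),
    ("GATA2", 6, 1),
    ("HLA-A", 9, 14),
    ("HLA-B", 9, 16),
    ("HLA-C", 9, 21),
]

def summarize(rows):
    """Return (gene, observed count, model bound, fraction observed) per gene."""
    out = []
    for gene, n_exons, n_transcripts in rows:
        bound = max_exon_combinations(n_exons)
        out.append((gene, n_transcripts, bound, n_transcripts / bound))
    return out

for gene, seen, bound, frac in summarize(observed):
    print(f"{gene}: {seen} of {bound} possible combinations ({frac:.1%})")
```

Under this model a nine-exon gene allows at most 128 exon combinations, so the isoform counts seen by FL-seq for the HLA genes remain well below the bound. Alternative splice sites within exons and retained introns are ignored here, so the true combinatorial space would be larger still.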


CHAPTER 4 - FULL-LENGTH RNA-SEQUENCING REVEALS SPECIFIC TRANSCRIPT ISOFORMS DISTINGUISH HUMAN BONE MARROW SUBPOPULATIONS

A. Abstract

Here, we report preliminary results showing that full-length sequencing (FL-seq) reveals transcript isoforms in the transcriptomes of undifferentiated lineage-negative (lin-) and differentiated lineage-positive (lin+) cell subpopulations isolated from healthy human bone marrow tissues. We use full-length sequencing of libraries of unfragmented RNA molecules and successfully capture transcript isoforms that segregate our cell populations, extending hematopoietic knowledge with additional transcript isoforms that have large predicted open reading frames.

B. Results

The discovered isoforms of two genes, ENO1 and PKM, which are abundantly expressed in hematopoietic cells, illustrate our findings. Alpha-enolase 1 (ENO1) is a multi-factorial enzyme with two isoforms deposited in the protein database Uniprot. FL-seq on our cell subpopulations (Figure 4.1a-c) reveals a single transcript isoform coding for the canonical protein isoform (C1 in Figure 4.1b-c), P06733, in the lineage-positive (lin+) cell subpopulation at a relatively high abundance of 2589 transcripts per million (TPM). The shorter isoform, P06733-2 (C2 in Figure 4.1b-c), known as MBP-1 or MYC binding protein 1, is found only in the lineage-negative (lin-) cell population. Also present in this cell population is a transcript isoform of MYC. The presence of this isoform in the lin- cell population is consistent with the phenotype. The figures on the following pages contain the transcript isoform details and open reading frames for ENO1.


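[Figure 4.1, panels a-c. Labels recovered from the figure: ENO1; canonical isoforms C1 and C2; scale bar 1 kb; TPM axis (logarithmic) 1, 10, 100, 1000, 10000; genomic coordinates 8.925 mb to 8.935 mb; isoform rows P1-C1, N3-2, N1-4, N4-C2, N7-1, N8-6, N5-3, N2-7, N6-5.]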

Figure 4.1 - ENO1 transcript isoforms. The ENO1 transcript isoforms as measured with full-length sequencing, with abundance as reported by ToFU (23). (a) The canonical isoforms C1 (Uniprot P06733) and C2 (Uniprot P06733-2). (b) Transcript isoform structure as measured in the Lin- (N, blue) and Lin+ (P, red) cell populations. (c) Transcripts per million as reported by ToFU. Canonical isoforms are indicated in lighter shades of blue and red. ID#s of the isoforms are based on the identifiers from the sequencing method. There are 9 transcript isoforms coding for 9 different open reading frames. The multiple sequence alignment for the ORFs is shown in Figure 4.2.


thesisENO1.multiple.seq.aln.docx 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 ....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....| ENOA_HUMAN.P06733 MSILKIHAREIFDSRGNPTVEVDLFTSKGLFRAAVPSGASTGIYEALELRDNDKTRYMGKGVSKAVEHINKTIAPALVSKKLNVTEQEKIDKLMIEMDGTENKSKFGANAILGVSLAVCKAGAVEKGVPLYRHIADLAGNSEVILPVPAF P1-C1 MSILKIHAREIFDSRGNPTVEVDLFTSKGLFRAAVPSGASTGIYEALELRDNDKTRYMGKGVSKAVEHINKTIAPALVSKKLNVTEQEKIDKLMIEMDGTENKSKFGANAILGVSLAVCKAGAVEKGVPLYRHIADLAGNSEVILPVPAF N7-1 MSILKIHAREIFDSRGNPTVEVDLFTSKGLFRAAVPSGASTGIYEALELRDNDKTRYMGKGVSKAVEHINKTIAPALVSKKLNVTEQEKIDKLMIEMDGTENKSKFGANAILGVSLAVCKAGAVEKGVPLYRHIADLAGNSEVILPVPAF N3-2 MSILKIHAREIFDSRGNPTVEVDLFTSKGLFRAAVPSGASTGIYEALELRDNDKTRYMGKGVSKAVEHINKTIAPALVSKKLNVTEQEKIDKLMIEMDGTENKSKFGANAILGVSLAVCKAGAVEKGVPLYRHIADLAGNSEVILPVPAF N5-3 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N1-4 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ENOA_HUMAN.P06733-2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MIEMDGTENKSKFGANAILGVSLAVCKAGAVEKGVPLYRHIADLAGNSEVILPVPAF N4-C2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MIEMDGTENKSKFGANAILGVSLAVCKAGAVEKGVPLYRHIADLAGNSEVILPVPAF N6-5 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MIEMDGTENKSKFGANAILGVSLAVCKAGAVEKGVPLYRHIADLAGNSEVILPVPAF N8-6 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MGKGVSKAVEHINKTIAPALVSKKLNVTEQEK N2-7 MSILKIHAREIFDSRGNPTVEVDLFTSKGLFRAAVPSGASTGIYEALELRDNDKTRYMGRVSQRLLSTSIKLLRLPWLARN*------

160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 ....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....| ENOA_HUMAN.P06733 NVINGGSHAGNKLAMQEFMILPVGAANFREAMRIGAEVYHNLKNVIKEKYGKDATNVGDEGGFAPNILENKEGLELLKTAIGKAGYTDKVVIGMDVAASEFFRSGKYDLDFKSPDDPSRYISPDQLADLYKSFIKDYPVVSIEDPFDQDD P1-C1 NVINGGSHAGNKLAMQEFMILPVGAANFREAMRIGAEVYHNLKNVIKEKYGKDATNVGDEGGFAPNILENKEGLELLKTAIGKAGYTDKVVIGMDVAASEFFRSGKYDLDFKSPDDPSRYISPDQLADLYKSFIKDYPVVSIEDPFDQDD N7-1 NVINGGSHAGNKLAMQEFMILPVGAANFREAMRIGAEVYHNLKNVIKEKYGKDATNVGDEGGFAPNILENKEGLELLKTAIGKAGYRSQAPWSPVGSSSFAVV*------N3-2 NVINGGSHAGNKLAMQEFMILPVGAANFREAMRIGAEVYHNLKNVIKEKYGKDATNVGDEGGFAPNILENKEGLELLKTAIGKAGYTDKVVIGMDVAASEFFRSGKYDLDFKSPDDPSRYISPDQLADLYKSFIKDYPVVSIEDPFDQDD N5-3 ~~~~~~~~~~~~~~MQEFMILPVGAANFREAMRIGAEVYHNLKNVIKEKYGKDATNVGDEGGFAPNILENKEGLELLKTAIGKAGYTDKVVIGMDVAASEFFRSGKYDLDFKSPDDPSRYISPDQLADLYKSFIKDYPVVSIEDPFDQDD N1-4 ~~~~~~~~~~~~~~MQEFMILPVGAANFREAMRIGAEVYHNLKNVIKEKYGKDATNVGDEGGFAPNILENKEGLELLKTAIGKAGYTDKVVIGMDVAASEFFRSGKYDLDFKSPDDPSRYISPDQLADLYKSFIKDYPVVSIEDPFDQDD ENOA_HUMAN.P06733-2 NVINGGSHAGNKLAMQEFMILPVGAANFREAMRIGAEVYHNLKNVIKEKYGKDATNVGDEGGFAPNILENKEGLELLKTAIGKAGYTDKVVIGMDVAASEFFRSGKYDLDFKSPDDPSRYISPDQLADLYKSFIKDYPVVSIEDPFDQDD N4-C2 NVINGGSHAGNKLAMQEFMILPVGAANFREAMRIGAEVYHNLKNVIKEKYGKDATNVGDEGGFAPNILENKEGLELLKTAIGKAGYTDKVVIGMDVAASEFFRSGKYDLDFKSPDDPSRYISPDQLADLYKSFIKDYPVVSIEDPFDQDD N6-5 NVINGGSHAGNKLAMQEFMILPVGAANFREAMRIGAEVYHNLKNVIKEKYGKDATNVGDEGGFAPNILENKEGLELLKTAIGKAGYTDKVVIGMDVAASEFFRSGKYDLDFKSPDDPSRYISPDQLADLYKSFIKDYPVVYIEDPFDQDD N8-6

310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 ....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....| ENOA_HUMAN.P06733 WGAWQKFTASAGIQVVGDDLTVTNPKRIAKAVNEKSCNCLLLKVNQIGSVTESLQACKLAQANGWGVMVSHRSGETEDTFIADLVVGLCTGQIKTGAPCRSERLAKYNQLLRIEEELGSKAKFAGRNFRNPLAK P1-C1 WGAWQKFTASAGIQVVGDDLTVTNPKRIAKAVNEKSCNCLLLKVNQIGSVTESLQACKLAQANGWGVMVSHRSGETEDTFIADLVVGLCTGQIKTGAPCRSERLAKYNQLLRIEEELGSKAKFAGRNFRNPLAK* N7-1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N3-2 WGAWQKFTASAGIQVVGDDLTVTNPKRIAKAVNEKSCNCLLLKVNQIGSVTESLQACKLAQANGWGVMVSHRSGRLKIPSSLTWLWGCALGRSRLVPLADLSAWPSTTSSSELKRSWAARLSLPAGTSETPWPSKLWAGKPFGHLLATQT N5-3 WGAWQKFTASAGIQVVGDDLTVTNPKRIAKAVNEKSCNCLLLKVNQIGSVTESL N1-4 WGAWQKFTASAGIQVVGDDLTVTNPKRIAKAVNEKSCNCLLLKVNQIGSVTESLQACKLAQANGWGVMVSHRSGETEDTFIADLVVGLCTGQIKTGAPCRSERLAKYNQLLRIEEELGSKAKFAGRNFRNPLAK* ENOA_HUMAN.P06733-2 WGAWQKFTASAGIQVVGDDLTVTNPKRIAKAVNEKSCNCLLLKVNQIGSVTESLQACKLAQANGWGVMVSHRSGETEDTFIADLVVGLCTGQIKTGAPCRSERLAKYNQLLRIEEELGSKAKFAGRNFRNPLAK N4-C2 WGAWQKFTASAGIQVVGDDLTVTNPKRIAKAVNEKSCNCLLLKVNQIGSVTESLQACKLAQANGWGVMVSHRSGETEDTFIADLVVGLCTGQIKTGAPCRSERLAKYNQLLRIEEELGSKAKFAGRNFRNPLAK* N6-5 WGAWQKFTASAGIQVVGDDLTVTNPKRIAKAVNEKSCNCLTFHQVSRVM* N8-6 FTASAGIQVVGDDLTVTNPKRIAKAVNEKSCNCLLLKVNQIGSVTESLQACKLAQANGWGVMVSHRSGRLKIPSSLTWLWGCALGRSRLVPLADLSAWPSTTSSSELKRSWAARLSLPAGTSETPWPSKLWAGKPFGHLLATQT

460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 ....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....| N3-2 PPLVSAQAARGPRPTLAGVPAS* N5-3 QAARGPRPTLAGVPAS* N8-6 PPSCQLRQLEAPDQHLQGSLLVSAPPPWSSYRFLRTSTEAKLPGALLAALALQSCNWPKSLFFSPHFPPSV*

Figure 4.2 - ENO1 multiple sequence alignment. Canonical isoforms are highlighted in yellow (P06733) and magenta (P06733-2). Sequences highlighted in green are FL-seq sequences validated by alignment with Frag-seq reads from the lineage-negative (N) and total (T) cell populations.

Another example of segregating isoforms is pyruvate kinase, or PKM. PKM is a multi-factorial enzyme and has three protein isoforms deposited in Uniprot, two of which were found in our cell populations. FL-seq on our cell subpopulations reveals 4 transcript isoforms in the lineage-positive (lin+) cell subpopulation. Two of the isoforms, P1 and P2, code for the same canonical protein, C1 (P14618, PKM2), but differ at the transcript level outside of the coding frame. P1 is expressed at an approximately 8-fold higher level than P2, at 2053 and 245 TPM respectively. P3 and P4 represent novel structures that code for novel isoforms with moderate expression levels, 65 and 114 TPM respectively. There are a total of 18 transcript isoforms in the lineage-negative (lin-) cell population. N1 codes for PKM2 and is expressed at a moderate level of 122 TPM. N2 is the highest expressing isoform in the lin- cell population and codes for the C1 or PKM1 isoform, expressed at 1746 TPM.
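The relative expression levels quoted for the PKM isoforms can be checked with a short calculation. The sketch below uses only the TPM values reported in this section; the dictionary layout and the function names are illustrative assumptions and are not part of ToFU.

```python
import math

# TPM values for PKM transcript isoforms as reported in the text
tpm = {
    "P1": 2053, "P2": 245, "P3": 65, "P4": 114,  # lineage-positive (lin+)
    "N1": 122, "N2": 1746,                        # lineage-negative (lin-)
}

def fold_change(a: str, b: str) -> float:
    """Ratio of the TPM of isoform a to the TPM of isoform b."""
    return tpm[a] / tpm[b]

def log10_tpm(name: str) -> float:
    """TPM on a log10 scale, as on a logarithmic abundance axis."""
    return math.log10(tpm[name])

print(f"P1 vs P2: {fold_change('P1', 'P2'):.1f}-fold")
print(f"N2 vs N1: {fold_change('N2', 'N1'):.1f}-fold")
print(f"log10 TPM of P1: {log10_tpm('P1'):.2f}")
```

Working from ratios rather than absolute TPM values makes it easier to compare isoforms within one population, since TPM totals are normalized separately for each library.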


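[Figure 4.3, panels a-c. Labels recovered from the figure: PKM, chromosome 15; canonical isoforms C1 and C2; scale bar 1 kb; TPM axis (logarithmic) 1, 10, 100, 1000, 10000; genomic coordinates 72.5 mb to 72.52 mb; isoform rows P1-C1, P2-C1, P4-2, P3-7, N2-C1*, N7-C1, N1-C2, N4-4, N18-3, N10-5, N5-4, N16-14, N6-16, N3-11, N14-1, N15-9, N9-13, N11-10, N17-8, N8-12, N13-15, N12-6.]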

Figure 4.3 - PKM transcript isoforms. Results for the lineage-negative (N, blue) and lineage-positive (P, red) cell populations from FL-seq (a-c). TPM is transformed from the abundance reported by ToFU (25). C1 and C2 are the canonical isoforms with Uniprot identifiers P14618-1 and P14618-2. ID#s of the isoforms are based on the identifiers from the sequencing method. There are 22 transcript isoforms coding for 16 different open reading frames. The multiple sequence alignment for the ORFs is provided in Figure 4.4.


thesisPKM.multiple.seq.aln.docx 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 ....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....| N14-1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE P4-2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE N18-3 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE P14618-2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE N1-C2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE P14618 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE P1-C1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE P2-C1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE N2-C1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE N7-C1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE N5-4 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N4-4 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N10-5 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N12-6 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ P3-7 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MIKSGMNVARLNFSHGTHEYHAE N17-8 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE N15-9 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE N11-10 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE N3-11 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MSK N8-12 MQWSSERGERLLTPGACSSEVPSAVPSRSGGSPGHTVFSSERSLLVRPRSHPEPKGEHYVTGSPTPENQRTSAAMSKPHSEAGTAFIQTQQLHAAMADTFLEHMCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE N9-13 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MCRLDIDSPPITARNTGIICTIGPASRSVETLKEMIKSGMNVARLNFSHGTHEYHAE N16-14 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N13-15 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N6-16 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 ....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....| N14-1 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR P4-2 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR N18-3 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR P14618-2 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR N1-C2 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR P14618-1 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR P1-C1 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR P2-C1 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR N2-C1 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR N7-C1 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR N5-4 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR N4-4 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR N10-5 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N12-6 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~MEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR P3-7 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEKDIQDLKFGVEQDVDMVFASFIR N17-8 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGPYLLL*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N15-9 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVFAACSSGTAQSLAAHHVAPPNQGKK N11-10 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAYMEKCDENILWLDYKNICKVVEVGSKIYVDDGLISLQVKQKGADFLVTEVENGGSLGSKKGVNLPGAAVDLPAVSEEGHPGSEVWGRAGC*~~~~~~~~ N3-11 PIVKPGLPSFIR N8-12 TIKNVRTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKKGATLKITLDNAQRTPNPGLGSRNSQQELGALGHWAVVPLKPTLALALTCFSSSLGLSSLHLSPPSTQLSCSKHSTLHLPFSPTTAAPPGLLL*~~~~~~~~~~~~~~~~~ N9-13 TIKNVRTATESFASDPILYRPVAVL*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N16-14 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N13-15 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ N6-16 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 ....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....| N14-1 KASDVHEVRKVLGEKGKNIKIISKIENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMMIGRCNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIMLSGETACSSGTAQSLAAHHVAPPNQGKKEECWTGGPWSQMARG*~~ P4-2 KASDVHEVRKVLGEKGKNIKIISKIENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMMIGRCNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIMLSGETAKGDYPLEAVRMQHLVSSGACPIPQGFGLGLGWMQALVQSF N18-3 KASDVHEVRKVLGEKGKNIKIISKIENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMMIGRCNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIMLSGETAKGDYPLEAVRMQHLVSSGACPIPQGFGLGLGWMQALVQSF P14618-2 KASDVHEVRKVLGEKGKNIKIISKIENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMMIGRCNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIMLSGETAKGDYPLEAVRMQHLIAREAEAAMFHRKLFEELVRASSHST N1-C2 KASDVHEVRKVLGEKGKNIKIISKIENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMMIGRCNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIMLSGETAKGDYPLEAVRMQHLIAREAEAAMFHRKLFEELVRASSHST P14618-1 KASDVHEVRKVLGEKGKNIKIISKIENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMMIGRCNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIMLSGETAKGDYPLEAVRMQHLIAREAEAAIYHLQLFEELRRLAPITS P1-C1 KASDVHEVRKVLGEKGKNIKIISKIENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMMIGRCNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIMLSGETAKGDYPLEAVRMQHLIAREAEAAIYHLQLFEELRRLAPITS P2-C1 KASDVHEVRKVLGEKGKNIKIISKIENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMMIGRCNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIMLSGETAKGDYPLEAVRMQHLIAREAEAAIYHLQLFEELRRLAPITS N2-C1 KASDVHEVRKVLGEKGKNIKIISKIENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMMIGRCNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIMLSGETAKGDYPLEAVRMQHLIAREAEAAIYHLQLFEELRRLAPITS N7-C1 KASDVHEVRKVLGEKGKNIKIISKIENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMMIGRCNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIMLSGETAKGDYPLEAVRMQHLIAREAEAAIYHLQLFEELRRLAPITS N5-4 
[Figure 4.4, continued: PKM multiple sequence alignment, final alignment block (columns 490 to 600). Rows include the reference isoforms P14618-1 and P14618-2 alongside the sequenced isoforms (row labels such as N14-1, P4-2, N18-3, N1-C2, P1-C1, P2-C1, N2-C1, N7-C1, N5-4, N4-4, N10-5, N12-6, P3-7, N15-9, N16-14, N13-15 and N6-16). The residue-level alignment is not reproduced here.]

Figure 4.4 - PKM multiple sequence alignment. The two canonical PKM isoforms are shown in yellow (structure identifier C1) and magenta (structure identifier C2). The sequence highlighted in green has been verified, by alignment with the short-read sequences, as being present only in the Lin- cell population and not in the differentiated cell population.


Above are two genes and their mRNA isoforms uncovered by FL-seq. The distinct transcript isoform structures for these two genes are sufficient to distinguish between hematopoietic cell subpopulations in the bone marrow. This analysis thus opens up new ways of evaluating cell population differences under physiologic, steady-state conditions in a tissue. It is reasonable to assume that such differences would also be apparent in diseased tissues. The transcript isoforms found are due to alternative splicing, gene rearrangements, RNA editing or other post-transcriptional events. We find that high-throughput, full-length RNA sequencing can reveal new isoforms that were hitherto not discoverable by a global analysis of the transcriptome.


CHAPTER 5 - DISCUSSION AND FUTURE DIRECTIONS

The thesis of this work is that the specific transcript isoform associated with a cell in a particular subpopulation defines the specific open reading frame that segregates with that cell population. Furthermore, this work shows that standard transcript reconstruction from fragmented cDNA libraries is not capable of capturing the depth of the transcriptome identified by full-length sequencing of unfragmented cDNA libraries (FL-seq) (Figures 3.2a-b). We have shown that the transcriptome as measured through full-length sequencing on unfragmented cDNA libraries is larger and has a far greater number of distinct transcript isoforms (at times >10-fold) than the transcriptome reconstructed from sequencing on fragmented cDNA libraries (Frag-seq) (Figures 3.2c-d). We have shown that in-depth analysis of a set of selected genes predicts a highly complex transcriptome and an abundance of hitherto unknown protein isoforms (Figures 3.1b-k). We have shown that transcript isoforms and transcripts of entire genes are missed with the standard transcriptome reconstruction methods (Frag-seq) but found through full-length sequencing (FL-seq) (Figures 3.2d, 3.3 and 3.4).

We have illustrated that full-length sequencing (FL-seq) revealed novel open reading frames that code for complete proteins and that these proteins are specific to cellular subpopulations (Figures 3.1l-n and 3.1o-q). By focusing on the transcriptome isolated from healthy human bone marrow and comparing this total cell population with the lineage-negative cell population isolated by sorting for cell-surface markers, we have demonstrated that these transcripts are distinct and specific for their cell populations (Figures 3.1b-k).
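The notion of transcripts being "distinct and specific" for a cell population can be illustrated with a small set comparison. The following is a minimal sketch, not the analysis pipeline used in this work: the function name `population_specific` and the isoform identifiers are hypothetical, and it assumes that isoforms from each population have already been assigned comparable structure identifiers.

```python
def population_specific(isoforms_by_population):
    """Return, for each population, the isoforms seen in that population only."""
    specific = {}
    for name, isoforms in isoforms_by_population.items():
        # Collect every isoform observed in any other population.
        others = set()
        for other_name, other_isoforms in isoforms_by_population.items():
            if other_name != name:
                others |= set(other_isoforms)
        specific[name] = set(isoforms) - others
    return specific


# Hypothetical identifiers for illustration; not taken from the thesis data.
observed = {
    "total_bone_marrow": {"GENE_A.iso1", "GENE_A.iso2", "GENE_B.iso1"},
    "lineage_negative": {"GENE_A.iso1", "GENE_A.iso3", "GENE_B.iso2"},
}

result = population_specific(observed)
for population in sorted(result):
    print(population, sorted(result[population]))
```

An isoform shared by both populations (here `GENE_A.iso1`) carries no segregating information under this simple criterion; only isoforms unique to one population do.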


Though we have not used all methods of transcriptome reconstruction, we propose that no existing transcript reconstruction method based upon fragmented cDNA libraries will be able to recapitulate the transcriptome as exposed through full-length sequencing on unfragmented cDNA libraries. The ToFU clustering method employed in this work (Figure 3.1 and ref. (25)) for full-length sequencing on unfragmented libraries is a de novo method. No genome reference was used until after the transcript isoforms were identified. Once the full-length transcript isoforms were in hand, existing annotations and the reference genome were used to explain the discoveries and to localize each full transcript isoform on the genome. Existing de novo transcriptome reconstruction methods, such as Trinity (74) and Oases (75), represent excellent work but are based upon reads measured from fragmented cDNA libraries. These methods likely extend the transcriptome as compared to the transcriptome demonstrated in this work (Figure 3.1a and refs (32, 70, 76)); nevertheless, we speculate that they will expose longer reconstructed fragments but are not likely to capture the entirety of the transcriptome as directly measured through full-length mRNA sequencing on unfragmented libraries. We also speculate that there will be a bias in the transcriptome as reconstructed, because the fragmented nature of the library will cause the most abundant fragment to be measured most often. The shortcomings of RNA sequencing based upon fragmented cDNA libraries have already been noted, and the authors speculate that most of the problems with such sequencing are overcome with the emerging full-length sequencing methods (23, 77). However, they fall short of endorsing the full-length sequencing methods, citing errors in the raw sequence reads. These problems have now been overcome with the clustering software used in this work (25). Transcriptome reconstruction methods are handicapped by the use of reads from the fragmented library. When measuring fragments, the most frequently occurring transcript fragment will be the most sequenced fragment. This fragment may come from a paralogous gene whose resolution is not possible with fragmented sequencing methods (Figures 3.3 and 3.4). We agree with Sheynkman et al. that the best peptide database is a sample-specific database derived from transcripts whose open reading frames have been identified (78, 79). We propose that the best way to obtain this peptide database is through peptide prediction from full-length open reading frames based upon transcript sequences obtained through full-length mRNA sequencing on unfragmented cDNA libraries.
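The peptide prediction step proposed above can be sketched as a search for the longest complete open reading frame in a full-length transcript. This is an illustrative sketch only and not the prediction software used in this work: the function name `longest_orf_peptide` is hypothetical, only the forward strand is scanned, the standard genetic code is assumed, and open reading frames lacking a stop codon are discarded.

```python
# Standard genetic code, laid out in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def longest_orf_peptide(transcript):
    """Return the longest ATG-to-stop peptide on the forward strand ('' if none)."""
    seq = transcript.upper().replace("U", "T")  # accept RNA or cDNA letters
    best = ""
    for frame in range(3):
        peptide = None  # None means we are not currently inside an ORF
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            amino = CODON_TABLE.get(codon, "X")  # X marks an ambiguous codon
            if peptide is None:
                if codon == "ATG":
                    peptide = "M"
            elif amino == "*":
                # Stop codon closes the ORF; keep it if it is the longest so far.
                if len(peptide) > len(best):
                    best = peptide
                peptide = None
            else:
                peptide += amino
    return best


# Toy sequence for illustration; not a real transcript.
print(longest_orf_peptide("GGATGGCCAAGTAAGG"))
```

Applying such a function to every clustered full-length isoform of a sample would yield one candidate peptide per transcript, which is the sense in which the resulting database is sample specific.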

The specificity of full-length sequencing provides an opportunity to identify specific therapeutic targets from disease-specific transcript isoforms. Cancer is a genetic disease that actively evolves and transforms itself as it progresses into a lethal disease. Identifying therapeutic targets that stop this progression is likely the best opportunity for advancement in the field. Identifying molecular markers and deciphering them in real time by analysis of accessible samples (biopsies, body fluids) can reveal specific molecular changes and possibly prompt changes in treatment or surveillance.

Full-length transcript sequencing reveals the full spectrum of protein isoforms resulting from the sequenced sample. These coding regions contain domains that may be modeled in networks. Existing gene network models will evolve from modeling at the coarser gene level to modeling at a potentially more specific level, that of the domain and the binding site, enabling the identification of more interaction sites between molecules and an opportunity to create an abstraction that will expose even more detail as to the interplay that occurs within tissues through the course of normal development and disease progression.

Analysis of the specific cellular subpopulations arranged by phenotype will provide the clues as to the decision-making potential within a cell that has a particular biomarker profile.

Characterization of the pathways and transcriptional changes in the search for causal transcript isoforms involved in the progression of disease depends upon the information that goes into that analysis. If that information is incomplete, as is the transcriptome reconstructed from fragmented cDNA libraries, the resulting characterization will be incomplete and subsequent decisions based upon it will be subject to a higher false discovery rate (80, 81).

Single-cell sequencing is providing opportunities to characterize cell-specific subpopulations at a level of detail not seen before. Obtaining the full-length sequence from single-cell transcriptomes will likely give the most complete picture of the makeup of a cell (82).

SNPs located in regions classified as introns are given less weight than those located in regions classified as exons. Given the plethora of isoforms that include intron-retaining splice variants with coding potential, this view will change. Sorting out which polymorphisms code for tissue-specific or cell-subpopulation differences, as compared with disease-specific variations, will likely have an enormous effect on understanding drug specificity as well as disease biomarkers. Alternative splicing events that drive oncogenic processes represent attractive therapeutic targets and can serve as biomarkers of disease progression.


This thesis opened with a quote favored by my mentor, Anton Wellstein: “Progress in science results from new technologies, new discoveries and new ideas, probably in that order.” Nobel Laureate Sydney Brenner (1927 - ). Given each new technology, the innovator looks to see what can be done with it. Then, from the application of the technology, new ideas emerge. In medicine, technology will continue to drive innovation, and with such innovation we will realize the potential of curing diseases such as cancer and be ready to respond rapidly to other diseases that emerge and evolve.

Medicine is changing based upon our enhanced molecular understanding. The traditional tools of medicine are being replaced by measuring tools that decipher, with greater precision, the particular molecular processes in flux at a particular time in a particular individual, providing opportunities for earlier therapeutic interventions before molecular disruptions perturb homeostasis beyond repair (83). Leveraging the full spectrum of measurable molecular information that may be gathered on individuals rapidly, inexpensively and accurately, in both health and disease, will lead to therapeutics that are tailored to an individual and are disease specific (84). One can imagine that this will not be a static process but a dynamic one, requiring constant readjustment as the organism adjusts to the challenges (85, 86). Measurement at the transcriptome level is a very powerful method for decoding the underlying molecular networks that drive disease (87). However, not all measurements are equal: transcriptome reconstruction methods fall short for transcripts of greater complexity (> 2 exons) as compared with full-length transcriptome sequencing. Faithful capture of the actual transcript isoform is imperative when using it as a biomarker or as a therapeutic target. Precision medicine will require precision tools and confidence that the complexities of biology are captured by the analysis.
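What "faithful capture" of a multi-exon isoform means can be made concrete by comparing intron chains. The following is a hedged sketch rather than the comparison procedure used in this work: the function names, the three category labels and the coordinates are all hypothetical, and exons are assumed to be given as (start, end) pairs on the same strand of the same reference sequence.

```python
def intron_chain(exons):
    """Return introns as (exon_end, next_exon_start) pairs for the sorted exons."""
    ordered = sorted(exons)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def compare_structures(full_length_exons, reconstructed_exons):
    """Classify a reconstructed transcript against a full-length reference.

    'complete'   : identical intron chain
    'partial'    : the reconstructed chain is a contiguous run of the reference chain
    'discordant' : anything else, including a single-exon fragment of a spliced gene
    """
    ref = intron_chain(full_length_exons)
    rec = intron_chain(reconstructed_exons)
    if rec == ref:
        return "complete"
    n = len(rec)
    if 0 < n < len(ref) and any(
        ref[i:i + n] == rec for i in range(len(ref) - n + 1)
    ):
        return "partial"
    return "discordant"


# Hypothetical four-exon isoform and two reconstructions of it.
full_length = [(100, 200), (300, 400), (500, 600), (700, 800)]
missing_first_exon = [(300, 400), (500, 600), (700, 800)]
skipped_exons = [(100, 200), (500, 600)]

print(compare_structures(full_length, full_length))
print(compare_structures(full_length, missing_first_exon))
print(compare_structures(full_length, skipped_exons))
```

Under this criterion only a "complete" match would justify treating the reconstructed transcript as the same isoform for biomarker or target purposes; a "partial" match is compatible with several distinct full-length isoforms.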


GLOSSARY

cDNA library: complementary DNA library constructed by applying reverse transcriptase to an RNA sample.

FL-seq: full transcript length, 5' to 3' sequencing on unfragmented cDNA libraries (e.g. Pacific Biosciences RS II sequencing).

Frag-seq: sequencing on fragmented cDNA libraries. cDNA libraries are sonicated or otherwise fragmented and size-selected to sizes ranging from 100 to 400 base pairs depending upon the platform requirements (e.g. Illumina HiSeq, Illumina MiSeq, Life Technologies Ion Torrent).

Hematopoietic lineage-negative cells: a subset of undifferentiated progenitor positive cells within the total bone marrow cell population, typically representing 1-2% of the total cell population after plasma and erythrocyte elimination.

Hematopoietic lineage-positive cells: a subset of differentiated lineage-restricted progenitors within the total bone marrow cell population, typically representing 98-99% of the total bone marrow cell population after plasma and erythrocyte elimination.

Peptide isoform: the open reading frame predicted from the transcript isoform.

Total bone marrow cells: the mononuclear cells isolated from bone marrow after plasma and erythrocyte elimination.

Transcript isoform: the messenger RNA, posttranscriptionally modified and/or edited, representing the assembly of exons as either directly measured and clustered by full-length RNA sequencing (FL-seq) or as reconstructed from fragmented-RNA sequencing (Frag-seq).

Transcriptome: the entire collection of transcript isoforms as directly measured and clustered by full-length sequencing (FL-seq) or as reconstructed from fragmented-RNA sequencing (Frag-seq).


BIBLIOGRAPHY

1. Klug A (1968) Rosalind Franklin and the discovery of the structure of DNA. Nature 219(5156):808–844.

2. Bernal JD (1958) Dr. Rosalind E. Franklin. Nature 182:154.

3. Watson JD, Crick FHC (1953) Molecular Structure of Nucleic Acids. Nature 171:737–738.

4. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA 74(12):5463–5467.

5. Maxam AM, Gilbert W (1977) A new method for sequencing DNA. Proc Natl Acad Sci USA 74(2):560–564.

6. Smith LM, et al. (1986) Fluorescence detection in automated DNA sequence analysis. Nature 321(6071):674–679.

7. Southern EM (1975) Detection of specific sequences among DNA fragments separated by gel electrophoresis. Journal of Molecular Biology 98(3):503–517.

8. Maskos U, Southern EM (1992) Oligonucleotide hybridisations on glass supports: a novel linker for oligonucleotide synthesis and hybridisation properties of oligonucleotides synthesised in situ. Nucleic Acids Research 20(7):1679–1684.

9. Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9(1):559.

10. Gerhold DL, Jensen RV, Gullans SR (2002) Better therapeutics through microarrays. Nat Genet 32(Supp):547–551.

11. Schuster SC (2007) Next-generation sequencing transforms today's biology. Nat Meth 5(1):16–18.

12. Morozova O, Marra MA (2008) Applications of next-generation sequencing technologies in functional genomics. Genomics 92:255–264.

13. Reis-Filho JS (2009) Next-generation sequencing. Breast Cancer Res 11(S3):S12.

14. Kahvejian A, Quackenbush J, Thompson JF (2008) What would you do if you could sequence everything? Nature biotechnology 26(10):1125–1133.

15. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Meth 5(7):621–628.

16. Mardis ER (2011) A decade's perspective on DNA sequencing technology. Nature 470(7333):198–203.

17. Nagarajan N, Pop M (2013) Sequence assembly demystified. Nat Rev Genet 14(3):157–167.

18. Bradnam KR, et al. (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2:10.

19. Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet 12(10):671–682.

20. Korlach J, et al. (2015) Real-time analytical methods and systems. US Patent (12814075).

21. Carneiro MO, et al. (2012) Pacific biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics 13:375.

22. Schatz MC, et al. (2012) Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature biotechnology 30(7):693–700.

23. Li S, et al. (2014) Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study. Nature biotechnology 32(9):915–925.

24. Levene MJ, et al. (2003) Zero-Mode Waveguides for Single-Molecule Analysis at High Concentrations. Science 299:682–686.

25. Gordon SP, et al. (2015) Widespread Polycistronic Transcripts in Fungi Revealed by Single-Molecule mRNA Sequencing. PLoS ONE 10(7):e0132628.

26. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10(1):57–63.

27. Thomas S, Underwood JG, Tseng E, Holloway AK (2014) Long-Read Sequencing of Chicken Transcripts and Identification of New Transcript Isoforms. PLoS ONE 9(4):e94650.

28. Brouilette S, et al. (2012) A simple and novel method for RNA-seq library preparation of single cell cDNA analysis by hyperactive Tn5 transposase. Dev Dyn 241(10):1584–1590.

29. Steijger T, et al. (2013) Assessment of transcript reconstruction methods for RNA-seq. Nat Meth 10(12):1177–1184.


30. Korf I (2013) Genomics: the state of the art in RNA-seq analysis. Nat Meth 10(12):1165–1166.

31. Patro R, Mount SM, Kingsford C (2014) Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nature biotechnology 32(5):462–464.

32. Trapnell C, et al. (2012) Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature biotechnology 31(1):46–53.

33. Roberts A, Pachter L (2012) Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Meth 10(1):71–73.

34. Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12:323.

35. Engström PG, et al. (2013) Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Meth 10(12):1185–1191.

36. Li S, Mason CE (2014) The Pivotal Regulatory Landscape of RNA Modifications. Annu Rev Genom Human Genet 15(1):127–150.

37. Oltean S, Bates DO (2013) Hallmarks of alternative splicing in cancer. Oncogene 33(46):5311–5318.

38. Boveri T, Boveri M (1914, 1929) Zur Frage der Entstehung Maligner Tumoren (Gustav Fischer, Jena); English translation The Origin of Malignant Tumors (Williams and Wilkins Company, Baltimore).

39. Balmain A (2001) Cancer genetics: from Boveri and Mendel to microarrays. Nature Reviews cancer 1(1):77–82.

40. Nowell P, Hungerford D (1960) Chromosome studies on normal and leukemic human leukocytes. J Natl Cancer Inst 25:85–109.

41. Gross L (1997) The role of viruses in the etiology of cancer and leukemia in animals and in humans. Proc Natl Acad Sci USA 94:4237–4238.

42. Knudson AG (1971) Mutation and cancer: statistical study of retinoblastoma. Proc Natl Acad Sci USA 68(4):820–823.

43. Wong-Staal F, Dalla-Favera R, Franchini G, Gelmann EP, Gallo RC (1981) Three distinct genes in human DNA related to the transforming genes of mammalian sarcoma retroviruses. Science 213(4504):226–228.


44. Davies H, et al. (2002) Mutations of the BRAF gene in human cancer. Nature 417(6892):949–954.

45. Broderick DK, Di C, Parrett TJ, Samuels YR (2004) Mutations of PIK3CA in anaplastic oligodendrogliomas, high-grade astrocytomas, and medulloblastomas. Cancer Res 64:5048–5050.

46. Pao W, Miller V, Zakowski M (2004) EGF receptor gene mutations are common in lung cancers from “never smokers” and are associated with sensitivity of tumors to gefitinib and erlotinib. Proc Natl Acad Sci USA 101(36):13306–13311.

47. Vogelstein B, Kinzler KW (2004) Cancer genes and the pathways they control. Nat Med 10(8):789–799.

48. Stratton MR, Campbell PJ, Futreal PA (2009) The cancer genome. Nature 458(7239):719–724.

49. The 1000 Genomes Project Consortium, et al. (2011) A map of human genome variation from population-scale sequencing. Nature 467(7319):1061–1073.

50. Morrison SJ, Scadden DT (2014) The bone marrow niche for haematopoietic stem cells. Nature 505(7483):327–334.

51. Goodell MA, Nguyen H, Shroyer N (2015) Somatic stem cell heterogeneity: diversity in the blood, skin and intestinal stem cell compartments. Nat Rev Mol Cell Biol 16(5):299–309.

52. Sethi N, Kang Y (2011) Unravelling the complexity of metastasis — molecular understanding and targeted therapies. Nature Reviews cancer 11:735–748.

53. Weilbaecher KN, Guise TA, McCauley LK (2011) Cancer to bone: a fatal attraction. Nature Reviews cancer 11(6):411–425.

54. Grech G, et al. (2013) Expression of different functional isoforms in haematopoiesis. Int J Hematol 99(1):4–11.

55. Charbord P, et al. (2014) A Systems Biology Approach for Defining the Molecular Framework of the Hematopoietic Stem Cell Niche. Stem Cell 15(3):376–391.

56. Ideker T (2001) Integrated Genomic and Proteomic Analyses of a Systematically Perturbed Metabolic Network. Science 292(5518):929–934.

57. Tarazona S, Garcia-Alcalde F, Dopazo J, Ferrer A, Conesa A (2011) Differential expression in RNA-seq: A matter of depth. Genome Res 21(12):2213–2223.

58. Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26(4):493–500.

59. Gilbert W (1978) Why Genes in Pieces. Nature 271(5645):501–501.

60. Tenen DG, Hromas R, Licht JD, Zhang DE (1997) Transcription factors, normal myeloid development, and leukemia. Blood 90:489–519.

61. Doolittle WF (1978) Genes in pieces: were they ever together? Nature 272(5):581–582.

62. Gilbert W (1985) Genes-in-pieces revisited. Science 228(4701):823–824.

63. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet 40(12):1413–1415.

64. Ozsolak F, Milos PM (2010) RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 12(2):87–98.

65. Jiang Z, et al. (2015) Whole transcriptome analysis with sequencing: methods, challenges and potential solutions. Cell Mol Life Sci:1–15.

66. Breitbart RE, Andreadis A (1987) Alternative splicing: a ubiquitous mechanism for the generation of multiple protein isoforms from single genes. Annual review of biochemistry 56:467–495.

67. Matlin AJ, Clark F, Smith CWJ (2005) Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol 6(5):386–398.

68. Selvanathan SP, et al. (2015) Oncogenic fusion protein EWS-FLI1 is a network hub that regulates alternative splicing. PNAS 112(11):E1307–E1316.

69. Keren H, Lev-Maor G, Ast G (2010) Alternative splicing and evolution: diversification, exon definition and function. Nat Rev Genet 11(5):345–355.

70. Kim D, et al. (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14(4):R36.

71. Pemberton T, et al. (2012) Genomic Patterns of Homozygosity in Worldwide Human Populations. The American Journal of Human Genetics 91(2):275–292.

72. Su Z, et al. (2014) A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nature biotechnology 32(9):903–914.


73. Kratz A, Carninci P (2014) The devil in the details of RNA-seq. Nature biotechnology 32(9):882–884.

74. Haas BJ, et al. (2013) De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols 8(8):1494–1512.

75. Schulz MH, Zerbino DR, Vingron M, Birney E (2012) Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28(8):1086–1092.

76. Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):1105–1111.

77. Su Z, et al. (2014) A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nature biotechnology 32(9):903–914.

78. Sheynkman GM, Shortreed MR, Frey BL, Smith LM (2013) Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-Seq. Molecular & Cellular Proteomics 12(8):2341–2353.

79. Sheynkman GM, Shortreed MR, Frey BL, Scalf M, Smith LM (2014) Large-scale mass spectrometric detection of variant peptides resulting from nonsynonymous nucleotide differences. J Proteome Res 13(1):228–240.

80. Chen L, et al. (2014) Transcriptional diversity during lineage commitment of human blood progenitors. Science 345(6204):1251033–1251033.

81. Parikshak NN, Gandal MJ, Geschwind DH (2015) Systems biology and gene networks in neurodevelopmental and neurodegenerative disorders. Nat Rev Genet:1–18.

82. Crosetto N, Bienko M, van Oudenaarden A (2015) Spatially resolved transcriptomics and beyond. Nat Rev Genet 16(1):57–66.

83. Wang Y, Navin NE (2015) Advances and Applications of Single-Cell Sequencing Technologies. Mol Cell 58(4):598–609.

84. Katsnelson A (2013) Momentum grows to make ‘personalized’ medicine more ‘precise’. Nat Med 19(3):249–249.

85. Rodriguez R, Miller KM (2014) Unravelling the genomic targets of small molecules using high-throughput sequencing. Nat Rev Genet 15(12):783–796.

86. Shapiro E, Biezuner T, Linnarsson S (2013) Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat Rev Genet 14(9):618–630.

87. Steinmetz LM, Davis RW (2004) Maximizing the potential of functional genomics. Nat Rev Genet 5(3):190–201.
