Library Preparation for High Throughput DNA Sequencing Anders Jemt

© Anders Jemt Stockholm 2017

Royal Institute of Technology (KTH) School of Biotechnology Division of Gene Technology

Science for Life Laboratory SE-171 65 Solna Sweden

Printed by Universitetsservice US-AB Drottning Kristinas väg 53B SE-100 44 Stockholm Sweden

ISBN 978-91-7729-212-8 TRITA-BIO Report 2017:1 ISSN 1654-2312

Anders Jemt

Anders Jemt (2017): Library Preparation for High Throughput DNA Sequencing, Division of Gene Technology, School of Biotechnology, Royal Institute of Technology (KTH), Stockholm, Sweden

Abstract Order 3 billion base pairs of DNA in the correct order and you get the blueprint of a human, the . Before the introduction of massively parallel sequencing a little more than a decade ago it would cost around $10 million to get this blueprint. Since then, sequencing throughput and cost have plummeted and now that figure is around $1000, and large sequencing centres such as the National Infrastructure in Stockholm is sequencing the equivalent of 25 human per hour. The papers that form the basis of this thesis cover different aspects of the rapidly expanding sequencing field.

Paper I describes a model system that employ massively parallel sequencing to characterize the behaviour of type IIS restriction enzymes. Enzymes are biological macromolecules that catalyse chemical reactions in the cell. All commercially available sequencing systems use enzymes to prepare the nucleic acids before they are loaded on the machine. Thus, intimate knowledge of enzymes is vital not only when designing new sequencing protocols, but also for understanding the limitations of current protocols. Paper II covers the automation of a library preparation protocol for spatially resolved transcriptome sequencing. Automation increases the sample throughput and also minimises the risk of human errors that can introduce technical noise in the data. In paper III, the power of massively parallel sequencing is employed to describe the RNA content of the endometrium at two different time points during the menstrual cycle. Finally, paper IV covers the sequencing of highly degraded nucleic acids from formalin fixed, paraffin embedded samples. These samples often have a rich clinical background, making them extremely valuable for researchers. However, it is challenging to sequence these samples and this study looks at the impact that different preparation kits have on the quality of the sequencing data.

Keywords: DNA, RNA, sequencing, massively parallel sequencing, library preparation, automation, genome, transcriptome

Sammanfattning Det mänskliga genomet, vår arvsmassa, består av lite drygt 3 miljarder baspar som om de struktureras korrekt bildar en ritning över en människa. Att ta fram den här ritningen kostade fram till introduktionen av massiv parallell sekvensering för lite mer än 10 år sedan, runt 10 miljoner dollar. Sedan dess har sekvenseringskostnaden sjunkit till den punkt där vi idag kan sekvensera ett humant genom för så lite som 1000 dollar. Stora sekvenserings center som National Genomics Infrastructure i Stockholm sekvenserar idag material motsvarande 25 humana genom per timme. Den här boken baseras på fyra artiklar som täcker in ett flertal olika aspekter av ett snabbt expanderande sekvenseringsfält.

Artikel I beskriver ett modell system för att studera typ IIS enzymer. Systemet använder sig av massiv parallell sekvensering vilket gör att enzymets funktion kan studeras med enorm precision. Enzymer är biologiska makromolekyler som cellen används för att katalysera kemiska reaktioner. Idag använder alla kommersiellt tillgängliga system enzymer för att förbereda nukleinsyror (DNA och RNA) inför sekvensering. Därför är det viktigt att förstå hur enzymer fungerar när nya DNA sekvenserings protokoll tas fram, men också för att förstå begräsningar i nuvarande protokoll. Artikel II redogör för automatiseringen av ett protokoll som används för att generera spatialt upplösta sekvenseringsbibliotek från RNA. Automatisering är en viktig del i att öka genomflödet av prover samt för att minimera teknisk variation som riskerar att dölja biologiska skillnader. I artikel III används massiv parallell sekvensering av DNA för att undersöka RNA innehållet i livmoderslemhinnan vid två tidpunkter under menstruationscykeln. Artikel IV handlar om sekvensering av nedbrutna nukleinsyror från formalin fixerad, paraffin innesluten vävnad. Dessa prover har ofta en rik klinisk historia vilket gör dem oerhört intressanta för forskare. Dock så är det mycket utmanande att sekvensera dessa prover och därför undersöks här hur olika kit påverkar kvalitén på sekvenseringsdata.

Anders Jemt

List of publications This thesis is based upon the work presented in the following publications.

I. Lundin S*, Jemt A*, Terje-Hegge F, Foam N, Pettersson E, Käller M, Wirta V, Lexow P, Lundeberg J. Endonuclease specificity and sequence dependence of type IIS restriction enzymes. PLoS One 2015; 10: e0117059.

II. Jemt A, Salmén F, Lundmark A, Mollbrink A, Navarro JF, Ståhl PL, Yucel-Lindberg T, Lundeberg J. An automated approach to prepare tissue-derived spatially barcoded RNA-sequencing libraries. Sci Rep 2016; 6: 37137.

III. Sigurgeirson B*, Åhmark A*, Jemt A, Ujvari D, Westgren M, Lundeberg J, Gidlöf S. Comprehensive RNA sequencing of healthy human endometrium at two time points of the menstrual cycle. Biology of Reproduction (in press)

IV. Jemt A, Ewels P, Hammarén R, Gruselius J, Borgström E, Wang C, Holmberg K, Lundin P, Lundin S, Gréen H, Lundeberg J, Käller M. Evaluation of methods for whole genome and transcriptome sequencing from nanograms of FFPE samples. Manuscript

*These authors contributed equally to this paper

Related publication

Zhang M, Schmidt T, Jemt A, Sahlén P, Sychugov I, Lundeberg J, Linnros J. arrays in a silicon membrane for parallel single-molecule detection: DNA translocation. Nanotechnology 2015; 26: 314002.

Contents INTRODUCTION ...... 1 NUCLEIC ACIDS, PROTEINS, CELLS AND TISSUES ...... 3 DNA ...... 3 RNA ...... 5 PROTEINS ...... 6 CELLS, TISSUES AND ORGANS ...... 7 SEQUENCING TECHNOLOGIES ...... 9 MASSIVELY PARALLEL SEQUENCING ...... 11 ...... 11 Illumina ...... 12 SOLiD ...... 13 Ion Torrent ...... 13 GeneReader, Qiagen ...... 14 Complete Genomics/BGISEQ ...... 14 SINGLE-MOLECULE SEQUENCING ...... 15 ...... 16 Oxford Nanopore Technologies ...... 16 UPCOMING TECHNOLOGIES ...... 17 GenoCare, Direct Genomics ...... 18 GENIUS, GenapSys ...... 18 Genia, Roche ...... 18 Firefly, Illumina ...... 19 Hyb & Seq, Nanostring Technologies ...... 19 Solid-state nanopore sequencers ...... 19 PREPARING NUCLEIC ACIDS FOR SEQUENCING ...... 21 EXTRACTION ...... 21 Challenging source material ...... 22 DNA LIBRARY PREPARATION ...... 23 Fragmentation ...... 23 Modification and Ligation ...... 24 Amplification and Enrichment ...... 25

Anders Jemt

Reaction clean up and size selection ...... 25 VARIANTS OF DNA SEQUENCING ...... 26 Targeted approaches ...... 26 Phasing and scaffolding ...... 28 Single-cell DNA sequencing ...... 29 RNA SEQUENCING ...... 30 Specialized RNA sequencing libraries ...... 31 Single-cell RNA sequencing ...... 32 Tissue contextual sequencing ...... 34 PROCESSING SEQUENCE DATA ...... 36 INITIAL QUALITY CHECK AND TRIMMING ...... 36 ALIGNMENT AND POST ALIGNMENT PROCESSES ...... 36 LIBRARY EVALUATION ...... 37 PRESENT INVESTIGATION ...... 39 PAPER I - ENDONUCLEASE SPECIFICITY AND SEQUENCE DEPENDENCE OF TYPE IIS RESTRICTION ENZYMES...... 39 PAPER II - AN AUTOMATED APPROACH TO PREPARE TISSUE-DERIVED SPATIALLY BARCODED RNA-SEQUENCING LIBRARIES...... 40 PAPER III - COMPREHENSIVE RNA SEQUENCING OF HEALTHY HUMAN ENDOMETRIUM AT TWO TIME POINTS OF THE MENSTRUAL CYCLE...... 40 PAPER IV - EVALUATION OF METHODS FOR WHOLE GENOME AND TRANSCRIPTOME SEQUENCING FROM NANOGRAMS OF FFPE SAMPLES...... 41 CONCLUDING REMARKS ...... 43 ACKNOWLEDGEMENT ...... 45 REFERENCES ...... 47

Anders Jemt

Introduction The has never before been so accessible. In 2004 a publication in Nature announced that the human genome had been sequenced to completion1. This marked the end of an endeavour that officially started 14 years earlier in 1990, when the Human Genome Project (HGP) was launched with the goal of sequencing the entire human genome2. The scale of the project was enormous, spanning 15 years and twenty centres in six different countries and an annual budget of $200 million. In May 1998, eight years after the HGP start, a company called Celera Genomics was formed with the explicit goal of sequencing the human genome and commercialising the information. Sequencing the human genome became a race between the academic community and Celera Genomics to generate the first reference sequence of the human genome. In June 2000 the leaders of the two initiatives held a press conference together with the US president, announcing the release of the draft sequence3,4. Finally the project resulted in two publications within the same week of February 2001, both describing the draft sequence of the human genome5,6. This would not have been possible without rapid advancements in sequencing technology and infrastructure. It is telling that by February 1999 only 15% of the genome had been sequenced7, but thanks to the investments made in technology development they were able to deliver the full reference well within the timeframe. A large part of the funding was spent on developing the infrastructure, sequencing model organisms, creating genome maps and identifying genes. Of the nearly three billion dollar funding, only $300 million was spent on the actual sequencing of the human genome8. Having sequenced the human genome, several other large initiatives were launched to actually understand its meaning9–12.

These technical achievements would not have been possible without rapid technological advances across the field. Thousands of individual researchers have slowly expanded the knowledge base in their particular area and contributed to the advancement of the entire field.

It is within this context that this thesis is written. All of the included projects have not only investigated scientific questions, but also explored the boundaries of what current technology is capable of. The first paper

1

investigates a certain class of restriction enzymes, which have many applications in molecular biotechnology. The next paper describes the automation of a protocol used for preparing spatially barcoded RNA for sequencing. The third paper looks into the RNA content of endometrium tissue and tries to establish how the levels of individual RNA transcripts differ between two time points in the menstrual cycle. Finally, the fourth paper examines the possibilities of sequencing highly damaged DNA and RNA from formalin fixed paraffin embedded (FFPE) tissue. These papers represent the author’s contribution to pushing the boundaries of what is possible.

2 Anders Jemt

Nucleic acids, proteins, cells and tissues Deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and proteins form the basis of cellular life. These macromolecules are maintained and ordered within a lipid bilayer membrane that together forms the cell. Encapsulation of genetic material into cells allows multicellular organisms to specialise their cells into different cell types, which can carry out highly specialised tasks. Similar cells can further organize themselves into tissues, and different tissue types can then order themselves into organs.

The general flow of information in a cell is as follows: DNA is transcribed into RNA, which is translated into proteins. DNA is used to store information, which is copied during cell division. There are also more uncommon processes such as reverse transcription, in which RNA is transcribed back into DNA13,14, and RNA replication, in which RNA is the template for new RNA molecules. The central dogma of molecular biology describes this flow of information between DNA, RNA and proteins. It was first outlined by Francis Crick in 1956 who simply stated: “Once information has got into a protein it can’t get out again”15. By “information” he meant the sequence of amino acids or related sequences. This idea was first described in 195816 and further refined in 197017 after the discoveries of the reverse transcription and the RNA dependent RNA polymerases.

DNA In human somatic cells the DNA is ordered into 2 sets of 23 chromosomes. These contain just over 3 billion base pairs and constitute the human genome.

In 1869 Friedrich Mischer isolated the substance found in the nuclei of the cell and named the isolate nuclein18. Richard Altman was later able to refine the isolation process and was able to obtain an isolate free of protein that he termed nucleic acids. Albrecht Kossel and Phoebus Levene made further contributions to the chemical understanding of the DNA molecule, identifying the individual ( (A), thymine (T), (C) and (G)) by 192919. this point it was not known that DNA carried hereditary information. It was not until 1944 that Oswald Avery, Colin McLeod and Maclyn McCarty suggested that DNA

3

carried such information20.The idea was met with much scepticism, as DNA was regarded as a far too simple a molecule to be able to harbour this information. However, a study published in 1952 by Alfred Hershey and Martha Chase supported this claim and DNA became accepted as the conveyer of genetic information21. The puzzle pieces were now in place, but not yet put together. The structure of DNA was still unknown. The problem was solved one year later when James Watson and Francis Crick, based on the X-ray diffraction patterns generated by Rosalind Franklin and Maurice Wilkins, presented the structure of DNA22–24. In their model they ordered the nucleotides into two chains that coil around each other, creating a double helix. The nucleotides consist a sugar deoxyribose linked on one side to a phosphate group and the other either a purine base (adenine or guanine) or a pyrimidine base (cytosine or thymine). These nucleotides are linked together at the sugar via the phosphate group, creating a backbone that allows the bases to face inwards. Erwin Chargaff had earlier discovered that the purines and pyrimidines existed in equal amounts25. This was used in Watson and Crick’s model by having paired to thymines and paired to via hydrogen bonds. At the end of the paper the authors modestly point out that the proposed structure with its complementary bases allows for a clever way of copying the DNA.

In parallel to this work on the DNA molecule, other scientists were trying to understand how traits are inherited. Early work on plants by Gregor Mendel concluded that some traits, such as the colour of the flower, were inherited as discrete units - the colour of the two parental plants do not mix to form a third colour. In 1909 Wilhelm Johansen named these discrete units of inheritance “genes”. It is important to understand that the gene is a concept invented to help us understand the genome and where the functional elements are placed. Thus, when our knowledge of the genome and inheritance expands, the gene definition changes as well. A recent definition of the gene describes it as: “…the union of genomic sequences encoding a coherent set of potentially overlapping functional products.”26 It is likely that the definition we have currently will be updated in the future. Traditionally, a gene is described as a unit of the genome that is transcribed into a functional RNA molecule. This is the definition that will be used in this thesis.

4 Anders Jemt

Although the genome is largely similar between individuals, small differences in the base pair sequence are common. They come in the form of single variations (SNVs) and larger structural variants (SVs) such as copy number variation (CNVs), including insertions, deletions and duplications. The 1000 Genomes project ran between 2008 and 2015 with the aim of creating a catalogue of human genetic variation; in 2015 the catalogue covered over 88 million variants27. Genetic variations are a main contributor to the differences between individuals. However, many genetic variations do not manifest themselves as a change in phenotype. Indeed, a study in 2015 found around 100 genes that could be deleted without apparent consequences for the carrier28. Finding the variants that lead to genetic disorders is one of the big challenges in the evolving field of genetic medicine.

RNA In simple terms, DNA is used for storing information while RNA is used to mediate information and control the transfer of information. RNA differs from DNA in several ways: RNA uses ribose sugars instead of deoxyribose, it is generally single stranded (though it often forms intra and intermolecular structures) and features the base uracil (U) in place of thymine (T). There are several classes of RNA serving different purposes in the cell. These can be broadly divided in two different groups according to whether they are directly involved in protein synthesis or not. The list here is by no means complete and is mainly here to give a broad picture of the RNA complexity. The major classes of RNA involved in protein synthesis are ribosomal RNA (rRNA), transfer RNA (tRNA) and messenger RNA (mRNA). rRNA is an integral part of the ribosome and is the most abundant form of RNA. It can constitute up to 90% of the entire RNA content in each cell29. tRNA is responsible for recruiting amino acids to the ribosome. mRNA serves as the template during protein synthesis and can also be called protein-coding RNA. The other group of RNA species are involved in RNA modifications such as small nuclear RNA (snRNA) and small nucleolar RNA (snoRNA). There are also RNAs involved in the regulation of the transcription and translational processes such as a long noncoding RNA (lncRNA), micro RNA (miRNA), antisense RNA (iRNA) and small interfering RNA (siRNA).

5

The transcribed RNA content in its entirety is called the transcriptome. While DNA can be regarded as an almost static molecule (the bases and their order being mostly the same in the majority of the cells within an individual), RNA is a dynamic molecule. The levels of individual RNA transcripts vary greatly across cells and thus the transcriptome differs between cell types. Two papers in Nature from 2012 found 62.1% and 74.7% of the genome to be covered by processed and primary transcripts, respectively, and 80% of the genome to be functional30. However, these figures have been disputed31,32. The transcriptome adapts as the cell responds to changes in it surroundings, making the transcriptome an interesting marker to study.

Messenger RNA has traditionally been given the most focus in comparison to other types of RNA. This molecule is transcribed from DNA in the nucleus in a premature state called precursor mRNA (pre- mRNA). These pre-mRNA transcripts consist of several blocks of coding sequences (exons) separated by non-coding sequence (introns), as well as untranslated regions (UTR) at the ends. The pre-mRNA is processed into mature mRNA by poly-adenylation of the 3’ end, addition of a capping sequence to the 5’ end and removal of introns from the transcript. The process in which introns are removed from the transcript is called splicing. Splicing is not confined to the removal of introns; exons can be removed from the transcript, introns can be retained and parts of the exons might also be removed together with the introns when the transcript matures. Through variation in splicing it is possible to get altered mature mRNAs from the same pre-mRNA, a process known as alternative splicing. When the transcript has been processed into mature mRNA, it is transported from the nucleus to the cytoplasm where it is translated into proteins.

Proteins Mature mRNA transcripts are translated into proteins by ribosomes. The RNA is translated in blocks of three bases, called codons, and each of these codons encodes one of the 20 amino acids or a start/stop codon. Several combinations encode for the same amino acid and this redundancy is why some mutations in the genetic code may not affect proteins33. Translation starts at the start codon, after which tRNA with complementary codons to the mRNA deliver amino acids that are linked

6 Anders Jemt

together into a polypeptide chain. When the machinery reaches a stop codon the newly formed chain is released. The amino acid sequence then undergoes post-translational modifications such as phosphorylation, glycosylation and lipidation, and is folded into the correct 3D structure to form a functional protein.

The entire protein content of a cell is called that cell’s proteome. The proteome is studied with techniques such as mass spectrometry and antibody arrays. The Human Protein Atlas (HPA) project was initiated in 2003 and aims to describe the proteome of all major tissues in the human body34. The field of high throughput proteomics is not as mature as genomics or transcriptomics; one of the reasons is that proteins lack the copying mechanism of its nucleic acid counterparts. As such, the transcriptome is often studied instead as a proxy to the proteome, which has been shown to correlate to the proteome on a qualitative level35–37.

Today, all big commercial DNA sequencing technologies utilize proteins, both during the sequencing and when preparing the nucleic acids for sequencing. Protein engineering has been, and still is, key in the development of DNA sequencing technologies.

Cells, tissues and organs The encapsulation of DNA, RNA and protein within a lipid bilayer is what constitutes a cell. In multicellular organisms the cells specialize, fulfilling different tasks thus forming cell types. Similar cells band together to form tissues, and several tissues can form larger structural units called organs. It is important to remember that the DNA sequence in the cells is the same – they only differ in which transcripts are expressed and by extension their proteins. Even though the DNA is the same, an organ can be extremely heterogeneous. Efforts such as the Genotype-Tissue Expression (GTEx) project aims to characterize these differences38. However, sequencing the transcriptome of an organ and its tissues only gives an overview of the state in the tissue - the finer nuances are lost in a bulk sample. It is possible to sequence multiple small parts of the tissue, even down to the single cell level. This provides a more detailed picture of the tissue and the cells within it. What is lacking from that picture is the interplay between cells and how they are ordered to form the tissue and later the organ. However, there are sequencing techniques that can

7

provide that information39–41 and thus allow us to gain new insights into cells, tissues and organs.

8 Anders Jemt

Sequencing Technologies In 1976 the bacteriophage MS2 became the first organism to have its genome sequenced42. MS2 is a single stranded RNA virus and the sequence was determined on the RNA level by digesting radiolabeled RNA and using two dimensional gel electrophoresis to separate the fragments43. The first DNA genome to be sequenced was the bacteriophage Phi X 174 in 197744, which was sequenced using Frederick Sanger’s plus and minus method developed a few years earlier. The plus and minus method creates radiolabeled double stranded DNA with sticky ends of different lengths that are split into two sets, the minus and the plus set. Each set is then split into four reactions, resulting in a total of eight reactions. In the minus set the newly synthesized strand is further extended in the presence of three out of the four bases. The plus set undergoes extension with only one of the four bases. Each of the reactions in the two sets has a different base omitted or present. The resulting eight reactions are analysed using gel electrophoresis for fragment size separation. The fragments from the minus reactions are used to deduce the next base to be incorporated while the plus reactions shows which base was the last to be incorporated45. Although a significant advancement from 2D gels, the system suffered from poor resolution over homopolymers, as the homopolymer length had to be estimated from the spacing of the bands in the gel46. Maxam and Gilbert took the next technological step when they published their chemical cleavage technique in 197747. Radiolabeled double stranded DNA is fragmented, split in four reactions and in each reaction different chemical cleavage agents are added. These agents cleave the DNA either at purines (A and G), at pyrimidines (C and T), at A or at C. The fragments generated from these four reactions are separated using gel electrophoresis creating a band for every sequence position, minimizing the homopolymer problem of the plus and minus method46.

A few months later, Sanger introduced the sequencing technology that would dominate the field for nearly 30 years48. The method came to be known as and circumvented the homopolymer problem of the plus and minus system by using chain-terminating nucleotides. These nucleotides lack the hydroxyl group of normal nucleotides and thus cannot form a bond to the next nucleotide. Sequencing was originally carried out in four reactions, each having all of

9

the normal deoxynucleotides, where one type is radioactively labelled, plus a small percentage of one . In each reaction the polymerase extends the nucleotide chain over the single stranded template until a dideoxynucleotide is incorporated. Since all bases are present and the chain terminating nucleotides are in minority, a wide range of different fragment lengths will be created. The fragments from each reaction are then separated on a polyacrylamide gel forming distinct bands at the point in the chain where a dideoxynucleotide has been incorporated. Thanks to its accuracy, robustness and ease of use, Sanger sequencing became the sequencing method of choice for many years and is still widely used for validation experiments49. Sanger sequencing was the method used in the Human Genome Project (described earlier) and the method was continuously improved over the years. Some of the most significant introductions to the technology include the switch from radiolabelling to fluorescence (first by using fluorescently labelled primers50 and later moving the label to the dideoxynucleotides51), automation52 and fragment separation by capillary electrophoresis53.

In 1998 Pål Nyren introduced a new method for sequencing, called pyrosequencing54. used a sequencing by synthesis (SBS) approach outlined in a patent by Melamede from 198555. In both Sanger sequencing and pyrosequencing, a primer is extended over a template sequence, but while Sanger sequencing uses separation of the labelled fragments by gel electrophoresis to deduce the sequence, pyrosequencing uses real-time detection of the incorporated base. Pyrosequencing relies on the indirect detection of pyrophosphate, which is released upon the successful incorporation of a matching nucleotide. The four bases A, G, C and T are added sequentially to the reaction. Sulfurylase then catalyses the addition of pyrophosphate to adenosine 5’-phosphosulfate (APS) forming adenosine triphosphate (ATP), which is used as a cofactor by luciferase in the conversion of luciferin to oxyluciferin, a light producing reaction. A camera is used to detect the burst of light, which scales with the number of incorporated nucleotides. Apyrase then degrades remaining ATP and non-incorporated bases and the next base is added. This technology would become the basis of the first commercially available massively parallel sequencing instrument.

10 Anders Jemt

Massively parallel sequencing Moore’s law is the name given to a prediction made by Gordon Moore that the number of transistors in each square inch of commercially available computer processors will double every two years56. This model has been able to predict the rapid pace of increased computational power quite accurately. A popular comparison in the field of sequencing has been to show how the reduction in sequencing cost has outpaced even Moore’s law57. This drop can be matched to the introduction of massively parallel sequencing. What connects the different sequencing platforms within this concept is miniaturization of the sequencing reaction, making it possible to run millions of reactions in parallel. In the early days of massively parallel sequencing the market was dominated by three systems: the Genome Sequencer (often called 454) from Roche, the Genome Analyzer from Solexa (later Illumina) and the SOLiD machines from . Since then new technologies have emerged, like the Ion torrent platform, Complete Genomics and the GeneReader, some of which have successfully grabbed a piece of the market.

454 Life Sciences In 2005, 454 Life Sciences (later acquired by Roche) released the Genome Sequencer 20 (GS20). Optimization and miniaturization of the pyrosequencing reaction had produced an instrument capable of generating ~300,000 reads of 100 bases of length in each run58. To achieve this, template molecules were clonally amplified on micrometre- sized beads in a process called emulsion PCR. Parallelisation was achieved by washing the DNA covered beads over a picotiter plate with wells big enough to hold one bead. The enzymes necessary for the light emitting reaction were coupled to smaller beads that would fit around the larger template bead. The dNTP’s were then added sequentially with an apyrase-washing step in-between. A charge-coupled device (CCD) monitored the plate and registered the light bursts that signals base incorporation. Further optimisations and parallelisation of the technology finally lead to an instrument capable of sequencing around 1 million DNA fragments to a read length of 700 bases59. The system was at the end not able to compete with the other technologies in terms of throughput and Roche decided to discontinue the sequencing platform in 201360. Another weak point for the platform was the sequencing of homopolymers longer than 6-8 bases, at which point the signal does not scale with the number of incorporated bases. This contributed to a relatively high rate of false insertions and deletions (indels) in the resulting sequences61.

11

Illumina The second massively parallel sequencing instrument to reach the market was the Genome analyser from Solexa (later acquired by Illumina). The Illumina platform uses sequencing by synthesis (SBS) with reversible chain terminating nucleotides coupled to fluorescent dyes. Clusters of clonally amplified single stranded DNA are generated on a flow cell in a process called bridge amplification, resulting in DNA fragments physically attached to the glass surface. Reversible chain terminating nucleotides containing a fluorescent label are added, extending the sequencing primers of each cluster by one base. After washing away excess nucleotides, the flow cell is imaged and each cluster lights up with the colour corresponding to the incorporated base. The fluorophore together with the chemical modification that prevents elongation is then cleaved from the sequences and the clusters are then ready for elongation again, with the next base62. Since the introduction of the Genome Analyzer I in 2006, Illumina has continued to develop their technology and diversified their portfolio of sequencing instruments to suit the varying demands of data output between laboratories. The instrument line-up now spans from the MiniSeq's 7.5 giga-base (Gb) output to the 1800 Gb output of the HiSeqX63. The core technology has remained largely the same with the exception of two recent developments: the introduction of patterned flow cells and a new two-colour SBS chemistry. These developments have allowed the company to increase output speed of sequencing whilst decreasing the cost.

Patterned flow cells allow for a higher cluster density by physically separating sequence clusters in nano-wells that cover the surface64. A standard Illumina flow cell features an open surface covered by adapter sequences, to which DNA fragments hybridize and form clusters during bridge amplification. Overloading the flow cell with DNA fragments leads to the entire surface being covered by clusters, which cannot be separated by the imaging software. Finding the loading concentration that allows for high data output while not overloading is key to maximising the output from the sequencers. In contrast, patterned flow cells use a new amplification strategy, called kinetic exclusion amplification, which makes the cluster generation process much less sensitive to overloading65.

12 Anders Jemt

The introduction of the two-channel SBS chemistry puts fewer requirements on the optics, which means that the instruments can be made cheaper and sequencing is faster66. In traditional Illumina sequencing A, G, C, and T are coupled to different fluorophores; the two- channel system only uses two dyes. T and C is each labelled with one dye while A is labelled with a mix of the two dyes and G is unlabelled. Clusters with a mixture of the dyes are thus called as A while a dark cluster is treated as G.

SOLiD Applied Biosystems’ SOLiD (Sequencing by Oligonucelotide Ligation and Detection) system does not use a SBS approach, instead relying on the ligation of detectable DNA probes to read out the DNA sequence. As is the case with the other two MPS technologies presented earlier, the sequenced fragments must first be isolated and clonally amplified either on beads or directly on the flow chip to create single stranded sequencing templates. Probes that have been labelled with one of the four available fluorophores (depending on the probe’s first two bases) are added to the flow cell and are ligated to a sequencing primer in the case of sequence complementarity. Redundant probes are washed off and the fluorescent signal corresponding to the first two bases of the ligated probe is registered. The 5’ end of the probe is blocked so as to only allow for ligation of one probe. After imaging, the block together with a few bases is removed to allow for the next cycle of ligation. By using sequencing primers of different lengths, each position on the sequencing template is read twice and the registered base is derived from the colour change that occurs when shifting two bases that are probed by one step61. This dependency leads to the possibility of isolating sequencing errors from variations in the sequence. Over the last years the SOLiD platform has lost ground to Illumina. The read length is shorter (75 bp for single end runs), data output is lower (max 320 Gb)67, the sequencing is more expensive and takes longer time compared to Illumina’s high output instruments61. However, the sequencing chemistry has been adopted in new applications such as in situ sequencing68.

Ion Torrent In a Nature paper from 2011, Jonathan Rothberg and associates from Life Technologies (now part of ThermoFisher Scientific) introduced the Ion Torrent sequencing platform69. It was the first platform to use a non-

13

optical readout, instead relying on fluctuations in pH to deduce the sequence. The platform shares similarities with 454 pyrosequencing in that it uses native nucleotides that are added sequentially to the sequencing reaction. Nucleotide incorporation results in release of pyrophosphate (measured in pyrosequencing) and a hydrogen ion, a by- product that the Ion Torrent registers. Clonal amplification of the sequencing template by emulsion PCR prior to sequencing ensures that enough protons are released upon incorporation for the change in pH to be detectable by an anion-sensitive field-effect transistor (ISFET) coupled to a complementary metal-oxide semiconductor (CMOS) device. Ion torrent sequencers suffer from the same type of limitations as 454 sequencing: a weakness to call homopolymer lengths correctly, resulting in indel errors. The current read length is 400 bp and the system is capable of outputting 10-15 Gb per chip70.

GeneReader, Qiagen The GeneReader from Qiagen is a relative recent addition to the sequencing market. The platform utilizes emulsion PCR in order to create clonal sequencing templates on beads. The actual sequencing reaction is very similar to the chemistry used by Illumina. Terminally blocked nucleotides are added to the sequencing reaction but only a fraction of these have been labelled with a fluorophore. After imaging, the fluorophores are cleaved off and a 3’-OH group is regenerated61. What separates the GeneReader from its competitors is the focus on a complete workflow, bundling the sequencing machine together with sample preparation robotics and analysis. Qiagen has also chosen to tailor the GeneReader towards clinical use by currently only offering gene panels targeting clinically actionable genes. Currently, there are no numbers available on read lengths and throughput.

Complete Genomics/BGISEQ In January 2010 Complete Genomics published a paper in Science describing sequencing by unchained ligation71. The sequencing template is prepared by first fragmenting the DNA, followed by repeated use of type IIS restriction enzymes that allow the ligation of sequencing adapters into the fragment. The fragments are circularized and amplified into “nanoballs” using rolling circle amplification (RCA). The resulting constructs are sequenced using anchor probes that hybridize to the adaptor sequence and a detection probe with degenerate bases and a

14 Anders Jemt

single known base coupled to a fluorophore. The probes are removed after imaging and a new set of probes are added. By alternating the known position in the detection probe the sequence can be deduced. The base call (the interpretation of the raw signal to a base) is independent of the previous reaction since the sequencing reaction is reset after each ligation, which contributes to an improved sequence quality. The business model of Complete Genomics was never to sell instruments, but rather sell sequenced human genomes as a service. The company was acquired by BGI in 2013 and an instrument based on the technology has since been launched for the Chinese market under the name BGISEQ-500. According to the company website the system is capable of sequencing a 100 bp paired end and outputting up to 200 Gb72, but this has yet to be confirmed.

Single-molecule sequencing All of the sequencing technologies described above use clonally amplified templates in the sequencing reaction in order to amplify the sequencing signal and reduce the impact of incomplete chemical reactions. Each read from the sequencer represents the consensus from sequencing several separate DNA fragments. A different approach to DNA sequencing is to read the sequence of one individual DNA molecule: single-molecule sequencing. Sequencing on this level is challenging as stochastic errors are directly translated to an incorrect base call instead of being masked by a consensus signal. However there are advantages such as longer read lengths and minimized GC bias. Helicos Biosciences was the first company to offer a single-molecule sequencer, the Heliscope73. Poly- adenylated DNA strands are hybridized to sequencing primers and the bases are called by adding labelled reversible chain terminator nucleotides followed by imaging and signal registration74,75. The system was used in the first single-molecule sequencing of a human genome76 and was also capable of sequencing RNA directly, without converting the RNA to DNA77. However the platform suffered from poor read length (35 bp) and limited throughput (35 Gb per run)78 and the company filed for bankruptcy in 201279. Today’s market is dominated by the system from Pacific Biosciences, but the MinION device from Oxford Nanopore Technologies is gaining traction and several other systems are nearing launch.

15

Pacific Biosciences The PacBio RS instrument from Pacific Biosciences was made commercially available in 201180. The technology, called Single-Molecule Real-Time (SMRT) sequencing, was first presented in a Science paper in 2009 and utilizes single DNA polymerases deposited at the bottom of a zeptoliter sized wells, called zero-mode waveguides (ZMW)81,82. The sequence is read by having the polymerase extend a template strand in the presence of fluorescently labelled nucleotides. As nucleotides are incorporated, fluorophores are cleaved off and diffuse out from the ZMW. The wells are continuously monitored by a CCD camera, which registers the colour and duration of the fluorescent signals. A successful incorporation leads to a prolonged fluorescent signal from the ZMW, as compared to signals registered from nucleotides freely diffusing in and out of the ZMW. The main benefit with the PacBio system is the longer read lengths. The original system had an average read length of about 3000 bases83 and the sequel to the PacBio RS, the Pacbio RS II, boasts an average read length of over 10 kilo bases (kb) with occasional reads over 60 kb84. Being a single-molecule platform, the long reads suffer from a high raw error rate of around 13%61. However, the errors are stochastic, and by creating shorter circular fragments the polymerase can make multiple passes over the same molecule, effectively creating a consensus read with an error rate of less than 1%. Recently, Pacific Biosciences released an updated machine, which is much smaller and lower priced, called the Sequel System. At the time of writing the details on the system are scarce, but according to Pacific Biosciences, users can expect a seven fold increase in throughput and an average read length between 8 kb and 12 kb85.

Oxford Nanopore Technologies Oxford Nanopore Technologies (ONT) first presented their USB stick sized DNA sequencer, the MinION, at the 2012 Advances in Genome Biology and Technology (AGBT) conference86. The company planned to make the instrument available to early access costumers by the end of 2012, but it was to take until 2014 before the instruments started reaching users87.

At the core of the technology is a protein nanopore, immobilized in a polymer membrane, through which a DNA strand is threaded88. As an electric potential is applied over the membrane an ion current flows

16 Anders Jemt

through the pore. While the DNA is translocated through the pore with the assistance of a motor protein, changes in this potential are registered. The pattern of these changes is characteristic of the nucleotide bases passing through the pore and can be translated to base calls89. The platform reads the signal continuously in the form of k-mers, with each possible combination having a distinct signature. Currently, the platform uses a modified CsgG protein pore90, and are able to pass 250 bases per second through the channel91, which registers the signal pattern from 6 bases. As with all single-molecule sequencers, the MinION suffers from a high raw error rate, currently around 15%. To improve the base call qualities, ONT has devised a simple library prep in which a hairpin loop is ligated to the end of the double stranded fragment. This enables both DNA strands to be sequenced, creating a so-called 2D read with improved accuracy (currently around 95%) as compared to just reading one strand92. The read length of the system is ultimately limited by the length and quality of the input DNA. Reads as long as 100 kb has been reported93 but most reads are considerably shorter. The MinION can have up to 512 simultaneously active pores. At the time of writing it is possible to register for an early access program for a larger version of the instrument, the PromethION, which has up to 144,000 simultaneously active pores.

The small size of the MinION sequencer and its simple library prep has opened up new possibilities and niches for DNA sequencing. During the Ebola outbreak in 2014 the MinION was deployed at a field hospitals for monitoring the virus94,95. Furthermore, the size and ease of use has seen the sequencer being used for sequencing on the International Space Station96. Other interesting applications are direct RNA sequencing97 and methylation studies98,99.

Upcoming technologies The sequencing market is today entirely dominated by Illumina. There are however several new technologies and applications being developed that seek to challenge Illumina’s supremacy. Some of them are years away from reaching the market, whilst others are preparing for launch in the near future. A common theme for many of these technologies is that they try to target new markets that Illumina has not yet cornered. This includes focusing on a specific geographical region or targeting the

17

clinical sequencing market, which in a few years has the potential to dwarf the academic market.

GenoCare, Direct Genomics Direct Genomics have revived the single molecule sequencing approach used by Helicos Biosciences, modified it and implemented this technology in the GenoCare sequencer. The instrument uses only a few library preparation steps: DNA shearing, denaturation and Cy3 labelling. The genes of interest are captured on the flow cell by hybridization to probes that double as sequencing primers. The Cy3 tag is used to identify the clusters, and labelled bases with reversible chain terminators are added sequentially. Imaging is carried out using Total Interference Reflection (TIRF) microscopy, creating an “evanescent wave” that only excites the molecules within 150 to 200 nm from the surface100. The authors report on a detection sensitivity of 3% minor allele frequency. The system is designed for gene panel sequencing in a clinical setting and is at this point only intended for the Chinese market.

GENIUS, GenapSys Details remain scarce on the GENIUS system from GenapSys. According to a patent application from 2011, co-authored by the founder of the company, the detection principle seems to be the change in pH and temperature that occurs upon successful incorporation of nucleotides101. The patent describes the detection of temperature changes as low as 0.003 °C in picoliter-sized wells on a semiconductor chip, similar to those used by IonTorrent. The company partnered with Sigma Aldrich in 2015 to co-market the GENIUS system, which was then said to be close to being launched102. The sequencer has yet to reach the market.

Genia, Roche One of the main challenges of is converting the signal from the nanopore into a correct sequence of bases. Fragments translocate through the pore quickly and this, together with the structure of the pore, means that signals are comprised of multiple bases. The Genia system is able to offer single base detection by sequencing tags instead of bases on their CMOS nanopore chip. A Phi29 DNA polymerase is covalently attached to a α-haemolysin nanopore situated in a lipid bilayer. The polymerase latches on to a primed single stranded DNA sequence and begins to synthesize a complementary strand using bases with molecular tags added to them. As the nucleotide is incorporated, the

18 Anders Jemt

tag is removed and translocates through the nanopore. Each base has a different tag, which compared to nucleotides has a much more distinct signal when translocating through the pore103. As of 2014 the company is a part of Roche104, and is focusing the sequencer towards the clinical market105. There are no reports as to when the sequencer can be expected to reach the market.

Firefly, Illumina The Firefly project at Illumina seeks to develop a CMOS based SBS instrument. Currently, Illumina uses multi-channel detection systems to call bases (four-channels in the HiSeq and MiSeq systems and two- channels in NextSeq and MiniSeq). CMOS sensors are one-channel devices, which means that it only registers a binary on or off signal. Ion Torrent solves that problem by adding bases sequentially and registering the signal. Illumina seeks to develop a one-channel variant of its SBS chemistry where T has a permanent label, A has a removable label, G has no label and C starts with no label but can accept one. The base is deduced by imaging the chip two times. After the first image, the label is transferred from A to C and by combining the information from the two images the base can be called. Launch is expected during the second half of 2017106.

Hyb & Seq, Nanostring Technologies At AGBT 2016, Nanostring technologies showcased their enzyme free sequencing principle for gene panels. The principle is based on their nCounter technology, originally designed for transcript quantification. Single stranded DNA is deposited on a flow cell and the sequence is interrogated by hybridizing detection probes. The detection probes have a 10 base long recognition sequence, the middle 6 bases of which are the sequence that is being interrogated, and a reporter sequence that is used to read the sequence. The reporter sequence is an expanded version of the recognition sequence to which optical barcodes are hybridized in a sequential manner in order to read the sequence. The same sequence can be interrogated multiple times, reducing the error rate. The company is targeting the clinical market with this system but launch is still years away107.

Solid-state nanopore sequencers The nanopore sequencers of today use a biological protein pore placed in a lipid bilayer. Fabricating synthetic in a solid-state membrane

19

would potentially increase stability, facilitate mass production and allow for easy variation in pore size. However, fabrication of solid-state nanopores has proven to be challenging, and sequencing devices based on this technology are still far away. The first pores were fabricated in a 108 109 Si3N4 membrane but today a verity of materials can be used . Recently, single-layer materials such as graphene have sparked much interest. The thickness of a graphene membrane is comparable to the size of a single nucleotide, which potentially would facilitate signal recognition. Graphene is also electrically conductive thus making it possible to register changes in the in-plane current as the molecule translocates through the pore. Several issues, aside from expensive fabrication, need to be solved before graphene nanopores can be implemented in a sequencing device. The pore is prone to clogging, the signal is dependent on the orientation of the DNA molecule and the translocation speed is too fast for single base pair detection110. This may, however, be the future of DNA sequencing.

20 Anders Jemt

Preparing nucleic acids for sequencing Currently there are no sequencing machines capable of using a raw sample, such as blood or tissue, as input. Instead the nucleic acids need to be separated from proteins, cell debris and other contaminants in a process called nucleic acid extraction. Furthermore, the naked nucleic acids need to be modified to suit both the sequencing system and the question at hand. This process is referred to as the library preparation.

Extraction The principle of isolating nucleic acids has remained largely similar for many years. The workflow can be divided into three parts. First is the disruption of the tissue and cell lysis. This is followed by the isolation of the nucleic acid of interest. Finally, some sort of quality control is usually performed. In this section DNA extraction is described. The process is similar for RNA extraction, though some special considerations will be briefly mentioned.

Depending on the source material, disrupting tissue and lysing cells can be a challenging process. For example, gram-positive bacteria and plant cells have sturdy cell walls that effectively protect the contents of the cells. Most cells are however quite delicate, and can be disrupted and lysed using salt, detergents and proteases111. If the cells are part of a larger tissue biopsy it might be necessary to mechanically disrupt the structure using different shearing techniques such as bead milling, high pressure or sonication. The cell has dedicated organelles (the lysosomes) for degrading biomolecules. Upon disruption of the cells, the DNA will come into contact with the nucleases in the lysosomes, which will degrade the DNA if not neutralized by detergents and proteases.

Isolating the DNA from cell debris and other impurities can be done by binding the DNA to a solid support and washing away unwanted molecules before re-suspending the DNA. The solid support is most often a silica matrix, which in the presence of chaotropic salts binds DNA112 as water molecules become unavailable112. This bond is stable in alcohol, which is used to wash the silica bound DNA. When the silica is re- hydrated, the DNA is released from the matrix and can be collected. The silica matrix can be in the form of a spin filter or micrometre sized beads113, which can be made in a paramagnetic material for easy

21

handling. Another approach is to use liquid phase extraction with phenol:chloroform:isoamyl alcohol. This produces a very clean product but the method is laborious, uses hazardous chemicals and is therefore avoided when possible. RNA is a less stable molecule then DNA. Furthermore, RNAses that can rapidly degrade RNA are ever-present and are highly resistant to inactivation114. Thus, in order to be able to extract high quality RNA, great care must be taken right from the time the sample is collected. The best method to ensure high integrity RNA is to rapidly freeze the samples to - 80° C (flash freezing). This is often not practically possible and several alternative methods to preserve RNA have been developed. Chemical cross-linking of proteins and nucleic acids with the right amount of formaldehyde solution has been proven to be an effective and inexpensive method115. There are also commercial offerings that help in preserving RNA integrity116.

Sequencing library quality is highly dependent on the quality of the nucleic acids that are used as input material. Quality metrics can include the lengths of the nucleic acids as well as sample contamination with DNA/RNA or proteins. Fragment lengths can be measured using gel electrophoresis or, for greater resolution, capillary gel electrophoresis. A frequently used quality metric of RNA is the RNA integrity number (RIN), where a value of 10 reflects a fully intact sample and 1 is completely degraded117. Protein contamination can be estimated by measuring the absorbance at 260 nm and 280 nm using a spectrophotometer. Nucleic acids have their absorbance maxima at 260 nm while proteins (mainly the amino acids tyrosine and tryptophan) absorb strongly at 280 nm. The ratio of these two measurements gives an approximate value describing the level of protein contamination. The nucleic acid quantity can be measured using either absorbance at 260 nm or by having fluorescent dyes binding specifically to the DNA or RNA. Fluorescence based methods offer greater accuracy and are therefore preferred118.

Challenging source material Compared to fresh in suspension samples such as blood, extracting high quality nucleic acids from formalin fixed paraffin embedded (FFPE) samples is difficult. This preservation method is very common for clinical biopsies, as it leaves the sample intact for histological analysis. However,

22 Anders Jemt

the method introduces artefacts in the DNA sequence119 and the extracted material is generally of low quality. Two additional steps are required when extracting nucleic acids from FFPE samples. First is the deparaffinization of the sample. This has classically been done using xylene followed by washing with ethanol but xylene-free deparaffinization is increasingly common in commercial kits. Secondly, the crosslinks between the nucleic acids and their surrounding molecules need to be reversed, which is often done by simply heating the sample120. The temperature and time for this step is dependent on the type of nucleic acid that is being extracted. RNA is more sensitive than DNA and excessive heating will degrade the RNA fragments121. After this step the process of nucleic acid extraction follows the normal procedure as described above.

DNA Library preparation Since all the papers included in this thesis have used Illumina sequencing, this section will take an Illumina-centric approach to library preparation. Many of the steps and basic techniques are shared across the sequencing platforms, however. The typical protocol follows this workflow: DNA fragmentation, size selection, modification of fragment ends, ligation of adapters and amplification, with reaction clean-ups in between steps. This creates a paired end library, which has adapters with sequencing primers on both ends of the genomic fragments, enabling sequencing from both sides into fragment.

Illumina dominates the field with its massively parallel short read sequencers. Reads are cheap but short, which poses a problem for certain applications such as genome assembly, discovery of structural variants and haplotyping. Other technologies (PacBio and ONT) can provide longer reads but at a higher cost per base. Several specialist protocols, as well as bioinformatics methods, exist that try to harness the massive data output from the Illumina machines to form artificial long reads.

Fragmentation DNA extracted from cells or fresh frozen tissue can reach lengths of 100 kb. Manipulating DNA fragments of these lengths is challenging, and they do not amplify well. Fragmenting the DNA into more manageable sizes also means that more of the bases become available to the short read sequencer. For genome re-sequencing, Illumina recommends fragment

23

sizes of around 450 bp. A good DNA shearing technique should have minimal sequencing bias, yield a tight length distribution of fragments and be controllable. The strategies for shearing DNA can be divided into three categories: physical (nebulization, sonication, point-sink shearing), enzymatic (fragmentase, transposase) or chemical. Two widely used techniques are acoustic shearing with Covaris’ ultrasonicators or use of transposase for simultaneous fragmentation and adapter incorporation. Covaris uses a technology that they call Adaptive Focused Acoustics. High frequency sound waves are focused on the sample in bursts, causing cavitation that shears the DNA. Advantages include minimal sample loss, no sequence bias and low risk of contamination. Transposase based fragmentation uses a hyperactive variant of the Tn5 transposase122. By including fragmentation and adapter incorporation in the same 5 minute reaction, the library preparation is significantly quicker when compared to other methods. One negative aspect of the method is that the transposase enzymes have a slight sequence bias122.

Modification and Ligation The fragmentation process creates a mix of double stranded DNA fragments with sticky ends (3’ and 5’ overhangs). Before adapters can be ligated to the fragments, the ends need to be made blunt and active. A typical enzymatic cocktail contains T4 DNA polymerase, which fills in the 5’ overhangs. Thanks to the enzyme’s strong 3’ to 5’ exonuclease activity, any protruding 3’ ends are removed. Furthermore, the mix often includes a T4 polynucleotide kinase that will activate the fragments by phosphorylating the 5’ ends. To add sequencing adapters to the fragments, a common strategy is to first add a single non-template A nucleotide to the 3’ ends of the fragment, using the terminal transferase activity of certain DNA polymerases, such as Taq polymerase123. The adapters can then be designed with a 5’ T overhang, which lowers the number of adapter dimers that form during the ligation process. The adapters contain a sequence for hybridizing and amplifying on the flow cell surface during the cluster generation process. The adapters also feature a region to which the sequencing primers can hybridize, as well as a sample specific index sequence, allowing multiple samples to be loaded onto the same flow cell.

24 Anders Jemt

Amplification and Enrichment With adapters ligated to the DNA fragments, the library is commonly amplified using the polymerase chain reaction (PCR). Exponential amplification of DNA fragments with PCR is one of the techniques that has transformed the field since its development in 1986124. It is the most widely used technique for amplifying DNA, mainly due to its simplicity and immense capability to amplify single fragments of DNA into millions of copies. Selective amplification also allows PCR to work as an enrichment technique, using primer sequences that hybridise to the adapter sequences in both ends of the fragment. However, PCR-free library preparation protocols have started to become the norm for sequencing projects with no shortage of starting material. The advantage of using a PCR-free protocols are minimised sequence biases, fewer sequencing errors and a more uniform coverage of the genome125.

Reaction clean up and size selection A reaction clean up is done prior to loading the finished library on the DNA sequencer for clustering. Aside from a change of buffers, the clean up also serves as a size selection step. The PCR library amplification process and the flow cell amplification favour shorter fragments over longer ones. Without a library size selection, a large fraction of the reads would originate from amplified adapters and PCR artefacts such as primer-dimers. Loading very long fragments on the flow cell is also wasteful. They will not amplify well, and form diffuse clusters that waste sequencing capacity. Fragments as long as 1500 bp can be amplified on unpatterned flow cells, albeit with lower efficiency126. The techniques for purifying the library follows the same principles as described in the extraction section (capturing the DNA on a silica matrix or binding it to paramagnetic beads).

A common strategy for combined size selection and purification uses paramagnetic beads coated in carboxylic acids in a NaCl and polyethylene glycol (PEG) solution127. PEG acts as a crowding agent, causing the DNA to aggregate, and the sodium ions form a salt bridge between the DNA backbone and the carboxylic acids on the beads. Longer DNA precipitates more easily and by adjusting the PEG concentration it is possible to set the lower cut-off128. A size selection step with both a lower and higher cut- off can be done by performing two bead purifications with different PEG concentrations129. Bead purifications are easy to integrate into an

25

automated protocol but the cut-offs can be inaccurate. Semi-automated gel excision systems offer a narrower span of insert sizes but at the cost of a lower throughput130.

Variants of DNA sequencing The protocol described above is typical for re-sequencing a reference organism. Depending on the sample and scientific question, it may be more appropriate to employ a protocol developed for a specific purpose.

Targeted approaches It is possible to make a selection within a DNA library for specific sequences of interest. For example, sequencing only the protein coding parts of the genome is called exome sequencing. The exome constitutes only 1% the size of the genome, but for Mendelian disorders (disorders that originate from a single faulty gene) it harbours 85% of the disease causing mutations131. The exome can be isolated from the finished sequencing library by hybridizing biotinylated probes to coding sequences. The hybridized probes and sequences can then be separated from the rest of the library with streptavidin covered paramagnetic beads. The advantage of exome sequencing is that a higher coverage over the regions of interest can be obtained for the same total number of sequenced reads. However, coverage outside of these regions is lost and as the cost of whole genome sequencing decreases so does the need for whole exome sequencing.

For applications that require a large number of samples, like population genetics, sequencing library complexity reduction might be desirable. It is possible to increase coverage and obtain confident variant calls while keeping the cost low by using restriction site-associated DNA sequencing (RADseq). This achieves reduced representation by digesting the genome with one or more restriction enzymes, followed by adapter ligation to the fragment ends and sequencing132. Because cut sites are sequence specific, the same fraction of the genome is enriched in all samples. The reduction level can be adjusted to match the budget and number of samples by using restriction enzymes that have more or less common sequence motifs in the genome. This approach also lends itself well to situations where no reference genome is available.

26 Anders Jemt

Several other specialized library preparation protocols exist. Chromatin immunoprecipitation sequencing (ChIP-seq) is used to study the interaction between DNA and proteins. DNA is cross-linked to nearby proteins, followed by enrichment using antibodies that target the proteins of interest. These fragments are then used to create a sequencing library133. By using antibodies that target DNA binding proteins such as transcription factors, it is possible to enrich for enhancer or repressor regions, and gain insights into gene regulation. Another approach is to enrich for active parts of the genome with ATAC-seq (assay for transposase-accessible chromatin with high-throughput sequencing). The transposase in ATAC-seq preferentially targets regions of the genome that are not tightly packed, resulting in a selection for regions of open chromatin.

Parts of the genome that are distant on a linear sequence level might be in close proximity in the three-dimensional state of the cell. Chromosome conformation capture techniques, such as 3C, 4C, 5C and HiC, are used to study how chromatin is organized. It relies on crosslinking chromatin fragments that are in close proximity, followed by fragmentation and dilute ligation of the DNA fragments. DNA sequences that were physically close in the cell will be joined together and can be used to create a sequencing library134.

5-methyl cytosine is a common base modification found in many species. Promoters with high levels of this mark often belong to inactive genes in mammals, and genome-wide methylation promotes genome stability. The pattern of 5-methyl cytosine marks in the genome can be studied using bisulfite sequencing. In bisulfite sequencing DNA is subjected to treatment with sodium bisulfite, which converts unmethylated cytosines to uracil. The sample is then sequenced and the remaining C’s are identified as methylated135. Bisulfite sequencing requires complete conversion of the DNA in order to minimize the rate of false positives. The protocol is unable to distinguish between methylated and hydroxymethylated cytosines. Alternative protocols such as oxidative bisulfite are able to make this distinction. Single molecule sequencers show promise as well; PacBio has demonstrated that their sequencer can distinguish between methylated and non-methylated136. There are also

27

indications that nanopores from ONT might have the possibility to identify methylated bases98,99.

Phasing and scaffolding Human somatic cells contain two alleles, each with its own unique set of variants. Using regular short-read Illumina paired end sequencing it is not possible to identify which allele variants originate from. In a case with two deleterious heterozygote variants on the same gene it is important to know if the variants map back to the same allele (leaving one copy of the gene intact and functioning) or if both of the alleles are affected. Placing the variants into consensus allelic blocks is called phasing. Long read single molecule techniques can offer phased blocks natively but there are also specialized library preparation protocols for Illumina sequencing that can generate phased data, such as Illumina Synthetic long read sequencing (formerly known as Moleculo)137,138 and the Chromium platform from 10x genomics139. Both methods work by compartmentalizing a few long fragments of the genome so that there is a very low risk of having two copies of the same genomic region in the same compartment. The fragments are amplified using primers with compartment specific barcodes that after sequencing are used to construct blocks of phased reads. The 10x approach uses a droplet station for compartmentalization, while the offering from Illumina use 384 well plates.

The human genome has a large number of repetitive elements140 and a lot of structural variants. Paired end libraries are not well suited to resolve these regions, as the relatively short inserts do not span the repetitive elements. The challenge is even greater for organisms that lack a reference genome. For these organisms the reads are first assembled into contigs (blocks of overlapping short read sequences) that are scaffolded together over repetitive regions. A possible solution is to increase the library insert size to a length where it spans the repetitive region. However, as mentioned previously, longer fragments do not cluster well on the flow cell and some regions in the genome are notoriously hard to bridge. Here, long read techniques such as PacBio141 and ONT offers advantages. Short read techniques, like Illumina sequencing, can circumvent the problem by employing a mate pair library preparation protocol. The protocol requires high molecular weight DNA which is sheared to ~5-20 kb followed by size selection to remove short fragments.

28 Anders Jemt

The fragments are then circularized so that the ends of the fragments are joined together via an adapter sequence. A second fragmentation then takes place followed by selection for the adapter sequence. Sequencing adapters are added to the end of the fragments containing the adapter sequence and after amplification, the product is sequenced. The resulting reads will then originate from the ends of the original 20 kb fragment and can be used for joining contigs together in a scaffold. This is a very labour intensive protocol and requires large quantities of high quality DNA. Illumina has a library preparation kit that utilizes transposase for simultaneous fragmentation and adapter insertion but the process remains cumbersome.

Single-cell DNA sequencing Even though the genomes of somatic cells are nearly identical, there are variations to be found. These somatic variations are more common in certain genomic regions where high genetic variability can even be desirable, such as the immune gene loci. Single cell sequencing is one of the tools that can be employed to find these differences. The technique also has great potential in cancer research, where single cell sequencing can be used to map tumour heterogeneity142. There are several technical challenges associated with single cell sequencing. Isolation of single cells from a tissue can be a tricky and laborious process. This is followed by a highly sensitive amplification process needed to generate enough material for DNA sequencing.

Isolating a single cell can be achieved with laser capture microdissection (LCM) or other micromanipulation techniques. In LCM a tissue section is visualized under a microscope and a laser is use to isolate the cells of interest. The technique works well when the sample size is small, but is not well suited to large-scale experiments. In high throughput techniques, the tissue is dissociated to form a single-cell suspension. Compartmentalizing the single cells from this suspension can then be achieved by using dilutions, droplet stations, or fluorescence-activated cell sorting (FACS). Human somatic cells contain two nearly identical copies of the genome, with a total mass of around 7 picograms. Broadly speaking, there are three approaches for amplifying these minute amounts of DNA: PCR based amplification, isothermal amplification, and a combination of both143. PCR based techniques use random priming to introduce sequencing adapters, which can be used as primer targets in a

29

subsequent PCR. Isothermal approaches, like multiple displacement amplification (MDA), use an isothermal polymerase with strand displacement activity, such as Phi29, along with random priming to amplify the material. As DNA strands are displaced by the polymerases, new sites become available for primers and further amplification. Hybrid approaches, like Multiple Annealing and Looping Based Amplification Cycles (MALBAC)144 and PicoPLEX145, first employ a few cycles of linear amplification with a strand displacing polymerase before the fragments are amplified using PCR. The purely PCR based methods can struggle with covering the entire genome, leading to regions where one or both alleles are missing (allelic dropout) but have a more uniform coverage of the amplified regions143. A 40% coverage over the genome and dropout rates as high as 76% has been reported for PCR based techniques146. MDA approaches seldom yield a uniform amplification but cover more of the genome (~80% coverage) with a dropout of around 35%146. The isothermal polymerases used in MDA are less error prone leading to fewer false positive variants. Deeper sequencing can potentially alleviate the non-uniform amplification, but calling copy number variations from this data is hard143. Hybrid approaches try to combine the best from both techniques and consequentially share some characteristics of each143. The coverage over the genome is generally not as high as for MDA techniques (between 50% and 70%) but the allelic dropout is lower (~25%)146.

RNA sequencing The Illumina sequencing machines are not capable of sequencing RNA. There are indications that PacBio and ONT could offer this possibility in the future97,147, but there are currently no commercially available kits. Instead, RNA must be converted into complementary DNA (cDNA) prior to sequencing using a reverse transcriptase. In all forms of RNA sequencing the library preparation enriches or depletes for one or several types of RNA. With this in mind it is important to consider the scientific question before preparing the library, so that the molecules of interest are preserved.

Protein coding mRNA has classically been the subject of most interest and is separated from most other RNA species by its poly-A tail. The occurrence of the poly-A tail makes it possible to select for poly-A transcripts using biotinylated poly-T probes. This method is used in the

30 Anders Jemt

stranded Illumina TruSeq RNA library preparation kits. Good quality RNA is of utmost importance if coverage over the full transcript is desired, as fragmented RNA can generate a strong 3’ bias in the sequencing data148,149. Another strategy for reducing RNA complexity is to deplete ribosomal RNA, which constitutes the vast majority of all RNA content. Two major strategies are used to reduce the rRNA content. The first captures rRNA using baits that are complementary to the rRNA sequences. These can then be removed using magnetic beads. The other approach uses Ribonuclease H (RNase H) and its ability to selectively degrade RNA when it is hybridized to DNA. DNA probes against the rRNA are added and RNase H degrades the rRNA. These DNA probes are then degraded using deoxyribonuclease (DNase). Both methods are effective and bring the potential benefit of preserving other noncoding RNA in the sample148. The enriched RNA is fragmented using heat and divalent metal cations150, after which cDNA is created using random priming and a reverse transcriptase in a process called first strand synthesis. This is followed by second strand synthesis, in which the RNA strand is replaced by a second DNA strand using DNA polymerase 1 and RNase H151. After this point the protocol follows the steps described above for constructing DNA sequencing libraries, starting with the end repair and adapter ligation.

It is beneficial to know from which of the library DNA strands the RNA molecule originated, and today most protocols retain this information, known as stranded protocols. A common way of achieving strandedness is to use dUTP instead of dTTP in the second strand synthesis. When the adapters have been ligated to the double stranded fragments the strand synthesized with dUTP can be degraded using Uracil-N-Glycosylase152.

Specialized RNA sequencing libraries Several variations of this standard protocol exist. Below follows a description of some of the more specialized protocols for other RNA sequencing applications.

RNA originating from tissues that have been preserved using FFPE techniques are typically heavily degraded and often only available in a limited amount. An interesting approach to generate RNA sequencing data from these samples is to employ sequence capture against mRNA. A cDNA library is created, after which baits hybridizing to the coding parts

31

of the transcriptome are added. The mRNA selection is then amplified using PCR and sequenced. A concern could be that this approach suffers from a non-uniform capture of targets, due to differences in hybridization efficiencies, which would skew the transcription pattern. However, initial reports show good correlation with standard RNA sequencing when performed on a good quality sample from the same tissue153.

Another interesting application is the sequencing of small RNA species. These lack a poly-A tail and are generally selected against in the size selection step preceding sequencing. In brief, the standard practice is to ligate adaptors to the fragments, amplify and then perform a size selection step that is targeted towards short fragments. The size selection is challenging, since the sequencing fragments will be very similar in size to the PCR artefacts. Furthermore, single stranded ligation with T4 RNA ligase has been shown to have a sequence bias, potentially leading to inefficient ligation to some small RNAs154.

Single-cell RNA sequencing An increasingly popular application of RNA sequencing is single cell transcriptomics. Sequencing the transcriptome of single cells makes it possible to identify rare cell types and probe tissue heterogeneity that would be masked in a bulk RNA sequencing experiment155. As with sequencing the DNA of single cells, efficient isolation and amplification are major hurdles to overcome when constructing transcriptome libraries from single cells. Extreme amplification of a limited number of starting molecules may also skew the original proportions. Introducing unique molecular identifiers (UMI) prior to amplification makes it possible to separate true duplicate transcripts from amplification duplicates156. Most single cell transcriptomics techniques use poly-T probes for capturing mRNA transcripts while at the same time priming the transcript for reverse transcription. It has been estimated that the efficiency of this step is only around 10%, which puts a limit on the sensitivity for all the methods157. After poly-A capture, the protocols have different approaches to amplification. One approach is to use a poly-T probe that also includes a T7 promoter. After reverse transcription this allows for amplification using in vitro transcription (IVT), a linear amplification method. The benefits of a linear amplification method include a more stringent amplification with fewer by-products as compared to PCR. However, IVT is a time consuming method and each round of IVT only gives around

32 Anders Jemt

1000-fold amplification of the target158. IVT is the amplification strategy that is used in CEL-seq159,160, MARS-seq161 and MASC-seq162. These methods use a UMI sequence and cell specific barcode in the poly-T probe to correct for amplification errors and to allow for early sample pooling. Another amplification strategy is to use a poly-T probe with a PCR primer sequence and a template switching reverse transcriptase, which adds a few non-template C’s to the DNA strand as it reaches the end of the transcript. The C’s are used to add a primer sequence (i.e. a template-switch primer) to the end of the transcript over which the reverse transcriptase can extend. The template switching activity ensures that only full-length transcripts are amplified in the following PCR. Finally, sequencing adapters are added through a tagmentation reaction with Tn5 transposase. In Smart-Seq2163 the poly-T probe has no UMI sequence and the samples are not barcoded until the final PCR right before sequencing. Thus the samples must be kept in separate compartments until they are loaded on the sequencer. However, because of this, the protocol is able to produce sequencing fragments from the entire transcript and not only the 3’ part of the transcripts, making it possible to study splicing as well. Two other protocols that utilize the same template switching mechanism but with a UMI sequence and a cell specific barcode in the poly-T probe are Drop-Seq164 and the Single Cell 3’ Solution from 10x165. These two methods create the cDNA library in droplets and thus enable studies of impressive amount of cells simultaneously. They are however confined to counting transcripts based on 3’ tags. A recent study compared CEL-seq, Drop-seq and Smart-seq2 (among other technologies)166. The authors found that Smart-seq2 was the most sensitive method, but the lack of UMI meant that it had the highest level of amplification noise.

Recently, protocols for the simultaneous preparation of DNA and RNA from the same cell have been published167–169. This offers an interesting opportunity to study how genomic variations, such as copy number variation, affects the transcriptome. Another recently published protocol, called scM&T-seq, combines RNA sequencing with bisulfite sequencing on the same single cell170. This could lead to new insight regarding what effect DNA methylation has on the transcriptome.

33

Tissue contextual sequencing There is an increased realisation that studying the transcriptome at the single cell level only provides a limited view of the cellular landscape. Most cells act and respond in the spatial context of their tissue; preserving this information is the aim of several recently published protocols.

Traditionally transcript localization has been studied using fluorescent in situ hybridization (FISH) microscopy. This method has been adapted to identify multiple fragments in single cells by sequential hybridization of fluorescent probes171,172. Impressive as these methods are, they do not sequence transcripts but instead identify and quantify them. There are two conceptually similar methods that actually sequence the transcripts of single cells in the tissue39,40. In fluorescent in situ sequencing (FISSEQ)39 a random hexamer primes an in situ reverse transcription after which the resulting cDNA is circularized. The circularized fragments are used in a RCA to create nanoballs that are covalently attached to surrounding macromolecules. The nanoballs are then sequenced whilst still within the tissue using chemistry similar to SOLiD. The other approach for in situ sequencing40 uses padlock probes173 against cDNA created in-situ. Upon hybridization of the padlock probe, the gap is bridged and a RCA creates nanoballs that are sequenced using a strategy similar to that which was employed by Complete Genomics. A variant of this approach uses barcodes on the padlock probes so that each barcode corresponds to a transcript. Sequencing the barcode instead of the transcript offer easier separation of similar transcript sequences.

The methods described above are not able to capture the full transcriptome. Therefore advanced computational methods have been devised that can infer spatial resolution from high-throughput single-cell experiments174,175. The methods rely on pre-existing spatial maps of the transcriptomics profile for the tissue in question. A transcriptional profile is created for each single cell from the high throughput experiment. This profile is compared to the spatial map, ideally from an adjacent section using either of the in-situ techniques described above, and each cell is assigned a probability based position. The limitation of this technique is that the spatial map and the sequencing data do not originate from the same cells.

34 Anders Jemt

Spatial transcriptomics offers the advantage of running spatially resolved high throughput RNA sequencing experiments41. The method uses barcoded poly-T probes spotted on a glass slide upon which a thin tissue section is placed. Following imaging and permeabilization of tissue, the transcripts hybridize to the poly-T probes directly below for in-situ reverse transcription. The cDNA library is cleaved from the slide and is amplified in-vitro, following a CEL-seq like library preparation. After sequencing, the spatial barcode is used to map the transcript back to the original image. The density of the spots limits the resolution to 100 µm, but higher density arrays will continue to push this figure further towards the single cell.

35

Processing sequence data This section of the thesis will briefly touch upon the steps after the sequencing, where raw data is translated into results. The focus will be on the steps used to verify the library quality and will be centred on Illumina data. A few programs will be mentioned, but there are an enormous amount of bioinformatics tools and alternative workflows available. Detailed descriptions on the tools as well as underlying algorithms are outside the scope of this thesis, and so are variant calling and linking variants with phenotypes.

Initial quality check and trimming The output from the sequencers comes in the form of plain text files, called FASTQ files. Each read is represented in the file by four lines. First a title line, followed by a line with the called sequence. Then comes a line beginning with a “+”, indicating that the next line contains the quality score for each called base176. The first step after sequencing is often to run a quick quality control on the raw reads to assess the sequencing run. A popular program to use is FastQC177. The program generates a report containing several metrics to evaluate the run, such as a visualization of the quality scores, sequenced adapters, and distribution of read lengths. Based on the report, a choice can be made as to whether the mapping would benefit from the removal of low quality bases and adapter sequence contamination at the end of the reads with sequence trimming. Adapter contamination comes from short insert sizes, which causes the sequencer to start to read into the adapter sequence. There are several tools that can be used for trimming the read; two popular programs are cutadapt178 and trimmomatic179.

Alignment and post alignment processes The next step is to place the reads into context. This is done by aligning the reads towards a reference genome, or if no reference exists by assembling the reads into contigs (only reference based mapping will be covered in this thesis). There are a multitude of aligners available, but two popular programs for aligning DNA reads against a reference are BWA- MEM180 and Bowtie 2181. DNA data can undergo realignment with the GATK182 toolkit to improve the mapping around regions that are known to be problematic. The final touch before variant calling is to use another GATK tool for recalibrating the base quality scores.

36 Anders Jemt

For RNA sequencing experiments, STAR183 and TopHat2184 are two widely used splice aware RNA aligners, for mapping RNA reads to the genome. Afterwards the most common application is to do a differential expression analysis between two time points or conditions. The first step is to quantify gene expression. This is done in different ways depending on which programs that will be used for the differential expression analysis. One path is to count the number of reads that cover each gene in the specified annotation file, using tools such as HTSeq-count185 or Subread featureCounts186. Each gene will be assigned a count and the resulting matrix can be loaded into R for read depth normalization and differential expression analysis using software packages such as DESeq2187 or edgeR188. This approach does not take into account that a long gene is much more probable to be assigned reads than a short one. This is not necessary when comparing the same genes across different samples. Another strategy is to take the gene length into consideration when assigning them an expression value. Cufflinks189 is one program that does this and assigns each read an FPKM (fragments per kilobase of exon per million reads mapped) value that can be used for differential expression analysis with cuffdiff190.

Library evaluation Depending on the complexity and the read depth, the data will contain a varying number of duplicated reads. These can be marked by running the Picard191 tool MarkDuplicates, which looks for aligned reads with the same 5’ coordinates. In a low complexity library, a limited number of molecules may have been excessively amplified, leading to multiple sequencing reads from the same initial molecule. Marking duplicates is especially important with DNA data, as the quality score of the variant calls can be inflated. In RNA data biological duplicates are expected for highly expressed transcripts. For RNA sequencing experiments the incorporation of a UMI during the library preparation process is advisable (especially for low input protocols), as it will help in separating biological from technical duplicates. It is possible to estimate the technical duplication levels in RNA libraries without the use of UMIs but it is not possible to correct them. dupRadar192 is a package written in R that looks at the amount of duplicates relative to the expression level of

37

each gene. Low complexity libraries will have high numbers of duplicates for all genes, including those that are lowly expressed.

An interesting metric to look at is how many of the sequenced reads the aligner can map to the genome, and of those how many are mapped to a single unique location. This can be a somewhat blunt measurement as the mapping rate is dependent on which settings are supplied to the mapper. It might be necessary to adjust these settings to find a balance between high mapping rate and erroneous mapped reads. Looking at the uniformity of the read coverage is also important. One reason for uneven coverage is that different positions are unequally available during the library preparation. The presence of regions with extreme GC content and biases during the fragmentation process are both factors that can lead to uneven coverage. Some regions are also notoriously hard to amplify or to align reads against.

Data from RNA sequencing experiments may contain sequence reads that originate from DNA. Signs of DNA contamination in RNA data include a large fraction of intronic reads and a loss of strandedness (if the preparation was done with a stranded kit). The effect of DNA contamination will be small on highly expressed genes, but the perceived expression of lowly expressed genes may be inflated, leading to a reduced range of variation. Problems with DNA contamination are best solved by treating the extracted RNA with DNase prior to library preparation. Even though most library preparation protocols include a selection against rRNA it is important to evaluate the success of this step. Often there will be some rRNA molecules left after either poly-A selection or depletion with probes. Both strategies are effective on high quality RNA, but poly A selection generally produces a cleaner result for low quantity samples148. Another metric to evaluate when doing RNA sequencing is gene body coverage. When preparing samples with poor RNA integrity it is preferable to use ribosomal depletion instead of poly-A selection, as it will be biased for the 3’ end of the transcript. 3’ bias is manageable for gene counting strategies149 but will lead to a loss of splice variants. Furthermore, poly-A selection of samples with both low input amounts and poor RNA integrity gives fewer molecules available for the library preparation, leading to a loss in sensitivity.

38 Anders Jemt

Present investigation The papers that form the basis of this thesis all cover different aspects of preparing nucleic acids for sequencing. Enzymes remain at the heart of the library preparation process and Paper I is a systematic study of type IIS restriction endonucleases. Automation is another important aspects of the library preparation process and Paper II describes the automation of the Spatial Transcriptomics protocol. In Paper III RNA from human endometrium tissue is extracted and sequenced for an exploratory study of how the tissue changes over a menstrual cycle. Paper IV evaluates the performance of commercially available kits for sequencing highly degraded RNA and DNA.

Paper I - Endonuclease specificity and sequence dependence of type IIS restriction enzymes. Type IIS restriction enzymes are a fascinating class of endonucleases for which the cleavage site is shifted relative to the recognition site. This unique characteristic meant that they were extensively used in early gene expression studies as well as in some massively parallel sequencing protocols. In this paper, 14 restriction enzymes with shifted cleavage sites are systematically evaluated in terms of their propensity to slip, where slippage is defined as cleaving one or two bases away from the preferred site. Furthermore we evaluate whether slippage tendency can be linked to preferences in the base composition in the shifted sequence. This is a largely overlooked property of these enzymes with important implications for future protocol development.

We designed an automatable model system with MPS readout that can be further expanded to encompass additional Type IIS enzymes. Four out of the 14 tested enzymes were found to slip one or more base in more than 15% of the restriction events. BpuEI had previously not been reported to slip but was found to slip in more than 40% of the events. We were unable to find any correlation between slippage tendency and factors such as cleavage distance or reaction conditions. This work highlights the need for evaluating each restriction enzyme separately.

39

Paper II - An automated approach to prepare tissue- derived spatially barcoded RNA-sequencing libraries. Spatial transcriptomics is a high throughput protocol for studying gene expression of cells in the context of their surrounding tissue. The protocol was developed in our research group and is extensively used by several researchers. The library preparation is laborious, originally designed to process four samples in parallel over the course of two days, and requires an extensive amount of hands on time. This motivated us to set up the protocol on a liquid handling robot. A slight modification to the protocol was necessary to facilitate handling on the robot but otherwise the automated protocol is the same as the manual protocol.

In a first step we verified that the robot did not introduce any sample-to- sample variation. In order to have control of the input material we designed a model system with commercially available reference RNA as input. Eight libraries were prepared with minimal variation, which encouraged us to benchmark the automated protocol against the manual protocol. In total six libraries were prepared from gingival tissue, three were prepared on the robot and the other three manually. We sequenced the samples using Illumina chemistry and generated more than 170 million reads per sample. By comparing the expression levels, both for the whole tissue and in an isolated region we could show that the correlation levels were consistently higher for the samples prepared on the robot. This confirms that manual preparation risks introducing technical variation in samples, which can mask true biological variation. Aside from giving more robust results, the robot was able to increase the throughput 4-fold while at the same time minimizing the hands-on time to a third of that in the manual protocol.

Paper III - Comprehensive RNA sequencing of healthy human endometrium at two time points of the menstrual cycle. This paper explores the transcriptome of the human endometrium. The endometrium is a highly specialized tissue in the uterus and undergoes extensive morphological changes during the menstrual cycle. Previous to this study the transcriptional landscape of the endometrium, concerning non-coding RNA as well as protein coding RNA, had not been extensively characterized in healthy women. This apparent lack of knowledge

40 Anders Jemt

motivated us to perform this study, which we believe will be of great importance when expanding to studies of infertility, pregnancy complications and menstruation disorders.

The study encompasses seven women, who were sampled twice during a normal menstrual cycle: once during the early proliferative phase and again 7-9 days after ovulation when the endometrium is receptive. The samples used in the study had been stored under suboptimal conditions and the extracted RNA had RIN values ranging from 4.3 to 9. In light of the varying sample qualities it was decided to use a ribosomal depletion approach, which would also allow us to study the non-coding RNA. Since we wanted to include the micro RNA in the study, two libraries were created and sequenced. The small RNA libraries was sequenced to an average of 36 million reads and the total RNA libraries to an average of 77 million reads, giving us a great chance to pick up lowly expressed transcripts. There were initial concerns that the samples would not cluster according to time point and instead each donor would form a separate cluster. However, a principal component analysis showed that the menstrual phase is a stronger separator than inter-individual differences. We applied differential expression analysis to identify several significantly expressed mRNA and miRNA transcripts that had never before been reported as differentially expressed in endometrial tissue. The results were cross-validated in order to remove genes that had support only in one sample. To further validate our findings we applied quantitative PCR against three differentially expressed mRNA and miRNA in three sample pairs, which showed excellent correlation with RNA sequencing data.

Paper IV - Evaluation of methods for whole genome and transcriptome sequencing from nanograms of FFPE samples. For many years the standard method of preserving samples has been fixation in formalin followed by embedding in paraffin. There are vast numbers of samples stored in this way, often linked to extensive clinical data. However, sequencing formalin fixed paraffin embedded samples is challenging. The fixation process introduces artefacts in the DNA and thus nucleic acids extracted from FFPE samples tend to be of inferior quality, shorter and cross-linked, compared to nucleic acids from snap

41

frozen samples. Most previous studies have focused on sequencing either the exome or transcriptome, and whole genome sequencing of these samples are not common practice. The project was performed in collaboration with the Genomics Application Development group at the National Genomics Infrastructure (NGI), which had a large demand from the users to sequence FFPE samples. Our goal was to identify suitable library preparation kits and bioinformatics workflows that would enable us to robustly sequence FFPE samples from a limited amount of starting material.

Whole genome sequencing libraries were prepared from the one tissue block and were sequenced to a minimum of 298 million read pairs. We assessed two different sample qualities, two different preparation protocols and also investigated the effect of a kit designed to repair highly damaged DNA. The data reflects the highly fragmented nature of FFPE samples. More often than not, the read pairs overlapped and the reads required extensive trimming to remove sequenced adapters. The data show that the library preparation kit from Rubicon Genomics yielded the library with the highest complexity. However, whole genome sequencing of DNA from FFPE samples remained challenging.

Two different approaches to RNA sequencing of FFPE samples were also investigated. In one, the preparation kit use sequence capture targeted towards the mRNA part of the transcriptome. In the other, a low input kit that uses random priming and rRNA depletion was used. Both kits have uniform gene body coverage, indicating that the capture kit did not introduce any bias towards any part of the transcripts. Generally, the sequence capture kit performed better with more reads aligning towards the protein coding part of the genome.

42 Anders Jemt

Concluding remarks The introduction describes the advent of human genome sequencing and how the technology has evolved to the point where consumers can have their genome sequenced for under $1000. When Illumina announced the $1000 genome back in 2014, their cost calculation was immediately questioned. The $1000 only included reagents, machine cost and the salary for a technician to prepare the library and load it on the sequencer. In reality the cost was never $1000, as it did not include factors such as real estate, electricity, analysis and storage of the data193. Furthermore, you had to buy ten HiSeqX machines at a total cost of $10 million, and run them at full capacity around the clock for a year in order to reach a cost of $1000 per genome. The first time the cost actually reached $1000 for the consumer was this year. However, at the moment it is unlikely that the company is able to yield profit based solely on the consumer price. Instead, they will probably need to launch additional services for which they can charge their customers. There is also the interesting question of what a genome, even in an anonymous form, is worth to people and organisations other than the consumer. Highly curated databases with rich phenotypic information linked to the data are certainly valuable194.

The thesis has not covered the growing field of clinical sequencing. For some conditions, such as cancer diagnosis and Mendelian disorders, sequencing is already a part of the diagnostic toolbox. Technical innovations will continue to drive down the sequencing cost and thus make it feasible to apply sequencing to more disorders. It is likely that sequencing will become a standard tool for diagnosis, prognosis and that it may even expand into pre-emptive medicine.

What will the sequencing filed look like in the future? For many years the prediction has been that solid-state nanopore sequencers will replace the current era of sequencers that use biological processes for both library preparation and sequencing. However, solid-state nanopores have proven to be both difficult to fabricate and to lack in sensitivity. It is likely that we will continue to use biological sequencers for the foreseeable future (which in this field rarely means more than five years). This means that intimate knowledge of enzymes and their function will continue to be important in the preparation of nucleic acids for sequencing. The need for automated library preparations will only continue to increase as the

43

quantity of samples subjected to sequencing increases. There are also important insights to be made from the wealth of FFPE samples, which often have a rich clinical history that accompany them. It is now possible to sequence these samples, though it remains challenging. There is therefore a need for new improved extraction methods and library preparation protocols in order to improve data quality for the most degraded samples

The completion of the human genome project in 2004 has left the genomic field in search for the next great multinational endeavour. Recently, an initiative called the Human Cell Atlas195 has been launched with the intent of sequencing all the cell types in the human body at a single cell level. The advent of single cell sequencing and spatial mapping means that this is now achievable. The body map would not be limited to the transcriptome, but rather integrate the genome, transcriptome and epigenome. This will yield new spectacular insights into the heterogeneity and interplay between the cells in the body. And as with the human genome project before, numerous new technologies, collaborations and standards will follow in the wake of the project.

44 Anders Jemt

Acknowledgement När folk frågar mig vad jag tycker om att doktorera brukar jag säga något i stil med att det inspirerande, kul, mycket frihet och yada yada... Allt det är sant men det som verkligen gör doktorandtiden till något speciellt är alla fantastiska vänner och kollegor.

Joakim, ett enormt tack för möjligheten att doktorera i din grupp. Det var en del tuffa projekt i början men du lyckas alltid förmedla entusiasm och tilltro under våra möten. Har man varit doktorand i din grupp blir de svenska fjällen är aldrig samma sak igen. Tack också till mina bihandledare Afshin och Henrik. Afshin, vi har inte haft många möten under mina år som doktorand men du var även min huvudhandledare under exjobbet och från den tiden vet jag att dörren alltid står öppen. Henrik, så oerhört betydelsefull under min första tid som doktorand. Då med välbehövlig praktisk handledning och på senare tid mer som mentor, kollega och vän.

En stor del av den här avhandlingen och hela doktorandtiden består av de artiklar som följer i andra halvan av avhandlingen. Utan alla mina begåvade medförfattare hade det alltså inte blivit mycket till avhandling. Avhandlingen blev också avsevärt bättre av att Savo kritiskt granskade texten. Hans jobb blev i sin tur betydligt lättare i och med att Anna, Mickan, Erik och Phil hade hjälp till med att kolla igenom allting innan.

Tack till alla gamla och nya Genetechare (eller är det STare nu för tiden?): Beata, Julia, Mahya, Johanna, Maja, David, Jonas, Emelie, Kim, Joseph, Ludvig, José, Niyaz, Stefania, Linda, Sailendra, Linnea, Kostas, Annelie och Patrik. Simon, Mengxiao, Emanuela, härligt att höra musiken flöda på andra sidan glaset. Tack också till Mårten för handledning under exjobbet och introduktionen till klättring. Per L, tågen går fortfarande inte. Du kan lika gärna stanna på en öl till. Tack till Max och Kicki för all hjälp med FFPE bestyren. NGI Production och NGI Application, kort sagt, Alla på Alfa 3, det är ni som gör det här planet till ”the premium floor” på SciLife.

Även om alfa 3 är the premium floor så finns det mycket premium folk i form av gamla kursare, kollegor och vänner på andra våningar och

45

AlbaNova så som Hanna, Lisa, Magnus, Fredrik E, Tarek, Anna H, Sejsing och Frida.

Det finns mycket gäng på SciLife som har gjort veckorna mycket roligare och lite svettigare. Jag tänker framförallt på: Löpargänget, Skidgänget, MTBgänget (med Skogsprinsen i spetsen), Racergänget och Klättergänget.

Tack till alla mina kära vänner på SciLife. Erik, vi har följts åt länge nu. I princip daglig kontakt under 1o års tid. Helt sinnessjukt vad mycket kul vi har. Anna, även vi har hängt sedan KTH tiden och haft mycket kul ihop. Även om du lämnat SciLife så är koppen kvar. Sanja, du fixade hela den underbara LA resan. Också, grymt att ha någon att klaga med när det känns jobbigt. Britta, Sanja kallar oss för ibland för tvillingar, vilket är oerhört smickrande för mig. Resor, kräftskivor, barhäng, helt fantastiskt. Mickan, sån himla tur för att du är med och orkar dra igång upptåg och trevligheter på och utanför jobbet. Johannes, killen som tränar enligt en fem års plan men även hinner med att vara sjukt trevlig och uppdaterad med nya tankar och idéer. Phil, we’re still going do that road trip and drink absurdly strong cider at the Cori Tap. Sverker och Benjamin, brädspel, ölbryggning eller snack över en fredagsöl - alltid trevligt. Tack också till, Fredrik, Sebastian, Joel, Emma, Conny, Robin, Guillermo, Olov ni har alla bidragit till den unikt trevliga miljö som är SciLife.

Tillslut, alla vänner på och utanför jobbet, Brorsan, Syrran, Mamma och Pappa, Tack!

46 Anders Jemt

References 1 Human Genome Sequencing Consortium I. Finishing the euchromatic sequence of the human genome. Nature 2004; 431: 931–945. 2 Lander ES. Initial impact of the sequencing of the human genome. Nature 2011; 470: 187–97. 3 June 2000 White House Event - National Human Genome Research Institute (NHGRI). https://www.genome.gov/10001356/ (accessed 11 Oct2016). 4 Human Genome Announcement at the White House (2000). 2000.https://www.youtube.com/watch?v=slRyGLmt3qc (accessed 11 Oct2016). 5 Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J et al. Initial sequencing and analysis of the human genome. Nature 2001; 409: 860–921. 6 Venter JC. The Sequence of the Human Genome. Science (80- ) 2001; 291: 1304– 1351. 7 Collins FS, Morgan M, Patrinos A. The Human Genome Project: lessons from large-scale biology. Science 2003; 300: 286–90. 8 Shendure J, Mitra RD, Varma C, Church GM. Advanced sequencing technologies: methods and goals. Nat Rev Genet 2004; 5: 335–344. 9 Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 2001; 409: 928–933. 10 Feingold E, Good P, Guyer M, Kamholz S, Liefer L, Wetterstrand K et al. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science (80- ) 2004; 306: 636–40. 11 International HapMap Consortium. A haplotype map of the human genome. Nature 2005; 437: 1299–320. 12 Durbin RM, Altshuler DL, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A et al. A map of human genome variation from population-scale sequencing. Nature 2010; 467: 1061–1073. 13 TEMIN HM, MIZUTANI S. Viral RNA-dependent DNA Polymerase: RNA- dependent DNA Polymerase in Virions of Rous Sarcoma Virus. Nature 1970; 226: 1211–1213. 14 BALTIMORE D. Viral RNA-dependent DNA Polymerase: RNA-dependent DNA Polymerase in Virions of RNA Tumour Viruses. Nature 1970; 226: 1209–1211. 15 CRICK FHC. Ideas on Protein Synthesis. 1956.https://profiles.nlm.nih.gov/ps/access/SCBBFT.pdf (accessed 9 Aug2016). 16 CRICK FH. On protein synthesis. Symp Soc Exp Biol 1958; 12: 138–63. 17 CRICK F. Central Dogma of Molecular Biology. Nature 1970; 227: 561–563. 18 Dahm R. Discovering DNA: Friedrich Miescher and the early years of nucleic acid research. Hum Genet 2008; 122: 565–581. 19 Choudhuri S. The Path from Nuclein to Human Genome: A Brief History of DNA with a Note on Human Genome Sequencing and Its Impact on Future Research in Biology. Bull Sci Technol Soc 2003; 23: 360–367. 20 Avery OT, MacLeod CM, McCarty M. STUDIES ON THE CHEMICAL NATURE OF THE SUBSTANCE INDUCING TRANSFORMATION OF PNEUMOCOCCAL TYPES : INDUCTION OF TRANSFORMATION BY A DESOXYRIBONUCLEIC ACID FRACTION ISOLATED FROM PNEUMOCOCCUS TYPE III. J Exp Med 1944; 79: 137–158. 21 Hershey AD, Chase M. INDEPENDENT FUNCTIONS OF VIRAL PROTEIN AND NUCLEIC ACID IN GROWTH OF BACTERIOPHAGE. J Gen Physiol 1952; 36: 39–56. 22 WATSON JD, CRICK FHC. Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid. Nature 1953; 171: 737–738. 23 FRANKLIN RE, GOSLING RG. Molecular Configuration in Sodium Thymonucleate. Nature 1953; 171: 740–741.

47

24 WILKINS MHF, STOKES AR, WILSON HR. Molecular Structure of Nucleic Acids: Molecular Structure of Deoxypentose Nucleic Acids. Nature 1953; 171: 738–740. 25 Chargaff E. Chemical specificity of nucleic acids and mechanism of their enzymatic degradation. Experientia 1950; 6: 201–209. 26 Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO et al. What is a gene, post-ENCODE? History and updated definition. Genome Res 2007; 17: 669– 681. 27 Auton A, Abecasis GR, Altshuler DM, Durbin RM, Abecasis GR, Bentley DR et al. A global reference for human genetic variation. Nature 2015; 526: 68–74. 28 Zarrei M, MacDonald JR, Merico D, Scherer SW. A copy number variation map of the human genome. Nat Rev Genet 2015; 16: 172–183. 29 Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A et al. A survey of best practices for RNA-seq data analysis. Genome Biol 2016; 17: 13. 30 Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A et al. Landscape of transcription in human cells. Nature 2012; 489: 101–108. 31 Graur D, Zheng Y, Price N, Azevedo RBR, Zufall RA, Elhaik E. On the immortality of television sets: ‘Function’ in the human genome according to the evolution-free gospel of encode. Genome Biol Evol 2013; 5: 578–590. 32 Doolittle WF. Is junk DNA bunk? A critique of ENCODE. Proc Natl Acad Sci U S A 2013; 110: 5294–5300. 33 Lagerkvist U. ‘Two out of three’: an alternative method for codon reading. Proc Natl Acad Sci 1978; 75: 1759–1762. 34 Uhlén M, Björling E, Agaton C, Szigyarto CA-K, Amini B, Andersen E et al. A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol Cell Proteomics 2005; 4: 1920–32. 35 Lundberg E, Fagerberg L, Klevebring D, Matic I, Geiger T, Cox J et al. Defining the transcriptome and proteome in three functionally different human cell lines. Mol Syst Biol 2010; 6: 450. 36 Wilhelm M, Schlegl J, Hahne H, Gholami AM, Lieberenz M, Savitski MM et al. Mass-spectrometry-based draft of the human proteome. Nature 2014; 509: 582– 587. 37 Edfors F, Danielsson F, Hallström BM, Käll L, Lundberg E, Pontén F et al. Gene‐ specific correlation of RNA and protein levels in human cells and tissues. Mol Syst Biol 2016; 12. 38 Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S et al. The Genotype- Tissue Expression (GTEx) project. Nat Genet 2013; 45: 580–585. 39 Lee JH, Daugharthy ER, Scheiman J, Kalhor R, Yang JL, Ferrante TC et al. Highly Multiplexed Subcellular RNA Sequencing in Situ. Science (80- ) 2014; 343: 1360– 1363. 40 Ke R, Mignardi M, Pacureanu A, Svedlund J, Botling J, Wählby C et al. In situ sequencing for RNA analysis in preserved tissue and cells. Nat Methods 2013; 10: 857–860. 41 Ståhl PL, Salmén F, Vickovic S, Lundmark A, Navarro JF, Magnusson J et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science (80- ) 2016; 353: 78 LP-82. 42 Fiers W, Contreras R, Duerinck F, Haegeman G, Iserentant D, Merregaert J et al. Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature 1976; 260: 500–507. 43 Sanger F, Brownlee GG. [43] A two-dimensional fractionation method for radioactive nucleotides. Methods Enzymol 1967; 12: 361–381. 44 Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes JC et al. Nucleotide sequence of bacteriophage [phi]X174 DNA. Nature 1977; 265: 687–695. 45 Sanger F, Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol 1975; 94: 441–448. 46 Hutchison CA. DNA sequencing: bench to bedside and beyond. Nucleic Acids Res

48 Anders Jemt

2007; 35: 6227–37. 47 Maxam AM, Gilbert W. A new method for sequencing DNA. Proc Natl Acad Sci U S A 1977; 74: 560–564. 48 Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 1977; 74: 5463–5467. 49 Heather JM, Chain B. The sequence of sequencers: The history of sequencing DNA. Genomics 2016; 107: 1–8. 50 Smith LM, Fung S, Hunkapiller MW, Hunkapiller TJ, Hood LE. The synthesis of oligonucleotides containing an aliphatic amino group at the 5’ terminus: synthesis of fluorescent DNA primers for use in DNA sequence analysis. Nucleic Acids Res 1985; 13: 2399–2412. 51 Prober JM, Trainor GL, Dam RJ, Hobbs FW, Robertson CW, Zagursky RJ et al. A system for rapid DNA sequencing with fluorescent chain-terminating . Science (80- ) 1987; 238: 336–341. 52 Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR et al. Fluorescence detection in automated DNA sequence analysis. Nature 1986; 321: 674–679. 53 Swerdlow H, Gesteland R. Capillary gel electrophoresis for rapid, high resolution DNA sequencing. Nucleic Acids Res 1990; 18: 1415–1419. 54 Ronaghi M, Uhlén M, Nyrén P. A Sequencing Method Based on Real-Time Pyrophosphate. Science (80- ) 1998; 281: 363–365. 55 Melamede R.J. Automatable process for sequencing nucleotide. 1985; US Patent. 56 Muir P, Li S, Lou S, Wang D, Spakowicz DJ, Salichos L et al. The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biol 2016; 17: 53. 57 Wetterstrand KA. DNA Sequencing Costs: Data. Natl. Hum. Genome Res. Inst. 2016.https://www.genome.gov/sequencingcostsdata/ (accessed 2 Sep2016). 58 Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005; 437: 376–80. 59 GS FLX+ System. http://454.com/products/gs-flx-system/ (accessed 22 Aug2016). 60 Roche Shutting Down 454 Sequencing Business. GenomeWeb. 2013.https://www.genomeweb.com/sequencing/roche-shutting-down-454- sequencing-business (accessed 22 Aug2016). 61 Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next- generation sequencing technologies. Nat Rev Genet 2016; 17: 333–51. 62 Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008; 456: 53–9. 63 Sequencing Systems | Sequencer Comparison Table. Illumina. http://www.illumina.com/systems/sequencing.html (accessed 28 Aug2016). 64 HiSeq 3000/HiSeq 4000 Technology. http://www.illumina.com/systems/hiseq- 3000-4000/technology.html (accessed 29 Aug2016). 65 Shen M-JR, Boutell JM, Stephens KM, Ronaghi M, Gunderson KL, Venkatesan BM et al. Kinetic exclusion amplification of nucleic acid libraries. 2013. 66 2-Channel SBS Technology | Faster sequencing and data acquisition. http://www.illumina.com/technology/next-generation-sequencing/sequencing- technology/2-channel-sbs.html (accessed 29 Aug2016). 67 SOLiD Next Generation Sequencing. https://www.thermofisher.com/se/en/home/life-science/sequencing/next- generation-sequencing/solid-next-generation-sequencing.html (accessed 15 Nov2016). 68 Lee JH, Daugharthy ER, Scheiman J, Kalhor R, Ferrante TC, Terry R et al. Fluorescent in situ sequencing (FISSEQ) of RNA for gene expression profiling in intact cells and tissues. Nat Protoc 2015; 10: 442–458.

49

69 Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature 2011; 475: 348–352. 70 Ion S5 and Ion S5 XL Next-Generation Sequencing System Specifications. https://www.thermofisher.com/se/en/home/life-science/sequencing/next- generation-sequencing/ion-torrent-next-generation-sequencing-workflow/ion- torrent-next-generation-sequencing-run-sequence/ion-s5-ngs-targeted- sequencing/ion-s5-specifications.html (accessed 15 Nov2016). 71 Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG et al. Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays. Science (80- ) 2010; 327: 78–81. 72 BGISEQ-500. http://seq500.com/en/portal/Sequencer.shtml (accessed 15 Nov2016). 73 Harris TD, Buzby PR, Babcock H, Beer E, Bowers J, Braslavsky I et al. Single- molecule DNA sequencing of a viral genome. Science 2008; 320: 106–9. 74 Bowers J, Mitchell J, Beer E, Buzby PR, Causey M, Efcavitch JW et al. Virtual terminator nucleotides for next-generation DNA sequencing. Nat Methods 2009; 6: 593–5. 75 Thompson JF, Steinmann KE. Single Molecule Sequencing with a HeliScope Genetic Analysis System. In: Current Protocols in Molecular Biology. John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2010 doi:10.1002/0471142727.mb0710s92. 76 Pushkarev D, Neff NF, Quake SR. Single-molecule sequencing of an individual human genome. Nat Biotechnol 2009; 27: 847–50. 77 Ozsolak F, Platt AR, Jones DR, Reifenberger JG, Sass LE, McInerney P et al. Direct RNA sequencing. Nature 2009; 461: 814–8. 78 Thompson JF, Steinmann KE. Single molecule sequencing with a HeliScope genetic analysis system. Curr Protoc Mol Biol 2010; Chapter 7: Unit7.10. 79 Helicos BioSciences Files for Chapter 11 Bankruptcy Protection. GenomWeb. 2012.https://www.genomeweb.com/sequencing/helicos-biosciences-files-chapter- 11-bankruptcy-protection (accessed 27 Sep2016). 80 Karow J. PacBio Ships First Two Commercial Systems; Order Backlog Grows to 44. GenomeWeb. 2011.https://www.genomeweb.com/sequencing/pacbio-ships-first- two-commercial-systems-order-backlog-grows-44 (accessed 27 Sep2016). 81 Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G et al. Real-time DNA sequencing from single polymerase molecules. Science 2009; 323: 133–8. 82 Levene MJ, Korlach J, Turner SW, Foquet M, Craighead HG, Webb WW. Zero- mode waveguides for single-molecule analysis at high concentrations. Science 2003; 299: 682–6. 83 Roberts RJ, Carneiro MO, Schatz MC. The advantages of SMRT sequencing. Genome Biol 2013; 14: 405. 84 Pacific Biosciences. Smrt Sequencing: Read Lengths. 2016.http://www.pacb.com/smrt-science/smrt-sequencing/read-lengths/ (accessed 27 Sep2016). 85 Heger M. PacBio Launches Higher-Throughput, Lower-Cost Single-Molecule Sequencing System. GenomWeb. 2015.https://www.genomeweb.com/business- news/pacbio-launches-higher-throughput-lower-cost-single-molecule-sequencing- system (accessed 27 Sep2016). 86 Eisenstein M. Oxford Nanopore announcement sets sequencing sector abuzz. Nat Biotechnol 2012; 30: 295–296. 87 Jain M, Fiddes IT, Miga KH, Olsen HE, Paten B, Akeson M. Improved data analysis for the MinION nanopore sequencer. Nat Meth 2015; 12: 351–356. 88 Clarke J, Wu H-C, Jayasinghe L, Patel A, Reid S, Bayley H. Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol 2009; 4: 265–270. 89 Ip CLC, Loose M, Tyson JR, de Cesare M, Brown BL, Jain M et al. MinION

50 Anders Jemt

Analysis and Reference Consortium: Phase 1 data release and analysis. F1000Research 2015; 4: 1075. 90 Brown CG, Clarke J. Nanopore development at Oxford Nanopore. Nat Biotechnol 2016; 34: 810–811. 91 Oxford Nanopore Technologies. Update: New ‘R9’ nanopore for faster, more accurate sequencing, and new ten minute preparation kit. 2016.https://nanoporetech.com/about-us/news/update-new-r9-nanopore-faster- more-accurate-sequencing-and-new-ten-minute-preparation (accessed 28 Sep2016). 92 Oxford Nanopore Updates: Reveals the nanopore used in Nanopore devices. Next Gen Seek. 2016.http://nextgenseek.com/2016/03/oxford-nanopore-updates-from- technology-focus-by-clive-g-brown/ (accessed 28 Sep2016). 93 Urban JM, Bliss J, Lawrence CE, Gerbi SA. Sequencing ultra-long DNA molecules with the Oxford Nanopore MinION. 2015 doi:10.1101/019281. 94 Quick J, Loman NJ, Duraffour S, Simpson JT, Severi E, Cowley L et al. Real-time, portable genome sequencing for Ebola surveillance. Nature 2016; 530: 228–232. 95 Hoenen T, Groseth A, Rosenke K, Fischer RJ, Hoenen A, Judson SD et al. Nanopore Sequencing as a Rapidly Deployable Ebola Outbreak Tool. Emerg Infect Dis 2016; 22. doi:10.3201/eid2202.151796. 96 Rainey K. First DNA Sequencing in Space a Game Changer. 2016.http://www.nasa.gov/mission_pages/station/research/news/dna_sequencin g (accessed 5 Nov2016). 97 Garalde DR, Snell EA, Jachimowicz D, Heron AJ, Bruce M, Lloyd J et al. Highly parallel direct RNA sequencing on an array of nanopores. 2016 doi:10.1101/068809. 98 Rand AC, Jain M, Eizenga J, Musselman-Brown A, Olsen HE, Akeson M et al. Cytosine Variant Calling with High-throughput Nanopore Sequencing. 2016 doi:10.1101/047134. 99 Simpson JT, Workman R, Zuzarte PC, David M, Dursi LJ, Timp W. Detecting DNA Methylation using the Oxford Nanopore Technologies MinION sequencer. 2016 doi:10.1101/047142. 100 Gao Y, Deng L, Yan Q, Gao Y, Wu Z, Cai J et al. Single molecule targeted sequencing for cancer gene mutation detection. Sci Rep 2016; 6: 26110. 101 Esfandyarpour H, Ronaghi M. Heat and pH measurement for sequencing of DNA. 2011; : 23. 102 Karow J. Sigma-Aldrich Enters Co-Marketing Agreement with GenapSys for Genius Sequencer. GenomeWeb. 2015.https://www.genomeweb.com/sequencing- technology/sigma-aldrich-enters-co-marketing-agreement-genapsys-genius- sequencer (accessed 5 Oct2016). 103 Fuller CW, Kumar S, Porel M, Chien M, Bibillo A, Stranges PB et al. Real-time single-molecule electronic DNA sequencing by synthesis using polymer-tagged nucleotides on a nanopore array. Proc Natl Acad Sci U S A 2016; 113: 5233–8. 104 Roche Acquires Nanopore Sequencing Firm Genia Technologies for up to $350M. GenomWeb. 2014.https://www.genomeweb.com/sequencing/roche-acquires- nanopore-sequencing-firm-genia-technologies-350m (accessed 5 Oct2016). 105 Heger M. Genia Publishes Proof of Principle Study, Says Commercial System Will Serve Clinical Market. GenomWeb. 2016.https://www.genomeweb.com/sequencing-technology/genia-publishes- proof-principle-study-says-commercial-system-will-serve (accessed 5 Oct2016). 106 Heger M. At AGBT, Illumina Provides Additional Details on Project Firefly. GenomeWeb. 2016.https://www.genomeweb.com/sequencing-technology/agbt- illumina-provides-additional-details-project-firefly. 107 Ashford M. NanoString Shares First Details of New Sequencing Chemistry at AGBT. GenomeWeb. 2016.https://www.genomeweb.com/sequencing- technology/nanostring-shares-first-details-new-sequencing-chemistry-agbt

51

(accessed 9 Oct2016). 108 Li J, Stein D, McMullan C, Branton D, Aziz MJ, Golovchenko JA. Ion-beam sculpting at nanometre length scales. Nature 2001; 412: 166–169. 109 Agah S, Zheng M, Pasquali M, Kolomeisky AB. DNA sequencing by nanopores: advances and challenges. J Phys D Appl Phys 2016; 49: 413001. 110 Heerema SJ, Dekker C. Graphene nanodevices for DNA sequencing. Nat Nanotechnol 2016; 11: 127–136. 111 S.A.Miller DDD and HFP. A simple salting out procedure for extractin DNA from humam nnucleated cells. Nucleic Acids Res 1988; 15: 1215. 112 Vogelstein B, Gillespie D. Preparative and analytical purification of DNA from agarose. Proc Natl Acad Sci U S A 1979; 76: 615–619. 113 Boom R, Sol CJ, Salimans MM, Jansen CL, Wertheim-van Dillen PM, van der Noordaa J. Rapid and simple method for purification of nucleic acids. J Clin Microbiol 1990; 28: 495–503. 114 Peirson SN, Butler JN. RNA Extraction From Mammalian Tissues. In: Rosato E (ed). . Humana Press: Totowa, NJ, 2007, pp 315–327. 115 Vickovic S, Ahmadian A, Lewensohn R, Lundeberg J. Toward Rare Blood Cell Preservation for RNA Sequencing. 2015 doi:10.1016/j.jmoldx.2015.03.009. 116 Weber DG, Casjens S, Rozynek P, Lehnert M, Zilch-Schöneweis S, Bryk O et al. Assessment of mRNA and microRNA Stabilization in Peripheral Human Blood for Multicenter Studies and Biobanks. Biomark Insights 2010; 5: 95–102. 117 Schroeder A, Mueller O, Stocker S, Salowsky R, Leiber M, Gassmann M et al. The RIN: an RNA integrity number for assigning integrity values to RNA measurements. BMC Mol Biol 2006; 7: 3. 118 Gallagher SR. Quantitation of DNA and RNA with absorption and fluorescence spectroscopy. Curr Protoc Mol Biol 2011; Appendix 3: 3D. 119 Quach N, Goodman MF, Shibata D. In vitro mutation artifacts after formalin fixation and error prone translesion synthesis during PCR. BMC Clin Pathol 2004; 4: 1. 120 Gilbert MTP, Haselkorn T, Bunce M, Sanchez JJ, Lucas SB, Jewell LD et al. The isolation of nucleic acids from fixed, paraffin-embedded tissues-which methods are useful when? PLoS One 2007; 2: e537. 121 Evers DL, Fowler CB, Cunningham BR, Mason JT, O’Leary TJ. The effect of formaldehyde fixation on RNA: optimization of formaldehyde adduct removal. J Mol Diagn 2011; 13: 282–8. 122 Adey A, Morrison HG, (no last name) A, Xun X, Kitzman JO, Turner EH et al. Rapid, low-input, low-bias construction of shotgun fragment libraries by high- density in vitro transposition. Genome Biol 2010; 11: R119. 123 Clark JM. Novel non-templated nucleotide addition reactions catalyzed by procaryotic and eucaryotic DNA polymerases. Nucleic Acids Res 1988; 16: 9677– 9686. 124 Mullis K, Faloona F, Scharf S, Saiki R, Horn G, Erlich H. Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction. Cold Spring Harb Symp Quant Biol 1986; 51 Pt 1: 263–73. 125 Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Methods 2009; 6: 291–5. 126 Head SR, Kiyomi Komori H, LaMere SA, Whisenant T, Van Nieuwerburgh F, Salomon DR et al. Library construction for next-generation sequencing: Overviews and challenges. Biotechniques 2014; 56: 61–77. 127 DeAngelis MM, Wang DG, Hawkins TL. Solid-phase reversible immobilization for the isolation of PCR products. Nucleic Acids Res 1995; 23: 4742–4743. 128 Lis JT, Schleif R. Size fractionation of double-stranded DNA by precipitation with polyethylene glycol. Nucleic Acids Res 1975; 2: 383–389. 129 Lennon NJ, Lintner RE, Anderson S, Alvarez P, Barry A, Brockman W et al. A

52 Anders Jemt

scalable, fully automated process for construction of sequence-ready barcoded libraries for 454. Genome Biol 2010; 11: R15. 130 Quail MA, Gu Y, Swerdlow H, Mayho M. Evaluation and optimisation of preparative semi-automated electrophoresis systems for Illumina library preparation. Electrophoresis 2012; 33: 3521–3528. 131 Rabbani B, Tekin M, Mahdieh N. The promise of whole-exome sequencing in medical genetics. J Hum Genet 2014; 59: 5–15. 132 Andrews KR, Good JM, Miller MR, Luikart G, Hohenlohe PA. Harnessing the power of RADseq for ecological and evolutionary genomics. Nat Rev Genet 2016; in press: 81–92. 133 Furey TS. ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions. Nat Rev Genet 2012; 13: 840–852. 134 Denker A, de Laat W. The second decade of 3C technologies: detailed insights into nuclear organization. Genes Dev 2016; 30: 1357–82. 135 Yong W-S, Hsu F-M, Chen P-Y. Profiling genome-wide DNA methylation. Epigenetics Chromatin 2016; 9: 26. 136 Flusberg BA, Webster D, Lee J, Travers K, Olivares E, Clark TA et al. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat Methods 2010; 7: 461–465. 137 Voskoboynik A, Neff NF, Sahoo D, Newman AM, Pushkarev D, Koh W et al. The genome sequence of the colonial chordate, Botryllus schlosseri. Elife 2013; 2: e00569. 138 Kuleshov V, Xie D, Chen R, Pushkarev D, Ma Z, Blauwkamp T et al. Whole-genome haplotyping using long reads and statistical methods. Nat Biotechnol 2014; 32: 261–266. 139 Weisenfeld NI, Kumar V, Shah P, Church D, Jaffe DB. Direct determination of diploid genome sequences. bioRxiv 2016. doi:10.1101/070425. 140 Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 2012; 13: 36–46. 141 Chaisson MJP, Huddleston J, Dennis MY, Sudmant PH, Malig M, Hormozdiari F et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 2015; 517: 608–611. 142 Navin NE. The first five years of single-cell cancer genomics and beyond. Genome Res 2015; 25: 1499–507. 143 Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat Rev Genet 2016; 17: 175–88. 144 Zong C, Lu S, Chapman AR, Xie XS. Genome-Wide Detection of Single-Nucleotide and Copy-Number Variations of a Single Human Cell. Science (80- ) 2012; 338: 1622 LP-1626. 145 PicoPLEX® WGA Kit. http://rubicongenomics.com/products/picoplex/ (accessed 19 Oct2016). 146 Huang L, Ma F, Chapman A, Lu S, Xie XS. Single-Cell Whole-Genome Amplification and Sequencing: Methodology and Applications. Annu Rev Genomics Hum Genet 2015; 16: 79–102. 147 Vilfan ID, Tsai Y-C, Clark TA, Wegener J, Dai Q, Yi C et al. Analysis of RNA base modification and structural rearrangement by single-molecule real-time detection of reverse transcription. J Nanobiotechnology 2013; 11: 8. 148 Adiconis X, Borges-Rivera D, Satija R, DeLuca DS, Busby MA, Berlin AM et al. Comparative analysis of RNA sequencing methods for degraded or low-input samples. Nat Methods 2013; 10: 623–629. 149 Sigurgeirsson B, Emanuelsson O, Lundeberg J. Sequencing degraded RNA addressed by 3’ tag counting. PLoS One 2014; 9: e91851. 150 Forconi M, Herschlag D. Chapter 5 – Metal Ion-Based RNA Cleavage as a Structural Probe. In: Methods in Enzymology. 2009, pp 91–106. 151 Gubler U, Hoffman BJ. A simple and very efficient method for generating cDNA

53

libraries. Gene 1983; 25: 263–269. 152 Parkhomchuk D, Borodina T, Amstislavskiy V, Banaru M, Hallen L, Krobitsch S et al. Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Res 2009; 37: e123–e123. 153 Cabanski CR, Magrini V, Griffith M, Griffith OL, McGrath S, Zhang J et al. cDNA Hybrid Capture Improves Transcriptome Analysis on Low-Input and Archived Samples. J Mol Diagnostics 2014; 16: 440–451. 154 Zhuang F, Fuchs RT, Sun Z, Zheng Y, Robb GB. Structural bias in T4 RNA ligase- mediated 3′-adapter ligation. Nucleic Acids Res 2012; 40: e54–e54. 155 Wills QF, Livak KJ, Tipping AJ, Enver T, Goldson AJ, Sexton DW et al. Single-cell gene expression analysis reveals genetic associations masked in whole-tissue experiments. Nat Biotech 2013; 31: 748–752. 156 Kivioja T, Vähärautio A, Karlsson K, Bonke M, Enge M, Linnarsson S et al. Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods 2011; 9: 72–74. 157 Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods 2014; 11: 163– 166. 158 Tang F, Lao K, Surani MA. Development and applications of single-cell transcriptome analysis. Nat Methods 2011; 8. doi:10.1038/nmeth.1557. 159 Hashimshony T, Wagner F, Sher N, Yanai I. CEL-Seq: single-cell RNA-Seq by multiplexed linear amplification. Cell Rep 2012; 2: 666–73. 160 Hashimshony T, Senderovich N, Avital G, Klochendler A, de Leeuw Y, Anavy L et al. CEL-Seq2: sensitive highly-multiplexed single-cell RNA-Seq. Genome Biol 2016; 17: 77. 161 Jaitin DA, Kenigsberg E, Keren-Shaul H, Elefant N, Paul F, Zaretsky I et al. Massively Parallel Single-Cell RNA-Seq for Marker-Free Decomposition of Tissues into Cell Types. Science (80- ) 2014; 343. 162 Vickovic S, Ståhl PL, Salmén F, Giatrellis S, Westholm JO, Mollbrink A et al. Massive and parallel expression profiling using microarrayed single-cell sequencing. Nat Commun 2016; 7: 13182. 163 Picelli S, Björklund ÅK, Faridani OR, Sagasser S, Winberg G, Sandberg R. Smart- seq2 for sensitive full-length transcriptome profiling in single cells. Nat Methods 2013; 10: 1096–1098. 164 Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 2015; 161: 1202–1214. 165 Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R et al. Massively parallel digital transcriptional profiling of single cells. bioRxiv 2016; : 65912. 166 Ziegenhain C, Vieth B, Parekh S, Reinius B, Smets M, Leonhardt H et al. Comparative analysis of single-cell RNA sequencing methods. bioRxiv 2016. doi:10.1101/035758. 167 Dey SS, Kester L, Spanjaard B, Bienko M, van Oudenaarden A. Integrated genome and transcriptome sequencing of the same cell. Nat Biotechnol 2015; 33: 285–289. 168 Macaulay IC, Haerty W, Kumar P, Li YI, Hu TX, Teng MJ et al. G&T-seq: parallel sequencing of single-cell genomes and transcriptomes. Nat Methods 2015; 12: 519–22. 169 Reuter JA, Spacek D V, Pai RK, Snyder MP. Simul-seq: combined DNA and RNA sequencing for whole-genome and transcriptome profiling. Nat Methods 2016. doi:10.1038/nmeth.4028. 170 Angermueller C, Clark SJ, Lee HJ, Macaulay IC, Teng MJ, Hu TX et al. Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nat Methods 2016; 13: 229–32. 171 Lubeck E, Coskun AF, Zhiyentayev T, Ahmad M, Cai L. Single-cell in situ RNA profiling by sequential hybridization. Nat Methods 2014; 11: 360–361.

54 Anders Jemt

172 Chen KH, Boettiger AN, Moffitt JR, Wang S, Zhuang X. Spatially resolved, highly multiplexed RNA profiling in single cells. Science (80- ) 2015; 348: aaa6090. 173 Nilsson M, Malmgren H, Samiotaki M, Kwiatkowski M, Chowdhary B, Landegren U. Padlock probes: circularizing oligonucleotides for localized DNA detection. Science (80- ) 1994; 265: 2085–2088. 174 Achim K, Pettit J-B, Saraiva LR, Gavriouchkina D, Larsson T, Arendt D et al. High- throughput spatial mapping of single-cell RNA-seq data to tissue of origin. Nat Biotechnol 2015; 33: 503–9. 175 Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nat Biotech 2015; 33: 495–502. 176 Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 2010; 38: 1767–71. 177 Andrews S. FASTQC A Quality Control tool for High Throughput Sequence Data. Babraham Inst. doi:2016-10-14. 178 Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 2011; 17: 10. 179 Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014; 30: 2114–20. 180 Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA- MEM. arXiv Prepr arXiv 2013; : 3. 181 Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012; 9: 357–359. 182 McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010; 20: 1297–303. 183 Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013; 29: 15–21. 184 Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 2013; 14: R36. 185 Anders S, Pyl PT, Huber W. HTSeq - A Python framework to work with high- throughput sequencing data. Bioinformatics 2014; 31: 166–169. 186 Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 2014; 30: 923–30. 187 Love MI, Huber W, Anders S, Lönnstedt I, Speed T, Robinson M et al. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014; 15: 550. 188 Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010; 26: 139–40. 189 Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010; 28: 511–515. 190 Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 2012; 31: 46–53. 191 Picard Tools - By Broad Institute. http://broadinstitute.github.io/picard/ (accessed 23 Oct2016). 192 Sayols S, Scherzinger D, Klein H. dupRadar: a Bioconductor package for the assessment of PCR artifacts in RNA-Seq data. bioRxiv 2016; : 46243. 193 Check Hayden E. Is the $1,000 genome for real? Nature 2014. doi:10.1038/nature.2014.14530. 194 23andMe’s Data Deal. GenomeWeb. 2015.

55

https://www.genomeweb.com/scan/23andmes-data-deal (accessed 30 Oct2016). 195 Human Cell Atlas. https://www.humancellatlas.org/ (accessed 29 Oct2016).

56