<<

Technologies for Single Analysis Erik Borgström

© Erik Borgström Stockholm 2016

Royal Institute of Technology (KTH) School of Division of Technology

Science for Life Laboratory SE-171 21 Solna Sweden

Printed by Universitetsservice US-AB Drottning Kristinas väg 53B SE‐100 44 Stockholm Sweden

ISBN 978-91-7595-842-2 TRITA-BIO Report 2016:1 ISSN 1654-2312

Erik Borgström

Abstract During the last decade high throughput DNA of single cells has evolved from an idea to one of the most high profile fields of research. Much of this development has been possible due to the dramatic reduction in costs for massively parallel sequencing. The four papers included in this thesis describe or evaluate technological advancements for high throughput DNA sequencing of single cells and single molecules.

As the sequencing technologies improve, more samples are analyzed in parallel. In paper 1, an automated procedure for preparation of samples prior to massively parallel sequencing is presented. The method has been applied to several projects and further development by others has enabled even higher sample throughputs. Amplification of single cell is a prerequisite for . Paper 2 evaluates four commercially available kits for whole genome amplification of single cells. The results show that coverage of the genome differs significantly among the protocols and as expected this has impact on the downstream analysis. In Paper 3, single cell genotyping by exome sequencing is used to confirm the presence of fat cells derived from donated bone marrow within the recipients’ fat tissue. Close to hundred single cells were exome sequenced and a subset was validated by . In the last paper, a new method for phasing (i.e. determining the physical connection of variant ) is presented. The method barcodes amplicons from single molecules in emulsion droplets. The barcodes can then be used to determine which variants were present on the same original DNA molecule. The method is applied to two variable regions in the bacterial 16S gene in a metagenomic sample.

Thus, two of the papers (1 and 4) present development of new methods for increasing the throughput and information content of data from massively parallel sequencing. Paper 2 evaluates and compares currently available methods and in paper 3, a biological question is answered using some of these tools.

Keywords: DNA, sequencing, single molecule, single cell, whole genome amplification, exome sequencing, emulsions, barcoding, phasing.

Sammanfattning Under det senaste decenniet har storskalig DNA-sekvensering av enskilda celler utvecklats från bara en idé till att bli ett av de mest uppmärksammade forskningsområdena. En stor del av denna utveckling har möjliggjorts av den dramatiska reduceringen av kostnaderna för denna typ av analys. De fyra artiklar som ingår i den här avhandlingen beskriver eller utvärderar tekniska framsteg för storskalig DNA- sekvensering av enskilda celler och enskilda molekyler.

Allteftersom sekvenseringsteknologierna förbättras ökar också antalet prover som kan analyseras parallellt. I artikel 1 presenteras en automatiserad metod för preparering av prover inför storskalig DNA- sekvensering. Tekniken har tillämpats inom flera projekt och har senare också vidareutvecklats (av kollegor) vilket gjort det möjligt att parallellt analysera ett ännu större antal prover. Amplifiering av DNA från enskilda celler är en förutsättning för sekvensanalys av deras genom. Artikel 2 utvärderar fyra kommersiellt tillgängliga kit för amplifiering av fullständiga genom från enskilda celler. Resultaten visar att täckningen av genomet skiljer sig väsentligt mellan amplifierings-produkterna och som väntat påverkar detta efterföljande analyssteg. I artikel 3 utförs genotypning av enskilda celler med hjälp av exom-sekvensering. Detta för att i en donationsmottagares fettvävnad bekräfta förekomsten av fettceller som härstammar från donerad benmärg. Nära hundra enskilda celler exom-sekvenserades. En delmängd av dessa celler validerades med hjälp av sekvensering av genomet. I den sista artikeln presenteras en ny metod för bestämning av fysisk koppling av alleler för specifika sekvens-varianter över en hel DNA-molekyl. Amplikon från enstaka molekyler skapas i emulsionsdroppar och märks med specifika DNA- sekvenser. Efter sekvensering kan denna märkning användas för att bestämma vilka av varianternas alleler som var närvarande på samma ursprungliga DNA-molekyl. Metoden tillämpas på två variabla regioner i den bakteriella 16S genen i ett metagenomiskt prov.

I två av artiklarna (1 och 4) presenteras nya metoder som är utvecklade för att öka genomströmningen av prover samt öka informationsinnehållet i det data som kommer från den efterkommande storskaliga DNA- sekvenseringen. Artikel 2 utvärderar och jämför tillgängliga metoder för amplifiering av hela det genomiska DNA’t från enstaka celler. Slutligen, i artikel 3, besvaras en biologisk frågeställning med hjälp av storskalig DNA-sekvensering av enstaka celler.

Erik Borgström

List of publications 1. Erik Borgström, Sverker Lundin, Joakim Lundeberg. (2011) Large Scale Generation for High Throughput Sequencing. PLoS ONE 6(4): e19119. doi:10.1371/journal.pone.0019119

2. Erik Borgström, Marta Paterlini, Jeff Mold, Jonas Frisen, Joakim Lundeberg. Comparison of Whole Genome Amplification Techniques for Single Cell (Exome) Sequencing. Manuscript.

3. Mikael Rydén, Mehmet Uzunel, Joanna L. Hård, Erik Borgström, Jeff E. Mold, Erik Arner, Niklas Mejhert, Daniel P. Andersson, Yvonne Widlund, Moustapha Hassan, Christina V. Jones, Kirsty L. Spalding, Britt-Marie Svahn, Afshin Ahmadian, Jonas Frisén, Samuel Bernard, Jonas Mattsson, Peter Arner. (2015). Transplanted Bone Marrow-Derived Cells Contribute to Human Adipogenesis. Cell Metabolism, 22(3), 408–417. doi:http://dx.doi.org/10.1016/j.cmet.2015.06.011

4. Erik Borgström, David Redin, Sverker Lundin, Emelie Berglund, Anders F. Andersson, Afshin Ahmadian. (2015). Phasing of single DNA molecules by massively parallel barcoding. Communications, 6, 7173. doi:10.1038/ncomms8173

Contents                                                                                                                                              !         "             # !"                  #    !"            #                 $ % &   

Erik Borgström

 #    ! #      !         #      !"  "  !        '      

Erik Borgström

Uniqueness and the Ultimate Resolution Your fingerprint is unique, it can be used to unlock your smartphone or to identify you when you travel abroad. Likewise, your genome is unique and might in the future be used in the same manner as your fingerprint. Still, your genome contains much more information than your fingerprint. It is the basis for how the cells in your body function and respond to influence from the world around you.

Each cell in your body harbours a copy of your genome, two sets of all the 23 human chromosomes1. One set inherited from each of your parents, enclosed together in the first of your cells. This cell has then divided thousands of times to form the approximately thirty seven thousand billions (3.72e13) of cells in your body2.

On rare occasions during cell division an error occurs while copying the genome, which leads to one of the two resulting cells getting a genome slightly different from all the others. Due to these so-called replication errors, each of the cells in your body is unique, slightly different from all the others. Most changes will never be noticed, though some of the cells might acquire changes that lead to cellular malfunction or disease, such as . Therefore it is important to be able to analyze not only the average present in an individual's body but also the precise genome present in a given cell. Ultimately one would desire to be able to deduce on which molecule, of the two sets of within the cell, the error has occurred.

This thesis aims to be a part of the technological advance leading to this ultimate resolution of biological data.

The four papers included in the thesis either address one of the main obstacles encountered during single cell DNA analysis and/or are applications thereof. In Paper 1 the throughput of library generation for DNA sequencing is addressed while in Paper 2 amplification techniques prior to sequencing of single cells are evaluated. In Paper 3 a high throughput single cell study is conducted and in Paper 4 a method for single molecule analysis by massively parallel sequencing is presented.

1

The cell and its content.

A Compartmentalization The term “cell” stems from the book published by in 1665. In Micrographia Hooke describes his observations of various matters through microscopes. While examining wood and cork he mentions observing “little boxes”, “pores”, “caverns” and “cells” within the materials3. The word cell comes from the Latin Cella meaning “small room”4 and is a very suitable name for these fundamental biological compartments. The cell harbours the biochemical reactions considered to be the basis for life and separates them from the surroundings by compartmentalization. This fundamental type of compartmentalization allows the cells to manipulate their microenvironment and is the physical link between and and thereby a fundamental building block in the process of (it has also inspired molecular techniques such as emulsion PCR, described later). Cellular life ranges from relatively simple unicellular prokaryotic life forms to very complex eukaryotic multicellular organisms like . The great diversity of cellular life is reflected by a diverse range of possible molecular components. The molecular components available to a cell are largely determined by the that are present within the cell. The full collection of genes in a cell is harboured within the genome and is encoded in the structured sequence of deoxyribonucleic acid (DNA) polymers. Information from active parts of the genome is copied into ribonucleic acid (RNA) molecules called transcripts, collectively known as the . Some molecules in the transcriptome are translated into amino acid polymers, known as . The molecules are responsible for carrying out specific functions inside a cell and the presence of a specific protein can for example determine if an organism is able to metabolize a certain nutrient or not. The direction of the information flow within cells, from nucleic acids (the genome through the transcriptome) to functional protein molecules in the proteome, is called the central dogma of molecular biology5. Each cell in an organism encapsulates the three main components present in the encoding of life; the genes within the genome, the transcripts within the transcriptome and finally the proteins in the proteome. Compartmentalized together with a plethora of metabolites and other biologically active molecules they

2 Erik Borgström

form the basis for all cellular life as we know it and each of the three components are described in brief below.

The DNA Deoxyribonucleic acid, DNA, was first isolated by Friedrich Miescher in the middle of the 19th century. Many findings and theories about its chemical composition and role as the carrier of genetic information were presented during the latter half of the 19th century and the first half of the 20th. In 1953 Watson and Crick presented the structure of the DNA molecule and thereby paved the way for our understanding of how the genetic material is able to be replicated and propagated to following generations6. As suggested by Watson and Crick the (double stranded) DNA molecule consist of “two helical chains each coiled around the same axis”. The chains consist of alternating phosphate and deoxyribose sugar groups whereof to each sugar there is a nucleobase attached. The bases are located so that they pair to each other by hydrogen bonding and thereby hold the two chains together. The base pairing is specific where adenine (A) always pairs to thymine (T) and guanine (G) pairs to cytosine (C)7.

The Genes and the Genome Studies on inheritance were initially performed by in the mid 19th century. Without using the term gene, Mendel described some of the basic principles of inheritance e.g. the concept of dominant and recessive traits8. The word gene was first used in 1909 after the rediscovery of Mendel's work. Initially the definition was “discrete unit of heredity”. As the understanding deepened, the concept of a gene was developed from this early idea through a number of intermediate and more strict definitions. At the end of the it ended up in more open definitions such as “a DNA segment that contributes to phenotype/function”. A popular metaphor in the recent decades has been to think of genes as subroutines within a computer operating system, using transcription and translation as means to run the subroutines. A recent definition is “a gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products”9. The definition of a gene is constantly evolving, large parts of the genome are transcribed and regulatory regions might be considered. Furthermore the definition of function is also debated. Still, common denominators among

3

the definitions are that a gene is a subset of genomic sequence(s) with a “function” and its inherent coupling to heritability.

A Genome is “an organism’s complete set of DNA, including all of its genes”10 (there are also organisms with RNA as genetic material, eg. RNA ). The first organism to get its genome sequenced was Bacteriophage MS211. The bacteriophage has a 3.5kb RNA genome and the sequence of the final gene of the genome was published in 197612. This was soon followed by the first DNA genome, the 5.4kb genome of bacteriophage phiX, sequenced by ’s “plus and minus method” in 197713 (prior to the invention of his chain termination method). The Human (HGP) was initiated in October 1990. The three giga-base human genome was 25 times larger than any previously sequenced genome and eight times larger than all previously sequenced genomes together. In February 2001, two draft assemblies of the sequence were published simultaneously by the public HGP14 and a privately funded initiative15. The “finished” genome sequence was published in 200316. Several projects aiming at variant discovery and functional annotation of the sequence followed the completion of the human genome. A few examples are the HapMap Project17 and 1000 genomes project18, both with the aim of cataloging human variations, as well as the ENCODE project19 assigning functional annotations for large parts of the genome.

The RNA The ribonucleic acid (RNA) molecules within a cell are generally the product of transcription from DNA. During transcription active parts of the genome are copied into complementary RNA molecules called transcripts. RNA is structurally similar to DNA though it has a very different role in the cell. The two fundamental chemical differences of RNA compared to DNA is firstly an extra hydroxyl group on each sugar moiety of the backbone and secondly the presence of the uracil bases instead of thymine bases. Just as DNA, the RNA molecules can hybridize to themselves, other RNA species or DNA and function in a double stranded fashion. However, as opposed to DNA, it often appears in single stranded form (as is the case for messenger ). The transcribed RNA molecules that later are translated into proteins are termed messenger RNA (mRNA). Apart from mRNAs there are many

4 Erik Borgström

additional types of functional RNA molecules in a cell, each with a distinct function for example transfer RNAs, ribosomal RNAs, non- coding RNAs and many more20.

The Transcripts and the Transcriptome In a cell, RNA polymerase produces primary transcript from active genes. Depending on the organism and the function of the gene, the primary transcripts can be modified into more mature forms by processes such as capping, poly A tailing and splicing. The mature transcripts are then translated into proteins or fulfill other functions in the cell (for example in regulation).

All cells within an organism have the same overall genome (except some rare and potentially impactful somatic variants). Still, cells from separate tissues may be very different from each other. This is due to the fact that only a subset of the genome is transcribed in a cell at a certain time point. The resulting set of transcripts dynamically changes during the lifetime of a cell as response to external and internal stimuli. The population of transcripts that are present in one cell at a certain time point are collectively called the transcriptome21. The transcriptome is the basis for the state of that cell, i.e. which proteins are produced and in which levels and thereby how the cell will appear and function.

The Proteins and the Proteome Mature messenger RNA molecules are translated into amino acid chains. These polypeptides can be edited and post translationally modified, for example by addition of sugar moieties. Ultimately they form functional three dimensional protein molecules. The proteins are the working force of the cell. To mention a few functions, they catalyze chemical reactions, and function as structural building blocks, channels, transporters and signaling substances. The collection of all the proteins present in a cell is called the proteome and is studied in the field of .

Communities and Tissues Cells interact with the environment and each other within multicellular communities such as biofilms or tissues. These multicellular structures are quite diverse and analyzing this heterogeneity is important. For example, some of the organisms in environmental samples might possess

5

interesting properties such as production of antibiotic substances22 or ability to metabolize specific compounds23. Still, these microorganisms live side by side with many other organisms, potentially in symbiosis or competing for the same resources. Thus, their function might be hard to locate and characterize in these communities. In the case of human tissues, analyzing intercellular heterogeneity could provide information about disease progress or deepen our understanding of fundamental processes in developmental biology24. Some microorganisms are not cultivable under laboratory conditions. Moreover, human tissues are traditionally analyzed as an average of multicellular samples. Direct investigation of single cells can avoid cultivation completely and allow a holistic view of the environmental samples. The single cell analysis approach also provides an increase in the resolution and sensitivity for human samples25. Throughout the rest of this thesis, the focus will be on analysis of DNA from single cells.

6 Erik Borgström

Analysis of Cells

One, Two or Several Molecules High throughput analysis of DNA sequences from single cells is a relatively new research field. PCR was performed on single cells already during the 1980s26 and amplification of whole genomes soon followed27. It was not until recently that molecular techniques and the cost of sequencing reached a level that allowed for high throughput analysis of single cells. Still, it is worth noting that other techniques for single cell analysis such as histological observations has been done for hundreds of years and more modern techniques like fluorescent in situ hybridization (FISH)28 has been around for decades.

Genetic material for sequence analysis is generally extracted from samples containing a large number of cells (for example from an individual, a population or an environmental sample). This classical approach to DNA sequence analysis assays a mixture of hundreds to several thousands of cells (and/or molecules) and any conclusions drawn are thus from population wide averages. This introduces a risk of missing important information about sample heterogeneity and rare entities are especially hard to detect29 (e.g. circulating tumor cells). The two main challenges when working with single cell DNA sequencing as compared to regular DNA sequencing are (1) acquiring the single cells preferably with as little perturbation of the cell as possible and (2) amplification of the DNA or RNA to be able to prepare sequencing libraries from the samples30. My work has dealt with amplification, library preparation and analysis of DNA sequences (often originating from single cells) but acquiring the cells is just as important for the end result.

Acquiring Single Cells Isolation of single cells is a prerequisite for analysis. It is usually done by physical separation of the cells prior to amplification and analysis and a number of such techniques are described in the paragraphs below.

A recent study identified fluorescence activated cell sorting (FACS), laser microdissection (LCM), manual cell picking, random seeding/dilution and as the most commonly used methods for single cell isolation31. No single technology, however, is applicable to all sample-

7

types and therefore one may not always be able to choose freely from the currently available methods.

Serial dilution and manual (or robotic) pipetting followed by visual inspection of single cells probably require the least instrumentation in the lab. By enough dilution of a sample and depositing aliquots to wells, it can be statistically ensured that a large proportion of the wells will contain no more than one single cell. When diluted enough, the distribution of single cells will follow the Poisson distribution. Visual inspection (or similar) is finally needed to verify that a cell is present31. Serial dilution was used to isolate single cells in Paper 2.

Another method for single cell isolation is to, after visual inspection of a sample, pick out the desired cells using a micromanipulator. There are two types of micromanipulator: mechanical and optical. Both types are accompanied by visualization through microscopy. The earliest mechanical micromanipulators were described during the 19th century and used regular screws to carefully control movement. More recent mechanical instruments use hydraulics32. The optical type, also called optical tweezers, utilizes one or several lasers to move cells around. These methods require no physical contact with the cell. However, long exposure of the sample to lasers can damage the cells33.

Laser capture microdissection (LCM) is similar to optical tweezers in the sense that both use lasers to manipulate the sample. However instead of using low to moderate intensity infrared lasers which is usually the case for optical tweezers, the LCM instrument uses high intensity lasers that can cut biological tissue34. LCM was used to isolate single cells in paper 3.

Both LCM and micromanipulators not only isolate single cells but also give the opportunity to select desired cells from a sample or tissue section. This still requires manual intervention, which might limit the throughput of such technologies. Fluorescent activated cell sorting (FACS, not to be confused with Fax invented by Alexander Bain in 184335) on the other hand sort cells (or particles) based on fluorescent labels and light scattering properties of cells in suspension and can thereby achieve higher throughput. During FACS, cells in suspension are pushed through a nozzle creating droplets containing single cells. Fluorescent colors and

8 Erik Borgström

scattering properties of the droplets are then recorded by illumination with a laser and detection of the light signals. Depending on predefined cutoffs for the recorded parameters, droplets are selectively charged and passed through an electric field that directs them to a collection or waste container36. The collection container can be of various types and dimensions such as microtiter plates or microscope slides. The instrument can also sort a predefined number of cells to each specific location, for example a test tube. FACS is used in paper 4 to sort DNA coated beads into 96 well plates.

Lab on a chip methods have the potential to do much of the sample processing within one device, all from cell isolation to library preparation. Stephen Quake’s lab at has been working on microfluidic cell sorting devices since the 1990s37,38. Later, valves were introduced to control access to small chambers allowing for different reactions to be performed on chip, for example reverse transcription39 or library preparation40. The technologies were commercialized by Fluidigm who claims to be able to do isolation, lysis and whole genome amplification (multiple displacement amplification, described later) of 96 single cells in one run41.

Droplet microfluidics is similar to lab on a chip methods and there is a large overlap of the technologies where droplets are created and processed within lab on a chip devices. Water-in-oil droplets created by shaking, stirring or on a chip are used to compartmentalize single cells together with reagents for the assay to be performed42, for example single cell PCR or whole genome amplification43. Commercial instruments are marketed by for example Raindance technologies44.

Amplification most often follows single cell isolation and the next chapter will describe whole genome amplification.

Whole Genome Amplification Techniques A single human cell harbours approximately 7 pg of DNA in two copies of the genome. The two copies carry a distinct set of variable positions whereof some are of great importance for the functionality of the cell. The amount of RNA in a single cell is slightly higher than the DNA content, about 10-30pg45. Most transcripts are estimated to have less than

9

100 copies46, although the dynamic range is from one or a few copies up to hundreds. Due to the small numbers of nucleic acid molecules, whole genome amplification (WGA) and/or whole transcriptome amplification (WTA) are prerequisites for library preparation for massively parallel DNA sequencing of single cells. The library molecules are usually further amplified prior to the sequencing reaction to obtain detectable signals for base calling.

A single cell genome is a population of unique single copy DNA molecules. Considering this, sequencing of single cell genomes is by definition a type of parallel single molecule analysis and as such also faces the same challenges and limitations. In the case of RNA, many molecules are not single copy. Still, there is currently no method that is capable of direct sequencing of the RNA content of a single cell. Furthermore the uniformity of the amplification is of great importance as the abundance of transcripts usually plays a central role in the analysis. All amplification reactions potentially introduce bias and errors that affect the final results. These effects are less apparent when amplifying multiple cells or molecules with high copy numbers because some of the differences are averaged out over the population. The fidelity of the amplification is therefore of extra importance when analyzing single cells.

There are several different methods for WGA, each of them with different characteristics in terms of methodology, performance, bias and error profile. The choice could for example be between using a temperature- cycled amplification vs. an isothermal amplification or a linear vs. an exponential amplification and depending on which is used the error profile may differ. Three main types of errors are introduced during amplification. Firstly, the regular replication errors such as base misincorporation, insertions, deletions, inversions etc. Secondly, artifact formation such as chimeras of different regions in the genome due to non-intended priming. Thirdly, preferential amplification (PA) of some regions over others leading to non-uniform read depth, dropout (ADO) and locus dropout. Other parameters such as compatibility with a certain isolation procedure or size of the produced fragments can also be of importance depending on the application, as discussed later. Some of the techniques used for WGA will be described below. In the present investigation section (Paper 2) a comparison of several commercial kits

10 Erik Borgström

for whole genome amplification is presented. Whole transcriptome amplification will not be covered in this thesis.

Temperature cycled amplification techniques The polymerase chain reaction47,48 (PCR) has been used to amplify specific regions of genomes since the mid 1980s. WGA methods based on PCR are temperature cycled and utilize degenerate primers or adapters carrying universal priming sites to enable amplification of large (non specific) genomic regions. Even though not intended as a WGA technique the first approach for amplifying a human genome, called interspersed repetitive sequence PCR (IRS-PCR), was published in 1989 and used ALU repeats as priming sites for PCR amplification49,50. The methods primer extension pre-amplification PCR (PEP-PCR)51 and degenerate oligonucleotide primed PCR (DOP-PCR)52 for amplification of whole human genomes were published in 1992.

In the case of PEP-PCR a 15 base degenerate oligonucleotide is used as PCR primer at low annealing temperature to promote primer hybridization. This is followed by a slow temperature increase and a long extension step. By cycling these steps the authors estimate to cover at least 78% of the genome. Adapted versions of the protocol, in which the 14-hour protocol was shortened by changing the cycling conditions, was published soon after the original article53.

DOP-PCR utilizes a 22 bases long semi-degenerated oligonucleotide. PCR is performed with low annealing temperature (Ta) (30°C) a few initial cycles, followed by 25-35 additional cycles with increased Ta. DOP-PCR was initially developed as an alternative to IRS-PCR and linker adapter PCR (LA-PCR) where a restriction enzyme is used to create overhangs to which adapters later are ligated and utilized as priming sites52. In the original paper DOP-PCR was tested on 10ng (and more) genomic material. It was later shown that the method introduced high locus dropout and strong amplification bias50.

In 1999 an improved version of the PEP protocol (improved-PEP, I-PEP) was published54. In I-PEP, another lysis buffer, an additional polymerase and an extra extension step were used to improve efficiency. I-PEP showed greater efficiency than both DOP-PCR and traditional PEP-PCR. Still only 20-50% of tested amplicons were successful for single cell

11

WGAs and 40% of the five-cell samples showed preferential amplification of one of two alleles.

As opposed to the DOP and PEP strategies where amplification is done using degenerate primers, LA-PCR utilizes incorporation of general adapters in the initial part of the protocol. It is followed by amplification using the adaptors as priming sites for the amplification reaction. There are three main strategies for adapter incorporation. The first approach is ligation-mediated PCR (LM-PCR) where adapters are attached to the ends of the fragments by ligation, either to sticky ends resulting from restriction endonuclease treatment (as in the case of LA-PCR) or blunt ends resulting from enzymatic treatment, or chemical or physical shearing. Several LM-PCR techniques have been used for single cell sequencing for example GenomePlex55,56,57 and SCOMP58, which are available as commercial kits. The second approach for incorporation of adapter sequences utilizes amplification primers carrying general handle sequences at the 5’-ends59,60 and will be further discussed below. In the third approach insertion of adapters is done by in vitro transposition.

Isothermal amplification techniques Multiple displacement amplification (MDA) is one of the most commonly used isothermal amplification methods though several others exist61,62. In MDA the target DNA and short degenerate primers (usually random hexamers) are incubated with a highly processive polymerase with strand displacement activity such as phi29. Random hexamers hybridize to the target DNA and are extended during the isothermal incubation. The polymerase simultaneously displaces the non-template strand making it available for further primer hybridization and extension63,64. This branching structure leads to a massive non-linear amplification of the starting material. As in PCR-based approaches, certain regions are amplified preferentially leading to overrepresentation of some loci65. Longer incubation time and more amplification (analogous to more cycles in the PCR based approaches) may result in greater amplification bias66.

Exponential vs. linear amplification PCR-based thermocycled techniques and MDA for WGA leads to exponential amplification of target DNA. An exponential amplification scheme potentially leads to large differences in representation of different genomic regions as length and sequence dependent biases also are

12 Erik Borgström

exponentially amplified67. Whole transcriptome amplification was initially based on PCR and exponential amplification techniques but these methods were replaced by linear in vitro transcription (IVT) amplification methods68,69. The IVT techniques were later adapted to amplification of genomic DNA70,71. Multiple annealing and looping-based amplification cycles (MALBAC) is a method for WGA that was presented in 2012 and claims to do “a close-to-linear amplification” in the first cycles of the protocol72. This is achieved by a low temperature annealing of degenerate primers with identical handle sequences attached to the 5’-ends. After at least two rounds of primer extensions the ends of fragments become complementary to each other, hybridizing and forming hairpin loops, which renders the molecules inaccessible for further rounds of amplification (it is not known to which extent this occurs). After a few rounds of this close to linear amplification, exponential amplification is performed utilizing general handle sequences as priming sites. The method has lately been adapted for amplification of RNA73. A conceptually similar technique is PicoPlex from Rubicon in which an initial pre-amplification is performed leading to accumulation of looped hairpin molecules, which are then amplified using general handles74.

After whole genome amplification a sufficient amount of product has hopefully been acquired to move on to library preparation and DNA sequencing. Depending on certain features of the WGA products (concentration, fragment size etc.) some sequencing technologies might be better suited than others. There is no point to use a long read sequencing technology if the product has an average size of a few hundred bases. The application or biological question and the budget also affect the choice of sequencing technologies and method for library preparation. In the next chapter such technologies and methods will be described.

13

Sequencing Genomes The human genome was sequenced in 13 years with an estimated cost of 3 billion USD75. A few years after the project was finished (in 2007), 's (diploid) genome was sequenced at a cost of approximately 60 million USD. In 2008, the genome of became the first to be sequenced with a massively parallel sequencing (MPS) platform, for less than one million USD76. As of October 2015 the sequencing of a genome is estimated to cost 1363 USD77. The constantly decreasing cost of sequencing has lead to increased fields of applications of the technology. Accordingly, high throughput studies of many hundreds or even thousands samples is now a reality. The rate of cost reduction for DNA sequencing followed Moore's law from the early 2000 until 2007 but a much steeper rate has been observed in the years that followed. Still, the rate seemed to level off during 2012 and the price has been relatively constant during 2014-15. Probably, the most frequently used sequencing technologies of today are approaching their maximum capacity and the lack of competitors is starting to become apparent. The massive increase in sequencing throughput observed during the last decades is not a result of one company or a single technology. The following paragraphs briefly describe the main technologies (past and current) for DNA sequencing.

DNA Sequencing Techniques There are many ways to group DNA sequencing technologies, by generation, by throughput, by detection principle or by number of molecules (e.g. single molecule or multiple molecules). The technologies will here be grouped into three blocks. First, the early sequencing technologies before the completion of the human genome project, all with relatively low throughput. Second, the massively parallel sequencing technologies by which throughput have been increased constantly since 2005 to the levels of today. And finally, single molecule sequencing technologies that currently have relatively low throughput but have opened up the possibility to examine DNA and RNA with single molecule resolution.

Early Sequencing Technologies Maxam and Gilbert published their method for DNA sequencing in 197778 and later the same year Frederick Sanger presented his dideoxy method79. A few years earlier (in 1975) Sanger had published the plus and minus

14 Erik Borgström

method80. The name comes from using two sets of four reactions, one “plus system” with only one of the four present per reaction and one “minus system” where each of the four nucleotides was omitted from one reaction. Resulting products from the eight reactions were then separated in a polyacrylamide gel. The technology had major problems with calling the correct length of homopolymers. Since strand termination only occurred at incorporation of a new base type, the length had to be decided from spacing of bands and the method was therefore replaced by other techniques81. Using the plus and minus method one could “deduce a sequence of 50 nucleotides in a few days”80.

Maxam and Gilbert's method, like the plus and minus method, produce labelled fragments of varying lengths that are separated on a gel followed by readout of the sequence. The varying lengths were achieved by selective chemical cleavage, either at sites with A or G with preference for either the A or G base, sites with exclusively C or at sites with C or T bases. The four cleavage reactions (A>G, G>A, only C, C and T) produced fragments ending at each nucleotide in the final gel image analysis, and therefore there were no need to estimate the length of homopolymers from band spacing. The authors claimed to be able to sequence up to 100 nucleotides using the method.

Sangers dideoxy method that was published 10 months later also produced fragments of all possible lengths. But, instead of chemical cleavage the method utilized primer extension by a polymerase using a mix of regular dNTPs and strand terminating dideoxynucleotides (ddNTPs). At the time of publication the author claimed to be able to read 15-200 bases and occasionally up to 300 bases with “reasonable accuracy”. In the last paragraph of the paper, Sanger and co authors mention that the user should not rely on this method alone but confirm the results using another technique. The dideoxy method is today known as Sanger DNA sequencing. Thanks to more than 20 years of improvements, the read lengths and throughput of the method has improved considerably. Sanger DNA sequencing was the main technology for sequencing the first human genome and even though the throughput is not close to matching the current MPS technologies it is still considered the gold standard for DNA sequencing (and is often used for validation of variants).

15

All three methods described above create labeled DNA fragments of various lengths and then reveal the sequence by gel separation. The pyrosequencing method, developed by Pål Nyren in 1998 used a completely different approach called sequencing by synthesis (SBS). In SBS, signal detection and sequencing is made in real time as a primed target DNA strand is extended, one base at a time. In pyrosequencing, the four bases are added one at a time (usually in a cyclic manner) to a reaction well in which the enzymes polymerase, sulfurylase, luciferase and apyrase act together to reveal incorporation of the added nucleotide. Each time a base is incorporated into an elongating strand a pyrophosphate (PPi) is released, which will then be converted to ATP by the sulfurylase. The ATP molecules will in turn be converted to a light signal by luciferase. After detection of the incorporation event, any residual nucleotides (and also ATP) are removed by apyrase and a new nucleotide is added82. Similar to Sanger's plus and minus method homopolymers are extra tricky, as all incorporation events of identical bases will be detected in one step. The authors of the original pyrosequencing article published in 1998 mention sequencing of “more than 20 bases” and envisioned improved read length in future versions. Pyrosequencing was later automated83 and eventually highly parallelized giving rise to the first MPS platform released commercially, 454 sequencing.

Massively Parallel Sequencing Technologies The emergence of massively parallel sequencing (MPS) platforms and the recent developments of these technologies have led to massive a reduction of costs for DNA sequencing. As a result, high throughput DNA sequencing is now a routine experiment in many labs.

As mentioned above, the 454 system (later acquired and distributed by Roche) was the first MPS platform to become commercially available. The first version of the system, the GS-20, was released in 200584. At the time of release, GS-20 performed pyrosequencing of DNA fragments in a so- called picotiter plate consisting of 1.6 million wells. Approximately 35% of the wells housed a single 28μm bead covered with a clonally amplified DNA fragment (originated from a process called emulsion PCR). The reagents needed for the pyrosequencing were immobilized on the surface of small packing beads (luciferase and sulfurylase) placed in each well or

16 Erik Borgström

attached to the primed targets (polymerase). After addition and incorporation of each nucleotide a washing buffer containing small amounts of apyrase could be flowed over the immobilized DNA fragments. The reaction was performed while a 16.8 megapixel CCD camera captured the light signals produced85. In the last version of the instrument around 1 million reads with a read length of about 700 bases were generated86. Roche announced the discontinuation of the 454 sequencing platform in October 201387.

Another technique for massively parallel sequencing was published in 2005, the Polonator88. It was not until 2007 that an adapted version of this approach was commercially available under the name SOLiD (sequencing by oligonucleotide ligation and detection)89. However, a year before SOLiD was introduced, the Genome Analyzer (GA)90 was developed by the UK-based company Solexa. Illumina later acquired Solexa in 2007. The Illumina sequencing platforms are the most commonly used today and will be described in more detail later.

The Polonator and the SOLiD sequencing platform are sequencing by ligation (SBL) techniques. SBL utilize the discriminatory power of ligase enzymes to deduce nucleotide sequences. Fluorescently labeled probes with known sequence in predetermined positions are hybridized to the target DNA and only if no mismatch is present the probes are ligated to the sequencing primers. The probes are blocked in one end allowing for ligation of only one probe to the sequencing primer in each cycle. Residual probes are then washed off and the fluorescent signal is detected. This is followed by removal of end blocking and fluorescent labels, thus, preparing the template for elongation in the next cycle. SOLiD produces so called color space data. The colors reflect the four fluorophores attached to ligation probes. Each probe encodes two bases and the system is designed so that if color space data is aligned to a mismatched bases can be identified as either sequencing errors or true variations (and/or artifacts)91. The complete genomics technology acquired by BGI in 201392 also use SBL but the sequencing reactions are performed on so called DNA nano balls (products from a type of phi29 based amplification called rolling circle amplification)93 instead of DNA covered beads from emulsion PCR as is the case for SOLiD.

17

In both 454 and SOLiD sequencing emulsion PCR (emPCR) is used to amplify the signal from individual library molecules prior to the sequencing reaction. Water-in-oil (w/o) emulsions are extensively used in molecular biotechnology. They are used as cell like compartments for creating small and isolated reaction environments94,95. In 2003 a protocol for amplification of single molecules within emulsion droplets, to cover the surface of μm-sized beads96 was presented. Similar methods are employed prior to the sequencing reactions such as in Paper 4 in which 454 emulsions and emPCR are extensively used.

Another technology using emulsion PCR to amplify library molecules prior to sequencing is the Ion Torrent technology published in 201197. This was the first MPS technology not using light for signal detection that reached the market. The sequencing is very similar to the 454 though instead of measuring the amount of released PPi as a light signal each base incorporation is detected by measuring the hydrogen ion released in the same process. The sequencing instrument is in a way a massively parallel pH-meter. The pH-meter array is constructed on top of a semiconductor chip and thereby this technology avoids using expensive optical devices usually necessary for signal detection in traditional MPS systems.

Unlike the MPS platforms described above, the Illumina sequencing technology (originally developed by Solexa) does not require emPCR prior to the sequencing reaction. Instead the individual library molecules are amplified by a process called bridge amplification. Illumina held two thirds of the market in 201298 and their sequencing technology is currently the most extensively used commercial platform. It is also the sequencing platform used in all the papers included in this thesis and will therefore be explained in more detail.

During the process of bridge amplification (also known as cluster generation) library molecules with general adapter sequences in each end are hybridized to a planar surface (within a so called flow cell) covered with two primers. A PCR like amplification then takes place on the surface of the flow cell using the population of the two immobilized primers. As all DNA molecules (except the original templates) are

18 Erik Borgström

attached by one end to the solid support, the amplification product from each original template generates a local cluster of clonally amplified fragments. In the cluster, both the 5’-ends of the double stranded products are attached to the surface. After de-hybridization, the cluster consists of two directionally different populations of ssDNA. The cluster is made mono-directional by selective cleavage of one of the original primers. This specifically releases half of the ssDNA molecules from the flow cell resulting in one population of ssDNA within each cluster. The end result is thereby equivalent to emPCR where a clonal population of DNA molecules is also immobilized on a solid support.

Illumina’s sequencing method falls into the category of sequencing by synthesis (SBS). The sequencing starts with hybridization of a sequencing primer to the molecules of all clusters. The four bases, each labeled with a specific fluorophore, are added simultaneously during each sequencing cycle (as opposed to 454 and Ion torrent sequencing in which the four nucleotides are added one by one). The nucleotides are reversible terminators and thereby further extension is blocked after incorporation of one nucleotide at the 3’-end of the growing sequencing primer. Residual nucleotides are removed by washing and the fluorophores attached to the incorporated nucleotides are detected by imaging of the full flow cell. The terminating chemistry is then reversed and fluorophores removed, enabling the initiation of another sequencing cycle. In this manner fluorescent signals are recorded from all clusters, while the sequencing primers are synchronously extended by one base at a time99.

Like most of the companies marketing MPS platforms, Illumina provides several different instruments. The basic sequencing principles are the same with minor modifications of the technology or chemistry leading to varying throughput, runtimes and costs. For example, some versions of the instruments use predetermined cluster locations on the flow cell rather than a full surface coating of primers100. This allows for easier signal detection and a more dense spatial distribution of clusters. Another modification is the use of two fluorophores instead of four. In this case two of the bases are labeled with a specific fluorophore, one base is labeled with both the fluorophores and thus identified when a mixed

19

signal is detected and the last base has no fluorophore attached leading to a “dark signal”101.

In all the MPS chemistries described so far, read length is limited by nonsynchronous extension within the clonally amplified template populations. This leads to an increase in background noise for each cycle as the reaction progresses. Factors such as non-labeled nucleotides, incomplete incorporation, missing terminators, failure to remove reversible terminators and residual reagents after washing all potentially lead to a population of molecules with a few bases ahead (positive frame shift) or after (negative frame shift) in the extension step. When this population of out of phase molecules grows the strength of the correct signal decreases compared to background, leading to decreasing quality (certainty) of base calls along the sequence.

This type of out of phase signal in the sequencing reaction (often called phasing) is a result of non-synchronous extension within the clusters (or beads). This limitation should not exist if single molecules were imaged. Single molecule sequencing technologies therefore generally display longer read lengths. Recent advances in the techniques for single molecule imaging and detection have now reached the point where commercial systems are being developed. The current state of single molecule sequencing is described below.

Single Molecule Sequencing Technology As mentioned above there are no phasing issues for single molecule sequencing. Instead, signal detection becomes more complex and high error rates limit the applicability of these technologies. The higher error rates originate from the fact that signals are recorded from only one molecule instead of an average of thousands. Any incomplete reaction step that would lead to phasing in a cluster instead leads to sequencing errors. Several methods for single molecule sequencing exist today. Still, similar to the MPS techniques, library preparation often requires nanogram to microgram of input material102,103,104 and therefore individual unamplified single cells cannot be prepared.

The first single molecule sequencing platform to reach the market was the Heliscope from Helicos BioSciences. The technology is based on a paper from 2003105 but the first instrument was available in 2008106. In short,

20 Erik Borgström

the sequencing was done by detection of single fluorescently labeled nucleotides with reversible terminators much like Illumina sequencing. Despite detecting single molecules Helicos failed to achieve read lengths longer than 30 to 50 bases and the sequencing costs were high compared to the other available MPS platforms. The instrument never became a commercial success and Helicos BioSciences went bankrupt in 2012107.

Pacific Biosciences was the second company to commercialize a single molecule sequencing platform when they introduced their PacBio RS instrument in 2011108. The sequencing reaction is a single molecule SBS chemistry performed using fluorescently labeled nucleotides within ~100nm sized wells called zero-mode waveguides (ZMWs). The sequencing reaction is not cycled, instead fluorescent signals are recorded in real time from the ZMWs. The DNA fragment that is to be sequenced is attached to a polymerase, which is immobilized to the bottom of the ZMW. The ZMW is constructed in a way that the detection volume is in the zeptoliter scale, allowing detection of the single nucleotide retained in the active site of the polymerase. In the original publication, read lengths of more than 4000 bases and 17% error rate was reported109. An upgraded version of the instrument, the Pacbio RS II system, was released in 2013110 and in 2015 the release of a smaller instrument with considerably higher throughput and lower cost was announced111.

Oxford Nanopore introduced their first device, the MinION, in May 2015. It is a USB stick sized device with the longest reported read lengths among all sequencing technologies, up to a few hundred kilobases104. The MinION is a sequencing device utilizing protein nanopores immobilized in a polymer membrane. An electric potential over the membrane drives an ionic current through the pore and characteristic changes in the current can be observed when molecules pass through the pore. Currently a DNA strand is passed through the pore at a speed of 70 bases per second. The current is recorded 3000 times per second and the signal is converted into events of consecutive 5-mers that are used to perform base calling through a statistical model. The error rate is approximately 15% for so called 2D (two directional) reads where both the strands in a double stranded DNA molecule are sequenced and then globally aligned to create one consensus sequence of higher quality than individual

21

reads112. Oxford Nanopore is also developing high throughput versions of their technology for future release.

A number of additional single molecule sequencing techniques are in various development stages. For example, at the a technology is being developed in collaboration with Illumina and a proof of concept paper was published in mid 2014113,114. Like the Oxford Nanopore technology, the methods works by recording changes in ion current as DNA is translocated through the pore but instead of 5-mers, 4-mers are used for base calling. The ion currents for all 256 possible 4-mers are measured in the proof of concept publication. And the technology is used to sequence the phi X genome generating reads up to 4500 bases. Oxford nanopore also supported a recent publication exploring optical sensing in combination with nanopore sequencing to enable more efficient scale up of the throughput115. Another interesting approach is the droplet sequencing technology currently being developed by Base4. Their single molecule sequencing device is a microfluidic chip where one DNA fragment is immobilized in a channel and the DNA molecule is broken down one nucleotide at a time. Each of the released nucleotides is encapsulated in droplets containing reagents, which initiate a chemical reaction resulting in fluorescent signals specific for each of the four bases. The sequence is then read by the order of the colored droplets116. Nabsys single molecule sequencing approach is based on detection of probe hybridization sites along a high molecular DNA that is made linear and stretched in channels on a chip. Successive rounds of hybridization using labeled probes with known sequences create several overlapping genome wide tag maps. Each base is queried several times and by combining the maps the sequence can be deduced117. Note that these techniques are all in development stages and not available for independent researchers to evaluate and report their performance.

Longer read lengths with lower error rates are naturally desired. Sample/library preparation methods exist that can be used to generate longer stretches of sequence from short reads though often at the cost of throughput as higher coverage is required. Furthermore, specific types of library preparations will be required depending on which sequencing technology that is employed. The next chapter will cover sample and library preparation preceding the DNA sequencing.

22 Erik Borgström

Library Preparation for Sequencing All DNA sequencing techniques require some kind of sample preparation (also called library preparation) prior to the sequencing reaction. Depending on the application and/or the specific biological question, different preparation procedures may be applied. All sample preparation procedures are not considered as part of library preparation (eg extraction of genomic DNA can be used prior to regular PCR) but all library preparations are some kind of sample preparation and the term will be used interchangeably throughout the chapter.

For the MPS techniques, general adapters at the ends of fragments are always needed to enable amplification and immobilization of each single molecule. The adapters are usually also hybridization sites for sequencing primers and in some cases contain sample specific barcodes. Traditionally, attachment of adapter sequences is achieved by ligation of oligonucleotides to the ends of fragmented genomic DNA (after an end repair reaction). Other approaches include amplification using primers with handle sequences at 5’-ends or insertion of adapters using a transposase118. In many cases the sample is enriched for fragments carrying adapter sequences by amplification with primers hybridizing to adapters.

Whole genome sequencing (WGS) may be performed on extracted DNA after addition of adapters and quality control. Still, selecting a specific range of library molecules based on fragment size is common. This selection step ensures a well-defined size interval and provides useful information during data analysis. Amplification is also a common step in all library preparations. However, it may introduce bias and as a consequence amplification-free library preparation has become an attractive approach119. In such protocols, loss of sample is minimized but the amount of required starting material is generally much higher120. Therefore, to reduce the required amount of DNA, many scientists still prefer an amplification step in the library preparation prior to sequencing.

In contrast to whole genome sequencing, whole exome sequencing (WES) is the sequencing of a library enriched for exonic regions. This is a common laboratory procedure that reduces the complexity of the library.

23

A complexity reduction might be desired to decrease costs, reduce the computational load of the analysis or simply because data from the whole genome is not needed. There are several methods to enrich for a smaller region of a genome. In the cases where the sequence of the targeted region is known, hybridization of synthetic complementary probes is a common approach. The probes are either immobilized at the time of hybridization (e.g. on a glass slide) or are in solution and will be captured after hybridization to the target (e.g. biotinylated probes captured by streptavidin coated beads). Thereby, the target molecules within the library with sequences similar to the capture probes will be retained while others are washed away121,122. This technique was used for exome genotyping of single fat cells in paper 2. Several other methods for enrichment of targeted genomic regions exist but will not be covered here.

Mate pair is another type of library preparation123. Instead of being a subset of the original library molecules, mate pair libraries add an extra level of information to the results. In the preparation steps, the two ends of each genomic fragment are joined (prior to adapter ligation). Thus, sequencing is performed on ends of fragments that originally were separated by several kilobases. In this approach a narrow size interval is selected for the original fragments which leads to acquisition of sequence pairs with a known (and long) distance between them. This long and relatively well-defined spacing between the sequences is useful to span repeat regions and connect contigs when performing de novo sequencing.

Library preparation could be designed to enrich for sequences with connection to specific functions or interactions. An example is to physically link proteins interacting with specific DNA sequences, by precipitating the DNA binding protein using affinity proteomics in a procedure called chromatin immunoprecipitation (ChIP)124. Another example is to chemically treat the DNA with bisulfite to convert cytosines to uracils. Methylated cytosines are resistant to this treatment and thus methylation of cytosines can be deduced from the resulting sequencing data125. Additionally, DNA sequencing can be used to analyze other kinds of molecules. One example is to detect ligation events resulting from protein interaction or colocalization126,127. Another example is RNA sequencing (RNA-Seq) where RNA is converted to complementary DNA

24 Erik Borgström

by a reverse transcriptase during the initial steps of the library preparation. However, direct sequencing of RNA is also possible128,129 even though it is not a common approach.

The amount of sequencing data needed per library varies considerably for different applications. Often the flow cell (or plate) where the sequencing reaction takes place is divided into a number of physically separated regions (often referred to as lanes). Several samples can thereby be run within the same instrument run (multiplexed) without pooling samples, which would result in mixed data. However, the output of the sequencing instruments is continuously increasing and in some cases many samples could be sequenced in one lane of the machine. To fully utilize the capacity of the instruments a unique sequence can be attached to each sample or library before pooling of the libraries. This way of encoding the origin of a sequence using a short known DNA sequence is often called barcoding (indexing or tagging). Barcoding allows for post sequencing separation of the mixed data in a process called de-multiplexing.

Barcodes can be added directly to the ends of the DNA fragments or as parts of adapters. Several methods have been published describing various barcoding procedures130,131,132. For example, sample identity can be encoded within a single sequence or as a combination of several. If two barcodes, one to each end, are added to the DNA fragment or amplicon, chimeric constructs can be detected. Using combinations of barcodes is also a straightforward way of increasing the number of samples. Thereby enabling multiplexing of thousands of samples with use of just a few hundred synthetic barcode sequences133. Apart from multiplexing, barcoding has also been used to uniquely identify individual molecules by attaching degenerate sequences to DNA molecules, allowing removal of PCR duplicates134 or counting of individual transcripts135.

Barcoding is also extensively used when trying to retain or phase information from single molecules. These methods rely on tracing short sequence reads back to the original (long) single molecule. In some cases the original fragment is reconstructed in silico (by assembly)137,140. In other cases lower sequencing depth is acquired and the connection between the reads is kept for haplotyping after mapping136,139. A common approach is sequencing a diluted set of molecules where each genomic

25

region is represented (at most) once139,140. Alternatively, all library molecules that stem from the same original molecule can be tagged with identical barcodes138,142. Both approaches are used to construct libraries for MPS where short reads can be connected to each other after sequencing. These methods can thereby be said to increase the apparent read length of the current MPS instruments.

A number of methods rely on keeping one end of the original fragment intact after amplification while the other end is shortened either enzymatically or by physical shearing136,137,138. This results in read pairs where one of the reads is used to identify the original molecule while the other read is spread over the length of the molecule, thereby allowing for reconstruction of the long sequence136,137,138. It is worth noting that some of these methods do not use barcoding through synthetic oligonucleotides (to keep track of molecules) but instead rely completely on the genomic sequence at one end of the fragment136,137.

The Long fragment read (LFR) method published in 2012 uses another approach139. Instead of trying to reconstruct the original fragments, the only aim is to retain the knowledge of which original molecule each read stems from. The method works by dilution and compartmentalization of the original fragments and thereby minimizes the likelihood that any genomic sequence is present in more than one copy in each compartment. After fragmentation, barcoded libraries are made from each compartment and sequenced. All reads that map to a specific genomic region and has identical barcodes can then be assumed to originate from the same fragment. The long read sequencing technology from Illumina (originally called Moleculo) also uses limiting dilution and is very similar to LFR though with the aim to reconstruct full fragments140,141. LFR and the Illumina technology use different sequencing platforms for readout and somewhat different procedures for library construction. Still, the main difference is the nature of the data analysis where LFR is designed for re- sequencing and alignment while the Illumina technology tries to reassemble the original fragments. By reconstruction of full length or longer partial sequences the technology can also be used for de novo genome sequencing. The company 10x Genomics recently announced their own platform for the same application. Their technology is based on compartmentalization of long molecules within emulsion droplets

26 Erik Borgström

generated in a microfluidic station. Within each droplet, a barcode and reagents for sequencing library preparation are also present142. Examination of currently available data indicates that their aim is more similar to the LFR technology but as overlapping reads from single molecules exist reconstruction of full fragments will most probably be achievable by using a greater sequencing depth. Pacific Biosciences is also developing a similar approach in collaboration with RainDance technologies. Here, long reads from the PacBio platform will be barcoded and thereby connected using a Raindance microfluidic platform to facilitate de novo assembly143. A new technology for massively parallel barcoding and phasing of single DNA molecules is presented in paper 4.

Single molecule sequencing techniques do not use adapters in both fragments ends as the traditional MPS techniques require. Still, known sequences are added to the fragments to enable immobilization and hybridization of general sequencing primers. In Helicos single molecule sequencing technology (discontinued) the fragments were polyadenylated on the 3’-ends in the sample preparation step. This reaction enabled immobilization through hybridization of the A-tails to a population of poly-T oligonucleotides attached to the flow cell surface (the poly-T oligonucleotides also functioned as sequencing primers). Pacific Biosciences ligates hairpin adapters (self-hybridizing ssDNA oligonucleotides) to both ends of the fragments, creating not only a target for a sequencing primer but also a circulated template that enables several rounds of sequencing of the same DNA molecule (given that the total read length is not too long). Similarly, a hairpin adapter is attached to one end of the fragment prior to nanopore sequencing using the MinION system. The hairpin adapter covalently links the 5’ and 3’ bases of one end and a regular dsDNA adapter is attached to the other end. In this way one end of each fragment is left “open” while the other end is joined. This molecular layout allows both strands to pass through the pore and thus, the sequence of each molecule is read twice.

27

Understanding the Sequences Sequences of Gs, As, Ts and Cs are not meaningful without interpretation and understanding of the data. If interpreted, correctly sequence data can be informative for understanding biochemical reactions, exploring the diversity of microbial environment and for diagnosing and curing diseases. Depending on the aim of the study, the data analysis can be very simple or quite complex. Routine tasks can often be performed using already available and widely acceptable tools. However, as soon as non- standard questions are to be answered, tools tailored for the specific problem such as in-house scripts and software are needed.

The first issue encountered when working with DNA sequencing data is the large amount. The raw sequences from small-scale instruments with only one or a few samples are easily handled on a laptop computer. However, when larger projects are to be analyzed the storage and computing power of regular laptops and desktops are not sufficient. Instead, so-called cloud computing clusters with many computational nodes and large arrays of storage media are common. The data generated is on par with or greater than other large computing resource consumers such as Twitter and YouTube and more efficient tools both in terms of hardware and software will be needed in the future144. Some very basic principles about the standard steps needed in DNA sequence analysis will be given in the next paragraphs. Details of algorithms used in various software are not within the scope of this thesis.

Raw reads generated directly from the sequencing instruments are often initially quality controlled to asses if the base calls are reliable enough to be mapped to a genome or if a larger than expected proportion of the read population has high/low GC, AT, adapter or otherwise unwanted sequence content. Depending on the result the raw data might need to be filtered or trimmed to remove unreliable or unwanted sequences.

The next step, after acquiring an as accurate data set as possible, is to place the reads in the right sequence context (either relative to a reference sequence or to each other) to get a representation of the genome(s) or sequence population within the sample. This is considered the most computationally intense problem in data analysis and stems from the fact

28 Erik Borgström

that all sequencing technologies available today introduce errors and provide reads shorter than full length genomes.

Assembly is traditionally used when no reference sequence is available. During assembly, overlaps between reads are identified and used to stitch the original genome together. Traditionally this was done using a hierarchical approach where the genome was sequenced and overlapped piece by piece in an ordered fashion using vector-based sample preparation and Sanger sequencing. The alternative approach, and assembly, is more extensively used when short read data from MPS technologies are used. In contrast to the hierarchical sequencing approach, shotgun sequencing involves fragmenting the complete genome and all resulting pieces are sequenced in parallel. The pieces are then stitched together simultaneously without prior information of order or overlaps145. Even though not common, de novo assembly of human genomes has been performed using Sanger sequencing, MPS, single molecule sequencing or mixes thereof (so called hybrid assemblies). Human genome assembly is an interesting future strategy for correctly identifying variation though quality and throughput of long read single molecule sequencing technologies need to improve before this will be routinely used146.

If a reference sequence exists, alignment of reads to this sequence is commonly performed to obtain a representation of the genome. Generally, alignment software search a reference sequence for locations where the minimum amounts of mismatches and insertions and/or deletions (indels) are found compared to the query. Some methods are weighted and give higher penalties to longer stretches of insertion and/or deletion and mismatch positions where the base calls are of high quality147. A large number of tools are available for DNA sequence alignment with large differences in their intended use. Software for everything from reference based high throughput alignment of short read data to pairwise alignment of full genome sequences exists.

Finally, when an adequate representation of the genome(s) or DNA population in a sample has been acquired, the next step is usually to compare it to other known sequences and/or expected patterns. In case of WGS or WES comparison is most commonly made towards the human

29

reference genome and/or a reference sample such as a tumor vs. normal sample. Variant callers use statistical modeling to determine the likelihood that a site is a true variant or not. These calculations are based on parameters such as read depth and calling qualities of the aligned reads and/or bases. Steps such as local realignment around indels and quality score recalibration are therefore common steps in variant calling pipelines for short indels and/or base substitutions148. If copy number variations (CNVs) based on differences in read coverage are to be identified, steps normalizing for mapping efficiency (e.g. GC bias) are usually performed prior to variant calling149. Identification of other types of structural variations such as larger insertions and translocations is usually based on unexpected alignments of the two reads in a pair149. Similar to assembly and alignment, several software are available for each type of variant calling. Even though many SNP calls overlap when exchanging one software for another150 a recently published study showed that results for somatic calling differ significantly depending on which software (and laboratory procedures) are used and the authors recommend using combinations of several mutation callers151.

The analysis of DNA sequencing data from other experimental setups such as Chip-Seq or RNA-Seq experiments generally use the same mapping or assembly steps. Downstream, other aspects than sequence variant calling are commonly of primary interest. These include analysis of expression levels and identification of isoforms and allele specific expressions. This type of analysis is not within the scope of this thesis.

If the data stems from an experiment performed during method development rather than standard genome sequencing, available assembly, alignment and/or variant calling approaches might not be sufficient (or needed). Instead, custom scripts for recognizing specific motifs or for identification and quantification of amplification artifacts might be required. Many open source software and libraries exist for simplifying such developments (and parsing of results from existing tools) for example Biopython152 and bioperl153.

Finally, interpretation of the results is sometimes the most exciting and challenging part. After variant calling, annotation of variants to determine their functional impact might be needed and, commonly,

30 Erik Borgström

biological or technical expertise is needed to determine the practical implications of the experiment. Except for annotation of variants, this part of the analysis needs to be tailored to each question asked.

In all studies included in this thesis custom scripts have been written (mostly in python) that either utilize the results from already available software to create informative summaries or search sequence data by simple algorithms to generate new results.

31

Present investigation The focus of my studies has been on improvement and development of methods for analysis of DNA from single DNA molecules and single cells. The thesis is based on several years of full time work involving many projects that has (so far) resulted in three publications and one manuscript. Paper 1 focuses on increasing the throughput of traditional procedures and an automation of sample preparation for MPS is presented. In paper 2, commercial kits for whole genome amplification of single cells are evaluated. Paper 3 is a study where single cell genotyping is performed through exome sequencing. Finally, in paper 4 a new method for maintaining single molecule DNA phase information that utilizes the throughput of MPS instruments is presented.

Paper 1 - Large Scale Library Generation for High Throughput Sequencing. This was my first project and resulted in a published paper in April 2011. The Division of Gene Technology at the Royal Institute of Technology had moved to for Life Laboratory (SciLifeLab) approximately one year earlier and at the same time the National Genomics Infrastructure (NGI) at SciLifeLab acquired their first Illumina HiSeq 2000 instrument. Two of the co-authors in this paper, Sverker Lundin and Joakim Lundeberg, had previously developed an automated version of a library preparation for 454 sequencing154. The idea was now to develop a similar strategy for the Illumina library preparation to be able to match the throughput of the HiSeq instrument. To make this possible the tedious size selection by gel electrophoresis step had to be replaced with an automated procedure. This was done through two sequential DNA precipitation reactions. First, large fragments were precipitated onto paramagnetic beads by using a poly ethylene glycol (PEG) and salt buffer. The beads with large DNA fragments were removed and by adding fresh beads and increasing the PEG concentration DNA fragments larger than a predefined size-cutoff were precipitated. The precipitated DNA was then re-suspended and used in downstream procedures. By varying the PEG concentrations we managed to control the size distribution of the selected DNA. A similar procedure using commercially available AMPure beads from Agencourt Biosciences was published one year earlier and the authors mentioned automation as a further development of the protocol155. In total, five libraries (three human cell line libraries and two spruce libraries) were

32 Erik Borgström

prepared with the automated protocol. All samples were part of two large sequencing efforts at SciLifeLab: the spruce genome sequencing project156 and a cell line characterization project157. Sequencing data from manually prepared libraries was also analyzed to enable a comparison of the automatic and manual procedures. The presented size selection procedure showed robust results and automated and manual libraries had comparable performance in terms of sequencing quality. The automated protocol (and an updated version) has been extensively used by the NGI group at SciLifeLab. It has also been an important part in all studies presented in this thesis as well as ongoing projects.

Paper 2 - Comparison of Whole Genome Amplification Techniques for Human Single Cell (Exome) Sequencing. Since my first year at SciLifeLab I have been part of a project where acquiring accurate single cell sequencing data, to be able to call somatic , is central to success. Furthermore, genotyping of single cells was the goal of my effort in paper 3. We realized early on that our ability to the call variants depended much on the method used for whole genome amplification and decided therefore to make a comparison of a number of commercially available kits. Four commercial kits was selected for the comparison: AMPLI1 (Silicon Biosystems), MALBAC (Yikongenomics), PicoPlex (Rubicon) and REPLI-g (Qiagen). The REPLI-g kit is a regular MDA optimized for small input amounts while both the PicoPlex and MALBAC kits utilize some type of preamplification prior to an exponential phase. The AMPLI1 kit is the commercialization of the SCOMP technique that fragments genomic DNA using restriction enzymes prior to ligation of adapters. For each kit two single cells were amplified and the WGA product was prepared for sequencing using the automated protocol described in paper 1 and then enriched for exome content. To make a fair comparison, all kits were compared to an unamplified gDNA sample prepared in the same manner. The largest differences between the kits were how well the targeted regions were covered and the evenness of the read-depth distribution over the covered regions. These results were reflected in the variant calling steps where considerable number of allele dropout events was observed for the samples that displayed low coverage and uneven read-depth. The results from this project helped us to select the MALBAC kit for future single cell whole genome amplification but AMPLI1 displayed similar results.

33

Furthermore, the project inspired us to test alternative amplification strategies while developing the experimental procedure used in paper 3.

Paper 3 - Transplanted Bone Marrow-Derived Cells Contribute to Human Adipogenesis. The contribution of bone marrow derived cells to adipogeniesis was investigated in a collaborative effort by two KI groups. This was investigated in individuals receiving bone marrow transplantation. If the transplanted bone marrow cells contributed to human adipogenesis, adipocytes (fat cells) in receiver’s fat tissue should contain the donor’s genotype. Otherwise, all adipocytes in receiver’s fat tissue should have the receiver’s genotype. The KI groups had earlier performed genotyping using microsatellite markers on pure adipocyte fractions of fat tissue samples. It was found that a small proportion of the cells displayed the donor genotype. However, single cell resolution of the results was lacking and I became involved in the project to contribute with single cell amplification and genotyping.

Single adipocytes are difficult to isolate by use of traditional methods and were thus retrieved by a technique developed by our collaborators at KI. Briefly, pure adipocyte fractions from white adipose tissue (WAT) were embedded in agarose and “agarose beads” containing single cells were isolated by LCM. The DNA content of these cells was amplified using the PicoPlex WGA kit. However, the genotyping of microsattelites used on bulk samples was not compatible with the short fragment sizes of the PicoPlex WGA products. Based on experience from paper 2, we tested other types of amplification on the beads. But changing the WGA method did not turn out to be a viable approach as all WGA kits tested except PicoPlex failed to amplify cells in agar-beads. To address this problem, we proposed Trinucleotide Threading (TnT)158 assay (a multiplex genotyping technique developed earlier in our lab) to genotype a few hundred of SNPs simultaneously. Variant calling using this approach would help to determine whether the cells originated from one or two individuals. I had spent the first year of my PHD studies working with the TnT method and was already familiar with the laboratory procedures. A few initial rounds of TnT genotyping indicated that there were cells of both donor and recipient origin as well as cells with mixed . It also became apparent that the dropout rate of the procedure (WGA+TnT) posed a

34 Erik Borgström

problem. Thus, only variants where the donor and recipient were homozygous and differing (e.g. AA and CC for a A/C locus) were considered informative. However, the set of 150 SNPs that were targeted with the TnT approach was too small to unambiguously determine the origin of the single cells. In order to increase the number of examined variants, the alternative exome sequencing approach was investigated. Low coverage sequencing of one recipient/donor pair and ten single cells indicated that a large number of informative variants were callable even though the recipient and donor were siblings. This experiment was followed by analysis of a total of three donor/recipient pairs and ~90 single cells. We observed two donor-derived fat cells and two cells displaying mixed genotypes (from two of the three recipient/donor pairs). The donor-derived cells, the mixed cells and two cells displaying the recipient genotype were also subjected to whole genome sequencing. The sequencing confirmed the results from exome sequencing. The results of this study indicate that transplanted bone marrow could contribute to mature fat cells in the recipient.

Paper 4 - Phasing of single DNA molecules by massively parallel barcoding. At the Single Cell Genomics conference in Stockholm 2014, several projects aiming at barcoding single cells or molecules within emulsion droplets were presented. During 2015 a number of papers describing such methods were published159,160,161,162. The method presented in paper 4 was one of these papers and the first of them targeting DNA molecules.

The seventh of June 2011 was the start of our project for tagging single molecules in emulsion droplets. Initially, the idea was to generate uniquely barcoded primer libraries by RCA (rolling circle amplification) of circularized oligonucleotides carrying a stretch of degenerate nucleotides. These primers would then be used to tag single molecules encapsulated in droplets. During the development process the RCA approach was replaced by use of beads covered with clonally amplified barcoded primers. In the published version of the method, an initial emulsion reaction generated amplified barcoded primers on beads. Single DNA molecules were then encapsulated together with the barcoded beads in water-in-oil droplets. Thereafter, a second round of emulsion PCR was started. The amplicons created in this reaction were linked to the

35

barcoded primer population. The products were subjected to massively parallel sequencing and the resulting data were sorted based on the barcodes. This allowed grouping reads sharing the same barcode and tracking them to a single DNA molecule trapped in a droplet. By amplifying several targets the method enables physical linking the amplicons.

The unique barcoding of each bead-bound primer library was validated by sequencing of the barcodes present on individual beads. The procedure for barcoding of amplicons from single molecules was then validated by investigating two variable regions in the bacterial 16S gene. The full procedure was tested on both a model system of four known bacterial genomes as well as a complex metagenomic sample. Comparing sequence data from the barcoded amplicons to known 16S genes showed that the majority of the active emulsion compartments successfully amplified targets from single DNA molecules with preserved phase information.

This study presents a system for barcoding of single DNA molecules with potential to target and tag more regions. Such approaches are currently under development in our group. One example is applying the method to the eight of the HLA-A gene. And a recently initiated project targets the biological contents of single cells with the ultimate goal of profiling circulating tumor cells. Another initiative is to perform DNA-assisted proteomics in emulsion droplets, a method called compartmentalized imuno-sequencing.

Future perspective Several single molecule sequencing techniques are available today. The throughput and quality of these platforms are continuously increasing while the cost is reduced. However, the greater availability and lower costs of the MPS technologies lead to the fact that these platforms will be central for most large scale projects in the coming years. A recent news piece summarizing technological advancements in 2015 predicts single molecule sequencing platforms and technologies that allow for performance of more complex experiments on MPS instruments as the most exciting areas of development for 2016163. Automation prior to MPS

36 Erik Borgström

and development of strategies for barcoding of single molecules and single cells were mentioned as examples of such “enabling technologies”.

Sample and library preparation and quality control will continue to be a prerequisite for DNA sequencing. The throughput of both traditional MPS platforms and single molecule technologies increases and the throughput of preparation procedures must follow this trend. To some extent this can be achieved by simpler and more efficient laboratory protocols but automation will definitely play a central role in this development. The method presented in paper 1 has been used in many projects, and development of automated procedures is continuously being done in our group.

Nature Methods denoted single cell sequencing as the method of the year in 2013 and the field has exploded during the last few years. Both laboratory and computational procedures for single cell sequencing are currently in extensive development. Continuous evaluation of new methods such as the one presented in paper 2 are needed to ensure that the best-suited method is applied in each project.

Barcoding methods are continuously being developed. A decade ago sequencing libraries were barcoded to increase multiplexing. Today RNA or DNA from single cells is barcoded. Further development of barcoding methods such as the one presented in paper 4 will allow encoding of even more information such as the spatial context of cells and better coverage of single cell genomes.

In conclusion, the studies presented in this thesis are highly relevant for solving the problems we face today and for a foreseeable future. Further ahead, the expectations for advancement within the field of nanopore sequencing is sky high. Speculations of how sequencing would be performed the next coming decades are only fantasies. Still, a perfect technology would not require amplification or assembly of data after analysis. Instead, direct analysis of the content of single cells through nanopores inserted directly into the cell membrane would present a close to optimal scenario.

37

38

aldrig glömma vårt manuskript, det var kämpigt men förbannat kul! Johanna Tyresö is tha shit!! och ja juste en sak till, skiter pingviner motsols? Pawel vi ses på zinken Joseph ta väl hand om jean claude, Henrik, Beata, Mahya, Kim, Linda, Joel S, José, Stefania, Niyaz, Erik, Sailendra, Maja, Jonas, Mengxiao, Florian, Kostis och tillslut Anneli “sillstryparen” mole-börger thanks for making the lab the best work environment in the world!

Extra thanks to the National Genomics Infrastructure (NGI). Especially Christian, Bahram and Kicki you have been essential for the completion of all my projects! Sebbe du kommer inte in på festen om du tar med en kolsyrepatron. Anna K the burnball is on! Joéll if I’m your tame impala, will you be mine? #MarioJustComeBack Also many thanks to Mos, Simon, Nemo, Luisa, Johannes! Yue, Emanuela, Hugo, Guillermo, Phil, Roman, Robin, the ST-group and the core people you make alfa3 the premium floor! (this applies to the clinical sequencing crew, tex Valtteri and Emma, as well, you’re still part of the floor, you can’t just move out like that)

Also thanks to all my coworkers at scilifelab not mentioned here by name.

Kursare, Kth & albanova folk och allmänna skönisar Hanna, Lisa, Anna H, Magnus, Sejsing, Frida och Filippa tack!

Stort tack till alla vänner och familj för att ni stöttar (nästan) alla dumheter jag hittar på! Erik F jag ska ta mig till Västerås snart! Majgel kommer du ihåg när vi va på Kth? Tack till Paul W! Ett stort tack till Cholitas Crew: Petter, Marcus, Niklas och Anders som höll mig över ytan även fast luren skönk…. nästa år är andraplatsen vår! (om inte Tarek tar den förstås, by the way så vill jag segla maxi nån gång). Min familj Oskar, Robin, Linnea, Mamma, Pappa, Farmor, Mormor, aunts, uncles and cousins och “the Hanssons”: Tommy, Agneta, Emelie och Helena tack för all hjälp och uppmuntran! Tack till innebandy gänget i Sofieberg utan er skulle jag vara en tjock labbråtta uppfödd på öl som inte orkade springa ett varv i ekorrhjulet!

Semlan och Humlan thanks for keeping me up at night by constantly running around and making funny noises, yeah and thank you for biting my ankles and scratching my eyes out … and thank you for keeping me company I love u!

till den helt fantastiska Johanna Tack för godiset! du är bäst!

40 Erik Borgström

References

1 Tjio, J. H. & Levan, A. The number of man. Hereditas 42, 1–6 (1956). 2 Bianconi, Eva et al. "An estimation of the number of cells in the human body." Annals of human 40.6 (2013): 463-471. 3 Hooke, Robert. Micrographia: or some physiological descriptions of minute bodies made by magnifying glasses, with observations and inquiries thereupon. 4 Mazzarello, Paolo. "A unifying concept: the history of cell theory." Nature Cell Biology 1.1 (1999): E13-E15. 5 Crick, F., Central dogma of . Nature, 1970. 227(5258): p. 561-3. 6 Choudhuri, S. (2003). The Path from Nuclein to Human Genome: A Brief History of DNA with a Note on Human Genome Sequencing and Its Impact on Future Research in Biology. Bulletin of Science, Technology and Society, 23(5), 360–367. doi:10.1177/0270467603259770 7 Watson, James D, and Francis HC Crick. "Molecular structure of nucleic acids." Nature 171.4356 (1953): 737-738. 8 Mendel, G. (1865) Versuche über Pflanzenhybriden. 9 Gerstein, M. B., Bruce, C., Rozowsky, J. S., Zheng, D., Du, J., Korbel, J. O., … Snyder, M. (2007). What is a gene , post- ENCODE? History and updated definition, 1905, 669–681. doi:10.1101/gr.6339607.Freely 10 U.S. National Library of , What is a genome? http://ghr.nlm.nih.gov/handbook/hgp/genome, retrived 2015- 11-10 11:22:29 11 Koonin EV, Galperin MY. Sequence - Evolution - Function: Computational Approaches in . Boston: Kluwer Academic; 2003. Chapter 1, Genomics: From Phage to Human.Available from: http://www.ncbi.nlm.nih.gov/books/NBK20263/ 12 Fiers, W., Contreras, R., Duerinck, F., Haegeman, G., Iserentant, D., Merregaert, J., … Ysebaert, M. (1976). Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature, 260(5551), 500–507. doi:10.1038/260500a0 13 Sanger, F., Air, G. M., Barrell, B. G., Brown, N. L., Coulson, a R., Fiddes, C. a, … Smith, M. (1977). Nucleotide sequence of bacteriophage phi X174 DNA. Nature, 265(5596), 687–695. doi:5,386 base pairs... 14 Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., … Szustakowki, J. (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860–921. doi:10.1038/35057062 15 Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., … Zhu, X. (2001). The sequence of the human genome. Science, 291(5507), 1304–51. doi:10.1126/science.1058040 16 Hattori, M. (2005). Finishing the euchromatic sequence of the human genome. Tanpakushitsu Kakusan Koso. Protein, Nucleic Acid, Enzyme, 50(2), 162–168. doi:10.1038/nature03001 17 International, T., & Consortium, H. (2003). The International HapMap Project. Nature, 426(6968), 789–796. doi:10.1038/nature02168 18 Durbin, R. M., Altshuler, D. L., Durbin, R. M., Abecasis, G. R., Bentley, D. R., Chakravarti, A., … McVean, G. A. (2010). A map of human genome variation from population-scale sequencing. Nature, 467(7319), 1061–1073. doi:10.1038/nature09534 19 The ENCODE Project Consortium, Dunham, I., Kundaje, A., Aldred, S. F., Collins, P. J., Davis, C. a, … Lochovsky, L. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57–74. doi:10.1038/nature11247 20 Morris, K. V., & Mattick, J. S. (2014). The rise of regulatory RNA. Nature Reviews , 15(6), 423–437. doi:10.1038/nrg3722 21 McGettigan, P. A. (2013). Transcriptomics in the RNA-seq era. Current Opinion in Chemical Biology, 17(1), 4–11. doi:10.1016/j.cbpa.2012.12.008 22 de Castro, A. P., Fernandes, G. da R., & Franco, O. L. (2014). Insights into novel antimicrobial compounds and antibiotic resistance genes from soil metagenomes. Frontiers in , 5(September), 489. doi:10.3389/fmicb.2014.00489 23 Mason, O. U., Hazen, T. C., Borglin, S., Chain, P. S., Dubinsky, E. A., Fortney, J. L., … Jansson, J. K. (2012). Metagenome, metatranscriptome and single-cell sequencing reveal microbial response to Deepwater Horizon oil spill. Isme J, 6(9), 1715–1727. doi:10.1038/ismej.2012.59 24 Lodato, M. A., Woodworth, M. B., Lee, S., Evrony, G. D., Mehta, B. K., Karger, A., … Walsh, C. A. (2015). Somatic mutation in single human neurons tracks developmental and transcriptional history. Science, 350(6256), 94–97. doi:10.1126/science.aab1785

41

25 Macaulay, I. C., & Voet, T. (2014). Single cell genomics: advances and future perspectives. PLoS Genetics, 10(1), e1004126. doi:10.1371/journal.pgen.1004126 26 Li, H., Gyllensten, U. B., Cui, X., Saiki, R. K., Erlich, H. A., & Arnheim, N. (1988). Amplification and analysis of DNA sequences in single human sperm and diploid cells. Nature, 335(6189), 414–417. Retrieved from http://dx.doi.org/10.1038/335414a0 27 Zhang, L., Cui, X., Schmitt, K., Hubert, R., Navidi, W., & Arnheim, N. (1992). Whole genome amplification from a single cell: implications for genetic analysis. Proceedings of the National Academy of of the of America, 89(13), 5847–51. doi:10.1073/pnas.89.13.5847 28 Gall, J. G., & Pardue, M. L. (1969). Formation and detection of RNA-DNA hybrid molecules in cytological preparations. Proceedings of the National Academy of Sciences of the United States of America, 63(2), 378–83. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=223575&tool=pmcentrez&rendertype=abstract, Accessed 20160109 29 Method of the Year 2013. (2014). Nat Meth, 11(1), 1. Retrieved from http://dx.doi.org/10.1038/nmeth.2801, 20151119 12:22:25 30 Eberwine, J., Sul, J.-Y., Bartfai, T., & Kim, J. (2013). The promise of single-cell sequencing. Nature Methods, 11(1), 25– 27. doi:10.1038/nmeth.2769 31 Gross, A., Schoendube, J., Zimmermann, S., Steeb, M., Zengerle, R., & Koltay, P. (2015). Technologies for Single-Cell Isolation. International Journal of Molecular Sciences, 16(8), 16897–16919. doi:10.3390/ijms160816897 32 El-Badry, H. (1963). Micromanipulators. In Micromanipulators and Micromanipulation SE - 3 (Vol. 3, pp. 22–72). Springer Vienna. doi:10.1007/978-3-7091-5551-6_3 33 Ishii, S., Tago, K., & Senoo, K. (2010). Single-cell analysis and isolation for microbiology and biotechnology: methods and applications. Applied Microbiology and Biotechnology, 86(5), 1281–1292. doi:10.1007/s00253-010-2524-4 34 Datta, S., Malhotra, L., Dickerson, R., Chaffee, S., Sen, C. K., & Roy, S. (2015). Laser capture microdissection: from small samples. Histology and Histopathology, 30(11), 1255–69. doi:10.14670/HH-11-622 35 Mr. Bain’s electric printing telegraph. (1844). Journal of the Franklin Institute, 38(1), 61–65. doi:10.1016/S0016- 0032(44)91150-6 36 Bonner, W. A. (1972). Fluorescence Activated Cell Sorting. Review of Scientific Instruments, 43(3), 404. doi:10.1063/1.1685647 37 Chou, H., Spence, C., Fu, A., Scherer, A., & Quake, S. (1998). Disposable Microdevices for Dna Analysis and Cell Sorting. Proc. Solid-State and Actuator Workshop, 11–14. 38 Fu, A. Y., Spence, C., Scherer, A., Arnold, F. H., & Quake, S. R. (1999). A microfabricated fluorescence-activated cell sorter. Nature Biotechnology, 17(11), 1109–11. doi:10.1038/15095 39 Marcus, J. S., Anderson, W. F., & Quake, S. R. (2006). Microfluidic single-cell mRNA isolation and analysis. Analytical Chemistry, 78(9), 3084–3089. doi:10.1021/ac0519460 40 Tan, S. J., Phan, H., Gerry, B. M., Kuhn, A., Hong, L. Z., Min Ong, Y., … Burkholder, W. F. (2013). A microfluidic device for preparing next generation DNA sequencing libraries and for automating other laboratory protocols that require one or more column steps. PloS One, 8(7), e64084. doi:10.1371/journal.pone.0064084 41 Fluidigm. Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation. TECHNICAL NOTE PN 100-9879 A1. 42 Brouzes, E., Medkova, M., Savenelli, N., Marran, D., Twardowski, M., Hutchison, J. B., … Samuels, M. L. (2009). Droplet microfluidic technology for single-cell high-throughput screening. Proceedings of the National Academy of Sciences of the United States of America, 106(34), 14195–200. doi:10.1073/pnas.0903542106 43 Medkova, M., Brouzes, E., Dahan, E., Kotsopoulos, S., Warner, J., Larson, J., … Samuels, M. (2009). Analyzing Cancer at Single Cell Resolution with Droplet Technology Targeted Sequencing of Cancer Gene Networks Droplet-Based Single Cell Genomics Droplet Methyl-Seq With Bisulfite Treated DNA Enrichment of Resequencing Targets Sequencing of Heterogeneous. Poster, retrieved from http://raindancetech.com/rdt/wp- content/uploads/2012/05/poster_analyzing_cancer_with_droplet_technology.pdf, Accessed 20151223 15:11:59 44 http://raindancetech.com/ 45 Milo et al. Nucl. Acids Res. (2010) 38 (suppl 1): D750-D753, RNA amount in cell - Human Homo sapiens - BNID 111205. (n.d.). Retrieved from http://bionumbers.hms.harvard.edu//bionumber.aspx?id=111205&ver=2, 20151119 15:44:42 46 Islam, S., Zeisel, A., Joost, S., La Manno, G., Zajac, P., Kasper, M., … Linnarsson, S. (2014). Quantitative single-cell RNA-seq with unique molecular identifiers. Nature Methods, 11(1), 163–6. doi:10.1038/nmeth.2772

42 Erik Borgström

47 Saiki, R. K., Scharf, S., Faloona, F., Mullis, K. B., Horn, G. T., Erlich, H. A., & Arnheim, N. (1985). Enzymatic amplification of beta-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science , 230 (4732 ), 1350–1354. doi:10.1126/science.2999980 48 Mullis, K. B., & Faloona, F. A. (1987). Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Methods in Enzymology, 155, 335–50. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/3431465 49 Nelson, D. L., Ledbetter, S. a, Corbo, L., Victoria, M. F., Ramírez-Solis, R., Webster, T. D., … Caskey, C. T. (1989). Alu polymerase chain reaction: a method for rapid isolation of human-specific sequences from complex DNA sources. Proceedings of the National Academy of Sciences of the United States of America, 86(17), 6686–6690. 50 Czyz, Z. T., Kirsch, S., & Polzer, B. (2015). Principles of Whole-Genome Amplification. In Whole Genome Amplification (Vol. 1347, pp. 1–14). Retrieved from http://www.springerprotocols.com/Abstract/doi/10.1007/978-1-4939-2990-0_1, Accessed 2015 11 24 22:50:23 51 Zhang, L., Cui, X., Schmitt, K., Hubert, R., Navidi, W., & Arnheim, N. (1992). Whole genome amplification from a single cell: implications for genetic analysis. Proceedings of the National Academy of Sciences of the United States of America, 89(13), 5847–51. doi:10.1073/pnas.89.13.5847 52 Telenius, H., Carter, N. P., Bebb, C. E., Nordenskjöld, M., Ponder, B. a, & Tunnacliffe, a. (1992). Degenerate oligonucleotide-primed PCR: general amplification of target DNA by a single degenerate primer. Genomics, 13(3), 718– 725. doi:10.1016/0888-7543(92)90147-K 53 Sermon, K., Lissens, W., Joris, H., Van Steirteghem, A., & Liebaers, I. (1996). Adaptation of the primer extension preamplification (PEP) reaction for preimplantation diagnosis: single blastomere analysis using short PEP protocols. Mol Hum Reprod, 2(3), 209–212. doi:10.1093/molehr/2.3.209 54 Dietmaier, W., Hartmann, a, Wallinger, S., Heinmöller, E., Kerner, T., Endl, E., … Rüschoff, J. (1999). Multiple mutation analyses in single tumor cells with improved whole genome amplification. The American Journal of Pathology, 154(1), 83–95. doi:10.1016/S0002-9440(10)65254-6 55 Kamberov, E., Sleptsova, I., Suchyta, S., Bruening, E. D., Ziehler, W., Seward Nagel, J., … Makarov, V. (2002). Use of in vitro OmniPlex libraries for high-throughput comparative genomics and molecular haplotyping. International Symposium on Biomedical Optics, 4626, 340–351. doi:10.1117/12.472062 56 Langmore, J. P. (2002). Rubicon Genomics, Inc., 3, 557–560. 57 Navin, N., Kendall, J., Troge, J., Andrews, P., Rodgers, L., McIndoo, J., … Wigler, M. (2011). Tumour evolution inferred by single-cell sequencing. Nature, 472(7341), 90–94. doi:10.1038/nature09807 58 Klein, C. A., Schmidt-Kittler, O., Schardt, J. A., Pantel, K., Speicher, M. R., & Riethmüller, G. (1999). Comparative genomic hybridization, loss of heterozygosity, and DNA sequence analysis of single cells. Proceedings of the National Academy of Sciences of the United States of America, 96(8), 4494–9. doi:10.1073/pnas.96.8.4494 59 Grothues, D., Cantor, C. R., & Smith, C. L. (1993). PCR amplification of megabase DNA with tagged random primers (T- PCR). Nucleic Acids Research, 21(5), 1321–1322. 60 Zong, C., Lu, S., Chapman, A. R., & Xie, X. S. (2012). Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science (New York, N.Y.), 338, 1622–6. doi:10.1126/science.1229164 61 Kurn, N., Chen, P., Heath, J. D., Kopf-Sill, A., Stephens, K. M., & Wang, S. (2005). Novel Isothermal, Linear Nucleic Acid Amplification Systems for Highly Multiplexed Applications. Clinical Chemistry, 51(10), 1973–1981. doi:10.1373/clinchem.2005.053694 62 Lee, C. I. P., Leong, S. H., Png, A. E. H., Choo, K. W., Syn, C., Lim, D. T. H., … Kon, O. L. (2006). An Isothermal Method for Whole Genome Amplification of Fresh and Degraded DNA for Comparative Genomic Hybridization, Genotyping and Mutation Detection. DNA Research, 13(2), 77–88. doi:10.1093/dnares/dsi029 63 Lizardi PM (2000) Multiple displacement amplification. US Patent No. 6,124,120 64 Dean, F. B., Hosono, S., Fang, L., Wu, X., Faruqi, a F., Bray-Ward, P., … Lasken, R. S. (2002). Comprehensive human genome amplification using multiple displacement amplification. Proceedings of the National Academy of Sciences of the United States of America, 99(8), 5261–5266. doi:10.1073/pnas.082089499 65 Lage, J. M., Leamon, J. H., Pejovic, T., Hamann, S., Lacey, M., Dillon, D., … Lizardi, P. M. (2003). Whole genome analysis of genetic alterations in small DNA samples using hyperbranched strand displacement amplification and array- CGH. Genome Research, 13(2), 294–307. doi:10.1101/gr.377203 66 de Bourcy, C. F. a, De Vlaminck, I., Kanbar, J. N., Wang, J., Gawad, C., & Quake, S. R. (2014). A quantitative comparison of single-cell whole genome amplification methods. PloS One, 9(8), e105585. doi:10.1371/journal.pone.0105585 67 Liu, C. L., Schreiber, S. L., & Bernstein, B. E. (2003). Development and validation of a T7 based linear amplification for genomic DNA. BMC Genomics, 4(1), 19. doi:10.1186/1471-2164-4-19 68 Marko, N. F., Frank, B., Quackenbush, J., & Lee, N. H. (2005). A robust method for the amplification of RNA in the sense orientation. BMC Genomics, 6(1), 27. doi:10.1186/1471-2164-6-27

43

69 Van Gelder, R. N., von Zastrow, M. E., Yool, a, Dement, W. C., Barchas, J. D., & Eberwine, J. H. (1990). Amplified RNA synthesized from limited quantities of heterogeneous cDNA. Proceedings of the National Academy of Sciences of the United States of America, 87(5), 1663–1667. doi:10.1073/pnas.87.5.1663 70 Liu, C. L., Schreiber, S. L., & Bernstein, B. E. (2003). Development and validation of a T7 based linear amplification for genomic DNA. BMC Genomics, 4(1), 19. doi:10.1186/1471-2164-4-19 71 Shankaranarayanan, P., Mendoza-Parra, M.-A., Walia, M., Wang, L., Li, N., Trindade, L. M., & Gronemeyer, H. (2011). Single-tube linear DNA amplification (LinDA) for robust ChIP-seq. Nature Methods, 8(7), 565–567. doi:10.1038/nmeth.1626 72 Zong, C., Lu, S., Chapman, A. R., & Xie, X. S. (2012). Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science (New York, N.Y.), 338, 1622–6. doi:10.1126/science.1229164 73 Chapman, A. R., He, Z., Lu, S., Yong, J., Tan, L., Tang, F., & Xie, X. S. (2015). Single Cell Transcriptome Amplification with MALBAC. Plos One, 10(3), e0120889. doi:10.1371/journal.pone.0120889 74 PicoPLEX WGA kit technology. Available at http://rubicongenomics.com/products/picoplex/, accessed 20151129 13:11:20 75 The Human Genome Project Completion: Frequently Asked Questions. Available at https://www.genome.gov/11006943, Acessed 20151117 16:08:29 76 Erika Check, James Watson's genome sequenced. Discoverer of the blazes trail for . Available at http://www.nature.com/news/2007/070528/full/news070528-10.html, acsessed 20151117 16:06:22 77 Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed 20151117 16:13:43 78 Maxam, a M., & Gilbert, W. (1977). A new method for sequencing DNA. Proceedings of the National Academy of Sciences of the United States of America, 74(2), 560–564. doi:10.1073/pnas.74.2.560 79 Sanger, F., Nicklen, S., & Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America, 74(12), 5463–7. doi:10.1073/pnas.74.12.5463 80 Sanger, F., & Coulson, a R. (1975). A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology, 94(3), 441–448. doi:10.1016/0022-2836(75)90213-2 81 Hutchison, C. A. (2007). DNA sequencing: bench to bedside and beyond. Nucleic Acids Research, 35(18), 6227–6237. doi:10.1093/nar/gkm688 82 Ronaghi, M., Uhlén, M., & Nyrén, P. (1998). A Sequencing Method Based on Real-Time Pyrophosphate. Science, 281(5375), 363–365. doi:10.1126/science.281.5375.363 83 Nyrén, P. (2007). The history of pyrosequencing. Methods in Molecular Biology (Clifton, N.J.), 373, 1–14. doi:10.1385/1-59745-377-3:1 84 Rothberg, J. M., & Leamon, J. H. (2008). The development and impact of 454 sequencing. Nature Biotechnology, 26(10), 1117–1124. doi:10.1038/nbt1485 85 Margulies, M., Egholm, M., Altman, W. E., Attiya, S., Bader, J. S., Bemben, L. A., … Rothberg, J. M. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(September), 376–381. doi:10.1038/nature03959 86 GS FLX+ System, available at http://454.com/products/gs-flx-system/index.asp, accsessed 20151203 10:24 87 A genomeWeb staff reporter, Roche Shutting Down 454 Sequencing Business. GenomeWeb, Oct 15, 2013, Available at https://www.genomeweb.com/sequencing/roche-shutting-down-454-sequencing-business, accesed 2015-12-02 17:14:44 88 Shendure, J. (2005). Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome. Science, 309(5741), 1728–1732. doi:10.1126/science.1117389 89 Mardis, E. R. (2008). The impact of next-generation sequencing technology on genetics. Trends in Genetics, 24(3), 133–141. doi:10.1016/j.tig.2007.12.007 90 Illumina Inc, History of Illumina Sequencing, available at: http://www.illumina.com/technology/next-generation- sequencing/solexa-technology.html, accessed 2015-12-03 11:05:40 91 Pandey, V., Nutter, R. C., & Prediger, E. (2008). SOLiD TM System: Ligation-Based Sequencing. In Next Generation Genome Sequencing: Towards (pp. 29–42). doi:10.1002/9783527625130.ch3 92 BGI-Shenzhen Completes Acquisition of Complete Genomics, available at: http://www.completegenomics.com/press- releases/press-release-6/, accesed 2015-12-03 18:37:53

44 Erik Borgström

93 Drmanac, R., Sparks, A. B., Callow, M. J., Halpern, A. L., Burns, N. L., Kermani, B. G., … Reid, C. a. (2009). Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays. Science (New York, N.Y.), 1469(November), 1–7. doi:10.1126/science.1181498 94 Tawfik, D. S., & Griffiths, a D. (1998). Man-made cell-like compartments for molecular evolution. Nature Biotechnology, 16, 652–656. doi:10.1038/nbt0798-652 95 Nakano, M., Komatsu, J., Matsuura, S. I., Takashima, K., Katsura, S., & Mizuno, A. (2003). Single-molecule PCR using water-in-oil emulsion. Journal of Biotechnology, 102(2), 117–124. doi:10.1016/S0168-1656(03)00023-3 96 Dressman, D., Yan, H., Traverso, G., Kinzler, K. W., & Vogelstein, B. (2003). Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proceedings of the National Academy of Sciences of the United States of America, 100(15), 8817–8822. doi:10.1073/pnas.1133470100 97 Rothberg, J. M., Hinz, W., Rearick, T. M., Schultz, J., Mileski, W., Davey, M., … Bustillo, J. (2011). An integrated semiconductor device enabling non-optical genome sequencing. Nature, 475(7356), 348–352. doi:10.1038/nature10242 98 Bernadette Toner (2012) In Sequence Survey: Illumina Holds Two-Thirds of Sequencing Market, Splits Desktop Share with Ion PGM. Available at https://www.genomeweb.com/sequencing/sequence-survey-illumina-holds-two-thirds- sequencing-market-splits-desktop-share, Accessed 20160113 99 Lakdawalla, A. and Vansteenhouse, H. (2008) Illumina Genome Analyzer II System, in Next Generation Genome Sequencing: Towards Personalized Medicine (ed M. Janitz), Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, . doi: 10.1002/9783527625130.ch2 100 HiSeq 3000/HiSeq 4000 Technology. Avialable at http://www.illumina.com/systems/hiseq-3000- 4000/technology.html, accesed 2015-12-08 13:57:47 101 Illumina Two-Channel SBS Sequencing Technology. Retrieved from http://www.illumina.com/content/dam/illumina- marketing/documents/products/techspotlights/techspotlight_two-channel_sbs.pdf, accessed 2015-12-08 13:56:08 102 Procedure & Checklist - 500 bp Template Preparation and Sequencing. Retrieved from http://www.pacb.com/wp- content/uploads/2015/09/Procedure-Checklist-500bp-Template-Preparation-and-Sequencing.pdf, Accesed 20151208 19:29:54 103 Procedure and Checklist - 20 kb Template Preparation Using BluePippinTM Size-Selection System. Retrieved from http://www.pacb.com/wp-content/uploads/2015/09/Procedure-Checklist-20-kb-Template-Preparation-Using- BluePippin-Size-Selection.pdf, 20151208 19:30:34 104 Specifications. Available at https://www.nanoporetech.com/community/specifications, Acessed 20151211 10:08:04 105 Braslavsky, I., Hebert, B., Kartalov, E., & Quake, S. R. (2003). Sequence information can be obtained from single DNA molecules. Proceedings of the National Academy of Sciences of the United States of America, 100(7), 3960–4. doi:10.1073/pnas.0230489100 106 Kircher, M., & Kelso, J. (2010). High-throughput DNA sequencing - Concepts and limitations. BioEssays, 32(6), 524– 536. doi:10.1002/bies.200900181 107 Helicos BioSciences Files for Chapter 11 Bankruptcy Protection, Available at https://www.genomeweb.com/sequencing/helicos-biosciences-files-chapter-11-bankruptcy-protection, Acessed 20151208 21:56:19 108 Karow, J. (2010). PacBio Reveals Beta System Specs for RS; Says Commercial Release is on Track for First Half of 2011. Available at https://www.genomeweb.com/sequencing/pacbios-q3-losses-widen-it-readies-launch-rs-system, Accessed 20151210 15:04:54 109 Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., … Turner, S. (2009). Real-time DNA sequencing from single polymerase molecules. Science (New York, N.Y.), 323(5910), 133–138. doi:10.1126/science.1162986 110 New Products: PacBio's RS II; Cufflinks. Available at https://www.genomeweb.com/sequencing/new-products- pacbios-rs-ii-cufflinks, acessed 20151210 16:56:14 111 Pacific Biosciences Launches New Sequencing Platform Based on Its SMRT Technology. Available at http://www.pacb.com/press_releases/pacific-biosciences-launches-new-sequencing-platform-based-on-its-smrt- technology/, Acessed 20151210 17:01:43 112 Ip CLC, Loose M, Tyson JR et al. MinION Analysis and Reference Consortium: Phase 1 data release and analysis [version 1; referees: 2 approved] F1000Research 2015, 4:1075 (doi: 10.12688/f1000research.7201.1) 113 Monica Heger, Pre-print Study Demonstrates Nanopore Sequencing of Bacteriophage Genome with MspA Pore, Jun 24, 2014, Available at: https://www.genomeweb.com/sequencing/pre-print-study-demonstrates-nanopore-sequencing- bacteriophage-genome-mspa-pore, accesed 20151215 16:27:18

45

114 Laszlo, A. H., Derrington, I. M., Ross, B. C., Brinkerhoff, H., Adey, A., Nova, I. C., … Gundlach, J. H. (2014). Decoding long nanopore sequencing reads of natural DNA. Nature Biotechnology, 32(8), 829–834. doi:10.1038/nbt.2950 115 Huang, S., Romero-Ruiz, M., Castell, O. K., Bayley, H., & Wallace, M. I. (2015). High-throughput optical sensing of nucleic acids in a nanopore array. Nature Nanotechnology, 10(11), 986–991. doi:10.1038/nnano.2015.189 116 Microdroplet sequencing. Available at http://www.base4.co.uk/microdroplet-sequencing/, Acessed 20151223 21:32:47 117 Technology. http://nabsys.com/Technology.aspx, Accesed 20151223 23:09:55 118 Adey, A., Morrison, H. G., (no last name), A., Xun, X., Kitzman, J. O., Turner, E. H., … Shendure, J. (2010). Rapid, low- input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome Biology, 11(12), R119. doi:10.1186/gb-2010-11-12-r119 119 Kozarewa, I., Ning, Z., Quail, M. A., Sanders, M. J., Berriman, M., & Turner, D. J. (2009). Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nature Methods, 6(4), 291–295. doi:10.1038/nmeth.1311 120 TruSeq DNA PCR-Free Library Preparation Kit. Available at http://www.illumina.com/products/truseq-dna-pcr-free- library-prep-kits.html, Accessed 20151215 22:00:27 121 Albert, T. J., Molla, M. N., Muzny, D. M., Nazareth, L., Wheeler, D., Song, X., … Gibbs, R. A. (2007). Direct selection of human genomic loci by microarray hybridization. Nature Methods, 4(11), 903–905. doi:10.1038/nmeth1111 122 Gnirke, A., Melnikov, A., Maguire, J., Rogov, P., LeProust, E. M., Brockman, W., … Nusbaum, C. (2009). Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nature Biotechnology, 27(2), 182–189. doi:10.1038/nbt.1523 123 Korbel, J. O., Urban, A. E., Affourtit, J. P., Godwin, B., Grubert, F., Simons, J. F., … Snyder, M. (2007). Paired-end mapping reveals extensive structural variation in the human genome. Science (New York, N.Y.), 318(5849), 420–6. doi:10.1126/science.1149504 124 Johnson, D. S., Mortazavi, A., Myers, R. M., & Wold, B. (2007). Genome-wide mapping of in vivo protein-DNA interactions. Science (New York, N.Y.), 316(5830), 1497–502. doi:10.1126/science.1141319 125 Frommer, M., McDonald, L. E., Millar, D. S., Collis, C. M., Watt, F., Grigg, G. W., … Paul, C. L. (1992). A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proceedings of the National Academy of Sciences of the United States of America, 89(5), 1827–31. doi:10.1073/pnas.89.5.1827 126 Darmanis, S., Nong, R. Y., Vänelid, J., Siegbahn, A., Ericsson, O., Fredriksson, S., … Landegren, U. (2011). ProteinSeq: High-Performance Proteomic Analyses by Proximity Ligation and Next Generation Sequencing. PLoS ONE, 6(9), e25583. doi:10.1371/journal.pone.0025583 127 Dezfouli, M., Vickovic, S., Iglesias, M. J., Schwenk, J. M., & Ahmadian, A. (2014). Parallel barcoding of antibodies for DNA-assisted proteomics. Proteomics, 14(21-22), 2432–6. doi:10.1002/pmic.201400215 128 Ozsolak, F., Platt, A. R., Jones, D. R., Reifenberger, J. G., Sass, L. E., McInerney, P., … Milos, P. M. (2009). Direct RNA sequencing. Nature, 461(7265), 814–818. doi:10.1038/nature08390 129 Vilfan, I. D., Tsai, Y.-C., Clark, T. A., Wegener, J., Dai, Q., Yi, C., … Korlach, J. (2013). Analysis of RNA base modification and structural rearrangement by single-molecule real-time detection of reverse transcription. Journal of , 11(1), 8. doi:10.1186/1477-3155-11-8 130 Binladen, J., Gilbert, M. T. P., Bollback, J. P., Panitz, F., Bendixen, C., Nielsen, R., & Willerslev, E. (2007). The Use of Coded PCR Primers Enables High-Throughput Sequencing of Multiple Homolog Amplification Products by 454 Parallel Sequencing. PLoS ONE, 2(2), e197. doi:10.1371/journal.pone.0000197 131 Parameswaran, P., Jalili, R., Tao, L., Shokralla, S., Gharizadeh, B., Ronaghi, M., & Fire, A. Z. (2007). A pyrosequencing-tailored nucleotide barcode design unveils opportunities for large-scale sample multiplexing. Nucleic Acids Research, 35(19), e130–e130. doi:10.1093/nar/gkm760 132 Cronn, R., Liston, A., Parks, M., Gernandt, D. S., Shen, R., & Mockler, T. (2008). Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by-synthesis technology. Nucleic Acids Research, 36(19), e122. doi:10.1093/nar/gkn502 133 Neiman, M., Lundin, S., Savolainen, P., & Ahmadian, A. (2011). Decoding a substantial set of samples in parallel by massive sequencing. PLoS One, 6(3), e17785. doi:10.1371/journal.pone.0017785 134 Casbon, J. A., Osborne, R. J., Brenner, S., & Lichtenstein, C. P. (2011). A method for counting PCR template molecules with application to next-generation sequencing. Nucleic Acids Res, 39(12), e81. doi:10.1093/nar/gkr217 135 Kivioja, T., Vaharautio, A., Karlsson, K., Bonke, M., Enge, M., Linnarsson, S., & Taipale, J. (2012). Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods, 9(1), 72–74. doi:10.1038/nmeth.1778

46 Erik Borgström

136 Sorber, K., Chiu, C., Webster, D., Dimon, M., Ruby, J. G., Hekele, A., & DeRisi, J. L. (2008). The long march: a sample preparation technique that enhances contig length and coverage by high-throughput short-read sequencing. PloS One, 3(10), e3495. doi:10.1371/journal.pone.0003495 137 Hiatt, J. B., Patwardhan, R. P., Turner, E. H., Lee, C., & Shendure, J. (2010). Parallel, tag-directed assembly of locally derived short sequence reads. Nature Methods, 7(2), 119–122. doi:10.1038/nmeth.1416 138 Lundin, S., Gruselius, J., Nystedt, B., Lexow, P., Käller, M., & Lundeberg, J. (2013). Hierarchical molecular tagging to resolve long continuous sequences by massively parallel sequencing. Scientific Reports, 3. doi:10.1038/srep01186 139 Peters, B. A., Kermani, B. G., Sparks, A. B., Alferov, O., Hong, P., Alexeev, A., … Drmanac, R. (2012). Accurate whole- genome sequencing and haplotyping from 10 to 20 human cells. Nature, 487(7406), 190–195. doi:10.1038/nature11236 140 Voskoboynik, A., Neff, N. F., Sahoo, D., Newman, A. M., Pushkarev, D., Koh, W., … Quake, S. R. (2013). The genome sequence of the colonial chordate, Botryllus schlosseri. eLife, 2, e00569. doi:10.7554/eLife.00569 141 Long-Read Sequencing Technology, Available at http://www.illumina.com/technology/next-generation- sequencing/long-read-sequencing-technology.html, Accesed 20151215 19:57:28 142 http://10xgenomics.com/, Accessed 20151220 13:31:37 143 Pacific Biosciences and RainDance Technologies Partner to Co-Develop and Commercialize Novel Solution For de novo Whole Genome Assembly. http://investor.pacificbiosciences.com/releasedetail.cfm?releaseid=910963, Acessed 20151209 11:26:43 144 Stephens, Z. D., Lee, S. Y., Faghri, F., Campbell, R. H., Zhai, C., Efron, M. J., … Robinson, G. E. (2015). Big Data: Astronomical or Genomical? PLOS Biology, 13(7), e1002195. doi:10.1371/journal.pbio.1002195 145 Simpson, J. T., & Pop, M. (2015). The Theory and Practice of Genome Sequence Assembly. Annual Review of Genomics and , 16(1), 153–172. doi:10.1146/annurev-genom-090314-050032 146 Chaisson, M. J. P., Wilson, R. K., & Eichler, E. E. (2015). Genetic variation and the de novo assembly of human genomes. Nature Reviews Genetics, 16(11), 627–640. doi:10.1038/nrg3933 147 Reinert, K., Langmead, B., Weese, D., & Evers, D. J. (2015). Alignment of Next-Generation Sequencing Reads. Annual Review of Genomics and Human Genetics, 16(1), 133–151. doi:10.1146/annurev-genom-090413-025358 148 Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12(6), 443–451. doi:10.1038/nrg2986 149 Zhao, M., Wang, Q., Wang, Q., Jia, P., & Zhao, Z. (2013). Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC , 14 Suppl 1(Suppl 11), S1. doi:10.1186/1471-2105-14-S11-S1 150 Altmann, A., Weber, P., Bader, D., Preuß, M., Binder, E. B., & Müller-Myhsok, B. (2012). A beginners guide to SNP calling from high-throughput DNA-sequencing data. Human Genetics, 131(10), 1541–1554. doi:10.1007/s00439-012-1213- z 151 Alioto, T. S., Buchhalter, I., Derdak, S., Hutter, B., Eldridge, M. D., Hovig, E., … Gut, I. G. (2015). A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nature Communications, 6, 10001. doi:10.1038/ncomms10001 152 Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., … de Hoon, M. J. L. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics (Oxford, England), 25(11), 1422–3. doi:10.1093/bioinformatics/btp163 153 Stajich, J. E., Block, D., Boulez, K., Brenner, S. E., Chervitz, S. A., Dagdigian, C., … Birney, E. (2002). The Bioperl toolkit: Perl modules for the life sciences. Genome Research, 12(10), 1611–8. doi:10.1101/gr.361602 154 Lundin, S., Stranneheim, H., Pettersson, E., Klevebring, D., & Lundeberg, J. (2010). Increased throughput by parallelization of library preparation for massive sequencing. PLoS One, 5(4), e10029. doi:10.1371/journal.pone.0010029 155 Lennon, N. J., Lintner, R. E., Anderson, S., Alvarez, P., Barry, A., Brockman, W., … Nicol, R. (2010). A scalable, fully automated process for construction of sequence-ready barcoded libraries for 454. Genome Biology, 11(2), R15. doi:10.1186/gb-2010-11-2-r15 156 Nystedt, B., Street, N. R., Wetterbom, A., Zuccolo, A., Lin, Y.-C., Scofield, D. G., … Jansson, S. (2013). The Norway spruce genome sequence and conifer genome evolution. Nature, 497(7451), 579–84. doi:10.1038/nature12211 157 Akan, P., Alexeyenko, A., Costea, P. I., Hedberg, L., Solnestam, B. W., Lundin, S., … Lundeberg, J. (2012). Comprehensive analysis of the genome transcriptome and proteome landscapes of three tumor cell lines. Genome Medicine, 4(11), 86. doi:10.1186/gm387 158 Pettersson, E., Lindskog, M., Lundeberg, J., & Ahmadian, A. (2006). Tri-nucleotide threading for parallel amplification of minute amounts of genomic DNA. Nucleic Acids Research, 34(6), e49. doi:10.1093/nar/gkl103

47

159 Rotem, A., Ram, O., Shoresh, N., Sperling, R. A., Schnall-Levin, M., Zhang, H., … Weitz, D. A. (2015). High- Throughput Single-Cell Labeling (Hi-SCL) for RNA-Seq Using Drop-Based Microfluidics. Plos One, 10(5), e0116328. doi:10.1371/journal.pone.0116328 160 Klein, A. M., Mazutis, L., Akartuna, I., Tallapragada, N., Veres, A., Li, V., … Kirschner, M. W. (2015). Droplet Barcoding for Single-Cell Transcriptomics Applied to Embryonic Stem Cells. Cell, 161(5), 1187–1201. doi:10.1016/j.cell.2015.04.044 161 Macosko, E. Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., … McCarroll, S. A. (2015). Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell, 161(5), 1202–1214. doi:10.1016/j.cell.2015.05.002 162 Spencer, S. J., Tamminen, M. V, Preheim, S. P., Guo, M. T., Briggs, A. W., Brito, I. L., … Alm, E. J. (2015). Massively parallel sequencing of single cells by epicPCR links functional genes with phylogenetic markers. The ISME Journal. doi:10.1038/ismej.2015.124 163 Monica Heger. Technology Improvements, Long Reads Lead NGS Market Advancements in 2015, Available at https://www.genomeweb.com/sequencing-technology/technology-improvements-long-reads-lead-ngs-market- advancements- 2015?utm_source=SilverpopMailing&utm_medium=email&utm_campaign=Sequencing%20Technology%20Bulletin:%20 Technology%20Improvements,%20Long%20Reads%20Lead%20NGS%20Market%20Advancements%20in%202015%20- %2001/05/2016%2003:30:00%20PM, Acessed 20160107 17:37:50

48