GG16CH04-Xie ARI 21 July 2015 11:16

Single-Cell Whole-Genome Amplification and Sequencing: Methodology and Applications

Lei Huang,1 Fei Ma,1 Alec Chapman,2,3 Sijia Lu,4 and Xiaoliang Sunney Xie1,2

1Biodynamic Optical Imaging Center (BIOPIC), School of Life Sciences, , 100871, 2Department of Chemistry and Chemical Biology and 3Graduate Program in Biophysics, Harvard University, Cambridge, Massachusetts 01238; email: [email protected] 4Yikon Co. Ltd., Taizhou, Jiangsu 225300, China

Annu. Rev. Genomics Hum. Genet. 2015. Keywords 16:79–102 single-cell genomics, cancer genomics, WGA, DOP-PCR, MDA, First published online as a Review in Advance on June 11, 2015 MALBAC, CTC, PGD, PGS, in vitro fertilization The Annual Review of Genomics and Human Genetics Abstract is online at genom.annualreviews.org We present a survey of single-cell whole-genome amplification (WGA) This article’s doi: 10.1146/annurev-genom-090413-025352 methods, including degenerate oligonucleotide–primed polymerase chain

Access provided by Harvard University on 09/01/15. For personal use only. reaction (DOP-PCR), multiple displacement amplification (MDA), and Copyright c 2015 by Annual Reviews. All rights reserved multiple annealing and looping–based amplification cycles (MALBAC). The key parameters to characterize the performance of these methods are de- Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org fined, including genome coverage, uniformity, reproducibility, unmappable rates, chimera rates, allele dropout rates, false positive rates for calling single- nucleotide variations, and ability to call copy-number variations. Using these parameters, we compare five commercial WGA kits by performing deep se- quencing of multiple single cells. We also discuss several major applications of single-cell genomics, including studies of whole-genome de novo mu- tation rates, the early evolution of cancer genomes, circulating tumor cells (CTCs), meiotic recombination of germ cells, preimplantation genetic di- agnosis (PGD), and preimplantation genomic screening (PGS) for in vitro– fertilized embryos.

79 GG16CH04-Xie ARI 21 July 2015 11:16

INTRODUCTION Individual cells are the fundamental units of life. DNA that carries genetic information exists as single molecules in individual cells. In biology and medicine, there is often a need to characterize genomes of individual cells for several reasons: (a) Some cells are precious and exist in low numbers [for example, human oocytes and circulating tumor cells (CTCs)]; (b) every cell is unique in its genome (for example, every sperm cell of an individual is different because of recombination); (c) the genomes of individual cells undergo stochastic changes with time, and hence single cells’ genomes at particular times can reveal their temporal evolution; and (d ) the genomes of individual cells in the same sample are heterogeneous (as in the case of primary cancer tissues), and conse- quently one often needs to determine the distribution rather than the average of a large ensemble of cells. Changes in the genome of a single cell include single-nucleotide variations (SNVs) and struc- tural variations, which result in copy-number variations (CNVs). SNVs are single-base insertions, deletions, or mutations, either transitions (e.g., C→T, A→G) or transversions (e.g., A→T, C→G). Structural variations are genomic rearrangements, including insertions or deletions (indels), du- plications, inversions, and translocations. Structural variations at large genome scales are linked to CNVs. Because of technical difficulties (chimeras; see below), structural variations are difficult to observe at the single-cell level, whereas CNVs ranging in size from hundreds of kilobases to megabases can be detected relatively easily. The copy number of a particular gene in a human somatic cell is normally two because of the two alleles (one from each parent). Although it had long been known that gene copy numbers could deviate from two in humans, genome-wide measurements of CNVs (27, 56) prompted extensive investigations of the biology and clinical consequences of CNVs, which often result from double-strand breaks that cannot be repaired perfectly (21). We note that (e.g., the methylation of DNA) represents another type of genomic change. The single-cell methylome has been determined (19), but this topic is beyond the scope of our review. Advances in next-generation DNA sequencing technologies have enabled individual human genomes to be sequenced at affordable costs (4, 43, 69, 72). With the development of single-cell whole-genome amplification (WGA) techniques—i.e., amplification of all DNA molecules after the cell has been lysed—single-cell genomics has emerged as an exciting field of its own. In this article, we review the principles and compare the performance of the existing methodologies for single-cell WGA. We note that, in parallel with single-cell genomics, single-cell transcriptomics has seen rapid developments as well; these techniques are beyond the scope of this article, and Access provided by Harvard University on 09/01/15. For personal use only. we refer readers to other recent reviews for more information (55, 57, 60). We provide examples to illustrate the utility of single-cell WGA in biomedical applications, including fundamental

Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org research on genome stability and the biology of meiosis, as well as clinical perspectives for cancer diagnosis and reproductive medicine.

SINGLE-CELL WHOLE-GENOME AMPLIFICATION METHODS Although sequencing individual DNA molecules with lengths of thousands of bases has become possible (20), currently there is not a method that can collect and sequence all DNA fragments from a single cell. Therefore, sequencing an entire genome of a cell requires WGA. Given the trace amount of DNA from a single human cell (a few picograms), extreme care should be taken to avoid contamination. Indeed, a major difficulty in single-cell genomics is substantial contamination from

80 Huang et al. GG16CH04-Xie ARI 21 July 2015 11:16

the environment and operators. Our experience is that WGA should be conducted in a dedicated clean room with controlled air pressure and quality. As such, the bacterial contamination can be controlled below 0.1% of the amount of DNA in a human cell (78). Alternatively, microfluidic devices can be used to minimize contamination (14, 31, 68), which is particularly important for WGA of a bacterial cell. Regardless of whether the WGA is done in a polymerase chain reaction (PCR) tube or a mi- crofluidic device, the chemistry is the most important aspect. Several different chemistries are avail- able for single-cell WGA, and here we review three major ones: the degenerate oligonucleotide– primed polymerase chain reaction (DOP-PCR) (61), multiple displacement amplification (MDA) (12), and multiple annealing and looping–based amplification cycles (MALBAC) (78).

Degenerate Oligonucleotide–Primed Polymerase Chain Reaction (DOP-PCR) PCR has had a major impact on biology and medicine in the past 30 years (54). It offers single- copy sensitivity—that is, a single copy of DNA can generate a signal detectable by PCR. To detect the existence of one mutation in a particular gene, one can design a PCR primer to amplify the gene locus. Attempts have been made to use PCR to amplify the entire genome using a set of primers. Primer extension preamplification PCR (PEP-PCR) was among the first single-cell WGA methods used (3, 45, 77), followed by the more widely adopted DOP-PCR (61) (Figure 1).

5'

3' N N N N N N Primer binding 5' 3'

3' N N N N N N 5'

5' 3' Primer extension

3' N N N N N N 5'

ssDNA template Access provided by Harvard University on 09/01/15. For personal use only. 3' N N N N N N 5'

3' Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org Primer binding N N N N N N 5' 3' N N N N N N 5'

5' N N N N N N 3' Primer extension

Figure 1 Schematic of the degenerate oligonucleotide–primed polymerase chain reaction (DOP-PCR), which uses degenerate oligonucleotide primers for whole-genome amplification (45). Additional abbreviation: ssDNA, single-stranded DNA.

www.annualreviews.org • Single-Cell Whole-Genome Amplification and Sequencing 81 GG16CH04-Xie ARI 21 July 2015 11:16

The principle of DOP-PCR is to use degenerate primers containing a random six-base sequence at the 3 end and a fixed sequence at the 5 end. For the initial amplification, the primers bind to the DNA template at a low annealing temperature. Strand extension is then achieved at a raised temperature. During the second stage of PCR amplification, the previous products are amplified with a primer targeting the 5 fixed sequence at a higher annealing temperature. The concentrations of the primers and polymerase directly affect the result of DOP-PCR. DOP-PCR has been used to amplify picogram quantities of human genomic DNA for genotypic analyses (7). DOP-PCR often yields low genome coverage, which is pertinent to the exponential ampli- fication of PCR. Any small differences in the amplification factors among different sequences are exponentially enlarged, causing overamplified regions and underamplified regions in the genome, and hence low coverage. Although lacking completeness in accessing the whole genome, DOP-PCR can be well suited for measuring CNVs on a large genomic scale with large bin sizes (1 million bases) (47).

Multiple Displacement Amplification (MDA) MDA was developed in 2001 by Lasken and coworkers (12) using a random hexamer as a primer and φ29 DNA polymerase, a highly processive DNA polymerase (5) with strong strand displace- ment activity. φ29 has a high replication fidelity because of its 3→5 exonuclease activity and proofreading activity (17, 51). Under isothermal conditions, MDA extends the random primers and produces branched structures, which are extended by other primers and eventually form multibranched structures (Figure 2). The DNA fragments are 50–100 kb long. MDA offers much higher genome coverage than DOP-PCR. However, like DOP-PCR, MDA is an exponential amplification process. This results in sequence-dependent bias, causing over- amplification in certain genomic regions and underamplification in other regions. However, such sequence-dependent bias of MDA is not reproducible along the genome from cell to cell, render- ing CNV measurements noisy and normalization ineffective. Nevertheless, MDA has been widely applied since its invention (37). Access provided by Harvard University on 09/01/15. For personal use only. Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org

Branched DNA

Genomic DNA Random primers ϕ29 DNA polymerase

Figure 2 Schematic of multiple displacement amplification (MDA). Random primers are used for isothermal amplification with φ29 DNA polymerase, which has strong strand displacement activity (35).

82 Huang et al. GG16CH04-Xie ARI 21 July 2015 11:16

3' 5'

MALBAC Annealing of primers primer Genomic DNA Semiamplicon Synthesis Full amplicon

Displaced strand

Denaturing

(m+1) × n

PCR Quenching Looping amplification at 0°C at 58°C MALBAC product to be sequenced m × n2 m × n

Extension Melting at 65°C at 94°C

m × n

Figure 3 Schematic of multiple annealing and looping–based amplification cycles (MALBAC). Random primers with a fixed sequence are used in a temperature cycle in which only the original genomic DNA and semiamplicons are linearly amplified, and full amplicons are protected from further amplification by the formation of DNA loops owing to the complementarity of the fixed sequences at the 3 and 5 ends. The DNA loops are polymerase chain reaction (PCR) amplified at the final stage (78). Here, m is the number of temperature cycles (m = 0 ∼ 10), and n is the number of primers bound; (m + 1) × n is the number of semiamplicons present at the mth cycle, and m × n2 is the number of full amplicons generated in the mth cycle.

Multiple Annealing and Looping–Based Amplification Cycles (MALBAC) Access provided by Harvard University on 09/01/15. For personal use only. MALBAC was first reported in 2012 by Zong et al. (78) for single-cell WGA; this method has the unique feature of quasi-linear amplification, which reduces the sequence-dependent bias

Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org exacerbated by exponential amplification. It has recently been applied to single-cell transcriptome measurement as well (6). The key to MALBAC is to not make copies of copies, and instead make copies only of the original genomic DNA by protecting the amplification products (Figure 3). The specially designed MALBAC primers have a common 27-nucleotide sequence at the 5 end and 8 random nucleotides at the 3 end, which can evenly hybridize to the template when the tem- perature is lowered (to 15–20◦C). At the beginning, semiamplicons with variable lengths (0.5–1.5 kb) are made when the temperature is elevated (to 70–75◦C). The semiamplicons are melted off from the templates (at 95◦C), from which full amplicons are made with complementary ends, causing the formation of hairpins when the temperature is lowered (to 58◦C), preventing their further amplification. This cycle is repeated 8–12 times. The quasi-linear amplification at these first few cycles is critical for avoiding the sequence-dependent bias exacerbated by exponential

www.annualreviews.org • Single-Cell Whole-Genome Amplification and Sequencing 83 GG16CH04-Xie ARI 21 July 2015 11:16

amplification. MALBAC uses a thermally stable DNA polymerase with strand displacement activity. This important preamplification stage is then followed by exponential amplification of the full amplicons by PCR, generating the amount of DNA required for next-generation sequencing. MALBAC is not a mere combination of DOP-PCR and MDA, but is fundamentally different because of its quasi-linear, as opposed to exponential, amplification. This results in two major advantages: accuracy for CNV detection and a low false negative rate for SNV detection. MALBAC is not free from sequence-dependent bias. Unlike MDA, however, MALBAC’s sequence-dependent bias is reproducible along the genome from cell to cell. Therefore, signal normalization for CNV noise reduction can be carried out. As shown below, after signal normal- ization with a reference cell, MALBAC offers the best CNV accuracy. MALBAC exhibits the lowest false negative rates for SNV calling. However, MALBAC has a higher false positive rate for SNV detection than MDA because the DNA polymerase currently used has a lower fidelity than the ϕ29 polymerase.

Characterization and Comparison of Whole-Genome Amplification Methods It is necessary to provide a critical comparison of different WGA methods. Here we choose key features to compare, such as coverage, uniformity, reproducibility, allele dropout rate, false positive rate, chimera rate, and unmappable rate, as defined below. We compared the performance of five commercial single-cell WGA kits that use one of the three methods discussed above: the Sigma- Aldrich GenomePlex Single Cell Whole Genome Amplification Kit (DOP-PCR), the Qiagen REPLI-g Single Cell Kit (MDA), the General Electric (GE) illustra Single Cell GenomiPhi DNA Amplification Kit (MDA), the Yikon Genomics Single Cell Whole Genome Amplification Kit (MALBAC), and the Rubicon Genomics PicoPLEX WGA Kit (MALBAC-like). De Bourcy et al. (11) recently compared several WGA methods for single-cell analyses of Escherichia coli. For mammalian cells, issues related to CNVs and heterozygosity of alleles are germane to genomic analyses, which prompted us to perform a comparison of human cells. We note that human cancer cell lines often exhibit aneuploidy, which seriously complicates analyses of genome coverage and allele dropout rate. Here, we chose a diploid human cell line, BJ primary human foreskin fibroblast. The lack of aneuploidy of this cell line makes it an ideal system to characterize amplification performance. To perform this comparison, we amplified several single cells and sequenced to exclude cells in metaphase (in which DNA is replicated). The amplified DNA products from each cell were fragmented by sonication to ∼250 base pairs. The sequencing libraries were prepared by using the

Access provided by Harvard University on 09/01/15. For personal use only. NEBNext Ultra DNA Library Prep Kit before sequencing on an Illumina HiSeq X Ten platform with 150-base paired-end reads. Table 1 summarizes our characterization of key parameters that are defined and discussed Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org below for the five tested kits for DOP-PCR, MDA, and MALBAC. Below, we define the parameters used in the comparison and present and discuss the results.

Coverage. At this point there has not yet been a complete human genome sequence—even the reference genome has gaps. In comparing the genome coverage of single cells with different WGA kits, we used the bulk sequencing without amplification at 30× depth as the reference and assumed it has 100% coverage. The comparison was done using the total raw data of 80–100 Gb (30×) for each cell (Table 1). The single-cell sequencing data from certain kits have a large fraction of unmappable reads, which is of course undesirable. Comparison at identical sequencing depths would be impractical

84 Huang et al. GG16CH04-Xie ARI 21 July 2015 11:16

Table 1 Comparison of key parameters of five single-cell whole-genome amplification kits WGA Raw Kit method data Coverage CV Reproducibility ADO FPR CR UMR Sigma- DOP-PCR 79 Gb 39% 0.14 0.93 76% 9.6 × 10−4 15% 64% Aldrich Qiagen MDA 78 Gb 84% 0.21 0.68 33% 1.3 × 10−4 2% 18% GE MDA 105 Gb 82% 0.17 0.31 38% 8.2 × 10−5 3% 44% − Yikon MALBAC 96 Gb 72% 0.10 0.87 21% 3.8 × 10 4 5% 22% Rubicon MALBAC- 80 Gb 52% 0.13 0.98 28% 2.4 × 10−4 13% 66% like

Kits: Sigma-Aldrich, GenomePlex Single Cell Whole Genome Amplification Kit; Qiagen, REPLI-g Single Cell Kit; GE, illustra Single Cell GenomiPhi DNA Amplification Kit; Yikon, Single Cell Whole Genome Amplification Kit; Rubicon, PicoPLEX WGA Kit. Abbreviations: ADO, allele dropout; CR, chimera rate; CV, coefficient of variation; DOP-PCR, degenerate oligonucleotide–primed polymerase chain reaction; FPR, false positive rate; MALBAC, multiple annealing and looping–based amplification cycles; MDA, multiple displacement amplification; UMR, unmappable rate; WGA, whole-genome amplification.

because of the high cost. Considering that the coverage is dependent on the actual sequencing depth, Figure 4 gives the genome coverage of the single cells referenced to the bulk as a function of the effective sequencing depth. The low effective sequencing depth for the Sigma-Aldrich DOP-PCR kit was mainly due to the high number of universal adapter reads. For Rubicon’s MALBAC-like kit, in addition to the primer and adapter reads, the high number of unmappable reads includes large contributions from the small DNA fragments (<30 base pairs) inserted into the sequencing adapters, which were too short to be mapped to the human reference genome. The coverage of Qiagen’s MDA kit is higher than that of Yikon’s MALBAC kit, which may result from MDA’s lack of reproducibility of the sequence-dependent bias for the two different alleles in the diploid cell (36). Suffice it to say

100

Sigma-Aldrich 80 Qiagen

Access provided by Harvard University on 09/01/15. For personal use only. 60 GE

Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org 40 Yikon Coverage (%) Rubicon 20 Bulk

0 0102030 40 Effective sequencing depth Figure 4 Genome coverage versus effective sequencing depth for the five commercial single-cell whole-genome amplification kits.

www.annualreviews.org • Single-Cell Whole-Genome Amplification and Sequencing 85 GG16CH04-Xie ARI 21 July 2015 11:16

that MDA and MALBAC kits have comparable levels of coverage, both of which are significantly higher than the coverage of DOP-PCR.

Uniformity. Uniform amplification is important for accurate measurements of CNV. Figure 5a shows the raw read density of 23 chromosomes, clearly illustrating the sequence-dependent bias along the genome. The variation is at the 1,000-kb scale. If the variation along the genome is the same from cell to cell, then a normalization factor can be calculated by averaging the read density of several cells in each bin and using this to normalize the single-cell data. If the purpose of the investigation is to measure CNV, then reference cells without aneuploidy should be chosen. Figure 5b shows the data normalized by the average of three cells with no aneuploidy. We characterize the uniformity by coefficients of variation of the read density (the genome- wide standard deviation divided by the mean). The normalization is done by the average of a few reference cells (as in Figure 5b) in a certain bin size (1,000 kb). Figure 6a shows the uniformity comparison of the five kits. It is evident from Figure 5a that, of the five methods, DOP-PCR gives the flattest CNV raw data without normalization. MDA creates variations along the genome that are not reproducible from cell to cell and cannot be smoothed via normalization. MALBAC’s sequence-dependent bias is reproducible from cell to cell, giving the flattest CNV after normalization.

Reproducibility. Reproducibility is important in single-cell measurements, and measurement noise needs to be minimized. We use Pearson’s cross-correlation coefficient of the read densities along the genome between two identical cells to characterize the reproducibility. Figure 6b shows our measurements of the cell-to-cell reproducibility of a WGA method by the cross-correlation coefficients for different WGA methods. It is apparent that the reproducibility of DOP-PCR and MALBAC kits is better than that of MDA methods that exhibit stochasticity in WGA.

Allele dropout rate. The allele dropout rate is one of the most important characteristics of WGA, particularly for medical applications. Allele dropout arises from uneven WGA, which needs to be improved. If a diploid cell has a heterozygous mutation, the lack of amplification in one of the two alleles causes allele dropout (Figure 7a). Allele dropout is the primary cause of false negatives of SNV calling. The allele dropout rate is measured by the ratio of the undetected and the actual heterozygous SNVs in a single cell. The latter is often approximated by the bulk measurement of Access provided by Harvard University on 09/01/15. For personal use only. identical cells. Figure 6c shows our measurements of the allele dropout rates of the five kits. It is evident that

Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org the MALBAC methods have lower allele dropout rates than the other WGA methods.

False positive rate. A false positive of base calling can arise from either a sequencing error or an amplification error. Whereas random sequencing errors can be avoided with high sequencing −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−→

Figure 5 Read density of reads from single diploid human cells with no aneuploidy from the BJ primary human foreskin fibroblast cell line, using the five commercial single-cell whole-genome amplification kits. (a) Raw data of single cells at a sequencing depth of 2× with a bin size of 1,000 kb, mapped to the human reference genome. The chromosomes are shown in alternating red and blue colors. (b) The same single-cell data normalized by the mean of three identical cells in each bin, with corresponding coefficients of variation (CVs) reflecting uniformity.

86 Huang et al. GG16CH04-Xie ARI 21 July 2015 11:16 Chromosome Chromosome Access provided by Harvard University on 09/01/15. For personal use only. Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org 4 1 . 0

=

3 V 1 1 . C 2 0

. 0 ,

1 0 . h h

=

7 c c 0 =

i i

1 V r r . = V

C d d

0 l l

C , V

, = A A n n C

- - n n , o o 12345678910111213141516171819202122XY 12345678910111213141516171819202122XY V a a e e c c n n i i C g g

m m o o , b b a a k k g g i i E E u u i i i i Sigma-Aldrich, CVS Sigma-Aldrich, = 0.14 Q Qiagen, CV = 0.21 a S Sigma-Aldrich Q Qiagen G GE Y Yikon b R Rubicon CVG GE, = 0.17 CV = 0.10 Y Yikon, CV = 0.13 R Rubicon, 6 4 2 0 6 4 2 0 6 4 2 0 6 4 2 0 6 4 2 0 6 4 2 0 6 4 2 0 6 4 2 0 6 4 2 0 6 4 2 0

www.annualreviews.org • Single-Cell Whole-Genome Amplification and Sequencing 87 GG16CH04-Xie ARI 21 July 2015 11:16

a Coefficient of variation b Reproducibility 0.25 1.0

0.20 0.8

0.15 0.6

0.10 0.4

0.05 Reproducibility 0.2

Coefficient of variation of Coefficient 0.00 Sigma- Qiagen GE Yikon Rubicon0.0 Sigma- Qiagen GE Yikon Rubicon Aldrich Aldrich

c Allele dropout rate d False positive rate 100% 1.2 × 10–3 1.0 × 10–3 80% 8.0 × 10–4 60% 6.0 × 10–4 40% 4.0 × 10–4

20% –4

False positive rate positive False 2.0 × 10 Allele dropout rate dropout Allele 0% Sigma- Qiagen GE Yikon Rubicon0.0 Sigma- Qiagen GE Yikon Rubicon Aldrich Aldrich

e Chimera rate f Unmappable rate 20% 100%

15% 80%

60% 10% 40%

Chimera rate Chimera 5% 20% Unmappable rate

0% 0% Sigma- Qiagen GE Yikon Rubicon Sigma- Qiagen GE Yikon Rubicon Aldrich Aldrich

Figure 6 Comparison of key parameters of the five commercial single-cell whole-genome amplification kits, using data generated from three single cells from the BJ primary human foreskin fibroblast cell line. (a) Coefficient Access provided by Harvard University on 09/01/15. For personal use only. of variation, calculated as the ratio of the standard deviation of read density to its average in three individual cells. (b) Reproducibility, measured by Pearson’s cross-correlation coefficient of the read densities between two identical cells. (c) Allele dropout rate, defined as the percentage of homozygous sites in the single-cell Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org samples where the bulk sample is heterozygous at the same nucleotide site. (d ) False positive rate, defined as the number of heterozygous site calls in the single cell divided by the number of sites in the bulk sample that are homozygous or have a different allele at the same nucleotide site. (e) Chimera rate, defined as the number of reads that are improperly connected (including abnormal fragment size and interchromosomal connection) divided by the total number of mappable reads. ( f ) Fraction of reads that are unmappable to the reference genome.

depth, WGA errors often dominate, occur during the first or second cycle of exponential am- plification, and are dependent on the error rate of the DNA polymerase (Figure 7b). Figure 6d compares the false positive rates of the WGA kits. The Qiagen and GE kits perform better than the MALBAC approaches, mainly because of the higher accuracy of the φ29 polymerase in the former.

88 Huang et al. GG16CH04-Xie ARI 21 July 2015 11:16

a Two alleles b Two alleles c Two separate DNA in a single cell in a single cell segments in a single cell

A A

C A

A A

C* Allele dropout False positives Chimera formation Figure 7 Schematics for three types of errors that can arise in whole-genome amplification. (a) Allele dropout. The lack of amplification of only one of the two alleles makes a heterozygous point mutation appear to be a homozygous one, causing false negative errors. (b) False positive errors arising from point mutations generated by the whole-genome amplification. (c) Chimera formation, in which different parts of the original genome are artificially connected by whole-genome amplification.

Chimera rate. Chimera formation is an artifact of WGA that generates reads that can be mapped to different parts of the genome that are not physically linked (Figure 7c). It prevents the identification of true structural variations, and single-cell structural variations are still challenging to detect owing to this technical problem. The chimera rate is defined as the percentage of reads that exhibit improper connections in the reference genome, i.e., those that show an abnormal fragment size or indicate interchromosomal connections. Figure 6e shows the chimera rates of the WGA kits. The chimera rates of the Sigma-Aldrich and Rubicon kits are much higher than those of the other kits.

Unmappable rate. The unmappable rate is the probability that junk sequences are generated in the WGA process, which can arise from the formation of primer dimers, short DNA fragments,

Access provided by Harvard University on 09/01/15. For personal use only. and/or nonspecific incorporations. A large fraction of unmappable reads reduces the cost effective- ness of genome sequencing and the completeness of the genome coverage. Figure 6f shows the un- mappable rates of the WGA kits. The Qiagen and Yikon kits showed the lowest unmappable rates. Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org We would like to point out that all of the comparisons described above were done using the existing kits available to us and that each would improve with time. As important as the specific results is how we performed the comparison. It is fair to say that each method or kit has its own advantages and disadvantages, and that selection should be based on specific applications. For accurate CNV determination, DOP-PCR and MALBAC kits offer the highest repro- ducibility and lowest coefficients of variation. MALBAC kits and Qiagen’s MDA kit are well suited for simultaneous measurements of SNVs and CNVs. For calling de novo SNV, the false negative and false positive rates need to be considered. Yikon’s MALBAC kit has the lowest false negatives (the allele dropout rate), while Qiagen’s MDA kit has the lowest false positive rate. When sequencing cost effectiveness is considered, a low unmappable rate is desirable for high effective coverage.

www.annualreviews.org • Single-Cell Whole-Genome Amplification and Sequencing 89 GG16CH04-Xie ARI 21 July 2015 11:16

We performed our comparison using small sample tubes instead of microfluidic devices; the per- formance parameters might change under microfluidic conditions and might be different for bacte- rial (small) and human (large) genomes. Although expensive, a similar comprehensive comparison of the chemistries for human genomes would be worth doing under microfluidic conditions.

STUDY OF GERM CELLS

Phasing of Human Genomes with Single-Cell Genomics One central goal of human genetics is to characterize the genetic variations of individual hu- man beings and study their correlations with function-related traits (1, 44). Human genomes are diploid in nature, and human genes often contain multiple coding and regulatory elements with variations among different populations. Different phenotypic consequences can often arise de- pending on whether the genetic variations are associated on the same chromosome (in cis)oron opposite homolog chromosomes (in trans) (2, 62). One example is the phenomenon of compound heterozygosity (28, 66), in which the two heterozygous nonsynonymous mutations in a certain gene could result in either one normal version (as in cis) or no normal version (as in trans)ofthe gene. A complete understanding of genetic variations and their consequences in human diseases cannot be obtained without resolving the combination of genetic variations at different loci on the same chromosome—that is, the haplotypes (16, 29). Without the haplotype information, the description of individual human genomes is incomplete, and the functional interpretation can be error prone. However, owing to the limited read lengths of the currently available sequencing techniques, it is often challenging to separate the two homolog chromosomes in genetic analyses to obtain the association information between the genetic variations. Molecular cloning technologies have been developed for phasing individual genomes (33, 38, 59, 76). However, the cloning procedures are often lengthy and labor intensive and therefore may not be scalable with the increasing demand for individual genome sequencing. Single-cell genome sequencing provides an alternative way to obtain phasing information on human individuals. Individual metaphase chromosomes have been isolated from actively dividing cells by techniques such as laser microdissection (41), microfluidics (14), or flow cytometry (74) before being whole-genome amplified and genotyped to obtain the association information for genetic variations on individual chromosomes. Peters et al. (50) further developed a low-cost DNA haplotyping process by diluting genomic DNA into 384 wells, each containing subcellular-level

Access provided by Harvard University on 09/01/15. For personal use only. genomic DNA. DNA molecules from each well were then separately whole-genome amplified and sequenced to obtain the phasing information. Hou et al. (24), Kirkness et al. (32), and Lu et al. (40) used whole-genome-amplified individual germ cells for haplotype analyses, further increas- Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org ing the haplotype block size to the chromosome-spanning level (Figure 8). These two categories of haplotyping approaches do not require molecular cloning, cell culturing, or sophisticated in- strumentation for chromosome isolation, potentially enabling haplotype-resolved comprehensive genetic studies and clinical applications in the near future.

Study of Meiotic Recombination in Human Germ Cells Meiotic recombination is essential to the proper segregation of homolog chromosomes, which results in the exchange of genetic information through crossover events and creates diversity for evolution (8). Abnormality in generating crossovers of homolog chromosomes is the leading

90 Huang et al. GG16CH04-Xie ARI 21 July 2015 11:16

Unphased personal genome

A/T C/G A/T C/G A/C

Single sperm

Sperm 1 A C A C Sperm 2 A A G Sperm 3 C G C Sperm 4 A T A G Sperm 5 A C A x C A Sperm 6 T T C A Sperm 7 T T C Sperm 8 T G T A

Phased genome ACAGC

TGTCA Figure 8 Phasing the genome of an individual using the single-nucleotide variation (SNV) linkage information from his single sperm cells. The diploid genome was sequenced and found to contain five heterozygous single- nucleotide polymorphisms (SNPs), with unknown linkage information shown in purple. Individual sperm cells were then sequenced via whole-genome amplification using the multiple annealing and looping–based amplification cycles (MALBAC) method, from which SNP linkage information in each sperm was used to phase the individual’s diploid genome. The light blue “T” is a whole-genome amplification or sequencing error; the black “x” represents a crossover in recombination, the switching point between the paternal and maternal DNA. Figure adapted from Reference 40.

cause of miscarriage and birth defects (13). Population analyses such as linkage disequilibrium and pedigree studies are widely used in studying the highly uneven pattern of meiotic recombination across the human genome. However, population analyses yield results that are averaged among individuals and affected by evolutionary pressures, therefore often masking the quickly evolving or individual-specific features of the recombination-active regions (23, 30, 34). Single-germ-cell genome sequencing offers a novel approach for studying meiotic recombi- Access provided by Harvard University on 09/01/15. For personal use only. nation at the level of individual human beings. Each germ cell is unique in genome construction because of the differences in recombination, and sequencing multiple single germ cells from

Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org an individual allows investigators to map individual recombination events and study germline genome instability. Single-cell studies have been published using donor sperm (40, 68) as well as oocytes (24) for mapping recombination in individual human beings. Wang et al. (68) utilized a microfluidic device to separate and sequence single sperm, thereby mapping the recombination distributions and studying the gene conversion and de novo mutation events of an individual. Lu et al. (40) sequenced ∼100 single sperm cells using the MALBAC method and mapped the recom- bination events of an individual at high resolution (Figure 9). This study revealed the correlation of a decreased crossover frequency with an increase of autosomal aneuploidy rate from human sperm, recapitulating the importance of meiotic crossover in maintaining genome stability in germ cells.

www.annualreviews.org • Single-Cell Whole-Genome Amplification and Sequencing 91 GG16CH04-Xie ARI 21 July 2015 11:16

Chromosome 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y

0 100 0 100 0 100 0 100 0 100

0 100 0 100 0 100 0 100 0 100 0 100

0 100 0 100 0 100 0 100 0 100 0 100 0 100 Paternal haplotype Centromere region 0 100 Maternal haplotype Crossover 0 100 Unresolved haplotype Percentage of reads covering paternal SNPs 0 100 Unassembled gap Percentage of reads covering maternal SNPs 0 100

0 100

Figure 9 Crossovers between paternal ( green)andmaternal(red ) DNA in the 23 chromosomes of a sperm cell. Figure adapted from Reference 40.

Hou et al. (24) utilized the first and the second polar bodies of individual oocytes to determine the crossover maps of female individuals and to study the chromatid interference effect in oocytes. This study also utilized the information of the two polar bodies to deduce the genomes of the oocyte pronuclei, enabling noninvasive preimplantation genetic diagnosis (PGD) and preimplantation genomic screening (PGS) for in vitro–fertilized embryos (Figure 10).

Preimplantation Genetic Diagnosis and Genomic Screening in In Vitro Fertilization In vitro fertilization is a technique of assisted reproductive technology by which an egg is fertilized Access provided by Harvard University on 09/01/15. For personal use only. by sperm outside the human body. The fertilized egg is cultured in vitro for 2–6 days before being implanted into the uterus with the intention of creating a successful pregnancy. Nucleic acid Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−→

Figure 10 Deduction of viability for preimplantation genomic screening for in vitro fertilization. (a) A human oocyte with homologous recombination between the paternal (blue)andmaternal(red ) DNA (only one chromosome is shown). Upon fertilization with a sperm cell, the first and second polar bodies, which are dispensable for embryo development, can be safely biopsied by a micropipette, subject to whole-genome amplification with multiple annealing and looping–based amplification cycles (MALBAC). By sequencing the two polar bodies or a few cells in the blastocyst stage of the embryo, one can avoid chromosome abnormality (middle case) as well as undesirable point mutations from a disease-carrying parent (right case) and increase in vitro fertilization’s successful rate for healthy babies (left case). (b) Deduction of the copy number of chromosomes from a female pronucleus based on its two polar bodies. The deduction is based on the fact that the total number of four chromatids is conserved, which was verified by sequencing the female pronucleus with the donor’s consent. Panel adapted from Reference 24.

92 Huang et al. GG16CH04-Xie ARI 21 July 2015 11:16

a Sperm MAN OOC HU YTE

Chr1 C

Paternal haplotype A

Maternal haplotype C

A Fertilization

Polar body 1 Polar body 2 Polar body 1 Polar body 2 Polar body 1 Polar body 2

C A A A A C A C

Male pronucleus

Female pronucleus

A

Healthy baby Birth difficulty or Baby with dominant baby with genetic disorder disease allele b

4 First polar body 3 2 1 0 4 Second polar body 3 2 1 Access provided by Harvard University on 09/01/15. For personal use only. 0 4 Female pronucleus predicted 3 2 Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org 1 0 4 Female pronucleus confirmed 3 2 1 0

4 3 2 Total chromatid copy numbers conserved in the whole oocyte 1 0 12345678910111213141516171819202122X Chromosome

www.annualreviews.org • Single-Cell Whole-Genome Amplification and Sequencing 93 GG16CH04-Xie ARI 21 July 2015 11:16

materials obtained from the in vitro–fertilized embryos can be used to perform PGD and PGS to avoid the inheritance of pathogenic mutations and chromosome abnormalities, respectively. PGD/PGS can be performed on the two polar bodies of the oocytes, one of the blastomeric cells from day-3 embryos, or several trophectoderm cells from day-5 blastocyst embryos, in which only one or a few cells can be obtained for PGD/PGS purposes. Performing WGA on these materials enables comprehensive chromosome analyses on various genome analytical platforms, such as a comparative genomic hybridization array (52), single-nucleotide polymorphism (SNP) array (63), or multiplex quantitative PCR (65). The rapid development of high-throughput sequencing techniques further reduced the cost and increased the precision and resolution of chromosome- level PGD/PGS (24, 26, 71). Treff et al. (64) utilized targeted high-throughput sequencing to perform PGD of monogenic disease, and efforts were made to perform both point mutation PGD and chromosomal PGS in the same embryo (10, 58). By using MALBAC next-generation sequencing, we were able to report a fully integrated pipeline for combined monogenic PGD and chromosomal PGS and a live birth free of a specific pathogenic mutation carried by the child’s parents (22). We envision that the advancement of WGA techniques will enable clinical trials with fully integrated pipelines for combined monogenic PGD and chromosomal PGS in the near future.

STUDY OF GENOME EVOLUTION IN CANCER Cancer is a genomic disease (67). Recent advances in single-cell WGA methods have allowed determination of CNVs at a large scale (48, 70) and at the single-gene level (15) as well as SNVs of a single cancer cell (18, 70, 75), in primary tissue, or in CTCs in blood (9, 39, 49). A pair of companion studies used single-cell WGA and exome sequencing to examine tumor heterogeneity (25, 73). These studies found hundreds of SNVs—which differ from cell to cell, revealing the genetic complexity of these tumors—and applied population genetic analysis to study their clonal composition. However, our independent analysis of these data indicates that the vast majority of these SNVs are due to sequencing artifacts or possibly contaminations, and only a handful, if any, may be attributed to true heterogeneity of the tumor samples (A. Chapman, C. Zong & X.S. Xie, unpublished results). Of the 711 tumor-specific SNVs reported in one of the studies (25), we found that 42% of them were actually present at a lower level in the normal tissue as well. These are likely to be SNPs that were mistakenly not identified in the normal tissue (false negatives) rather than tumor-specific SNVs. Furthermore, 58% of the remaining SNVs reported were present in dbSNP, a database of germline mutations known to occur in the human population. Because new mutations in a tumor arise randomly, it would be highly unlikely that a

Access provided by Harvard University on 09/01/15. For personal use only. significant number of them happen to coincide with known mutations present in the general pop- ulation. Instead, an alternative explanation for these mutations could be contaminations from the operators or other human sources. Indeed, we observed that of those remaining SNVs not found Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org in dbSNP, 84% were found in unrelated samples—either normal control cells from the same study or cells from the companion study. In summary, more than 96% of the SNVs identified as being tumor specific in that report (25) appear likely to be contaminants or sequencing artifacts. We now give a few examples of new information available from single-cell genomics.

Measurement of Spontaneous Whole-Genome Mutation Rate In calling SNVs from a single cell by WGA (24, 68, 73), one challenge is high false negative rates caused by allele dropout, and another is false positives associated with amplification and sequencing errors (either random or systematic) (42). Figure 11 shows 35 unique SNVs in a human cancer cell line (SW480) newly acquired during 20 cell divisions. Adjusting for a 70% detection efficiency for

94 Huang et al. GG16CH04-Xie ARI 21 July 2015 11:16

abSingle ancestor cell

10 11 12 8 9 7 6 5 4 3 ~20 generations 1 2 Bulk DNA

22 21 20 19 18 17 16 15 14 13 X Y

c Chromosome 2: 57,736,349

G G G G G G G G Kindred G G G single cells G G C1 C2 C3 C4 C5 C6 G G G G G G G G MALBAC G G G G G G G G G Whole-genome G G G G sequencing G G G C1 C2 C3 Bulk

Figure 11 Detecting newly acquired single-nucleotide variations (SNVs) with no false positives and estimation of the mutation rate of a human cancer cell line (SW480). (a) Experimental design. A single ancestor cell was chosen and cultured for ∼20 generations. The vast majority of cells were used to extract DNA for bulk sequencing to represent the ancestral cell’s genome. A single cell from this culture was chosen for another expansion of four generations. The kindred cells were isolated for single-cell whole-genome amplification. Single-cell samples C1, C2, and C3 were used for high-throughput sequencing. (b) Locations of the 35 newly acquired SNVs on the chromosomes of a single cell. (c) Next-generation sequencing data of a newly acquired SNV. The SNV (C→G) existed in the high-throughput data of all three kindred cells but not in the bulk data. Additional abbreviation: MALBAC, multiple annealing and

Access provided by Harvard University on 09/01/15. For personal use only. looping–based amplification cycles. Figure adapted from Reference 78.

heterozygous SNVs, Zong et al. (78) estimated that ∼49 mutations occurred in the 20 generations,

Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org yielding a mutation rate of ∼2.5 nucleotides per cell generation, which is consistent with the estimate based on the bulk data. At this rather slow spontaneous mutation rate, cancer development would take a long time. What happens at the very beginning of cancer development has been a long-standing question, and single-cell WGA is well suited to help answer it.

Does Copy-Number Variation Precede Single-Nucleotide Variation? In a bulk tissue sample at an early stage of cancer development, detection of abnormal CNVs is often difficult, especially when the number of cells with abnormal CNVs is small. Single-cell genomic analysis is essential in evaluating the relationship, if any, between CNVs and SNVs.

www.annualreviews.org • Single-Cell Whole-Genome Amplification and Sequencing 95 GG16CH04-Xie ARI 21 July 2015 11:16

Abnormal a cell growth

Hyperproliferation Adenomatous High-grade dysplasia Adenocarcinoma Invasive cancer polyps

b 4 3 CCellell 1 2 1 0 4 3 CCellell 2 2 1 0 4 3 CCellell 3 2 1 0 4 3 CCellell 4 2 1 0 4 3 CCellell 5 2 1 0 4 3 CCellell 6 2 1 0 4 3 CCellell 7 2 1 0 4 3 CCellell 8 2 1 0 12345678910111213141516171819202122XY Chromosome c 4 3 CCellell 1 2 1 0 4 3 CCellell 2 2 d 47,900,760 base pairs 1

Access provided by Harvard University on 09/01/15. For personal use only. 0 4 RefSeq G G AAT C G AT G C 3 CCellell 3 SVI S 2 KCNB1 1 0 4 Cell 1 Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org 3 CCellell 4 2 1 Cell 2 0 4 3 CCellell 5 2 Cell 3 1 0 4 Cell 4 3 CCellell 6 2 1 Cell 5 0 4 3 CCellell 7 2 Cell 6 1 0 4 Cell 7 3 CCellell 8 2 1 Cell 8 0

Chromosome 5

96 Huang et al. GG16CH04-Xie ARI 21 July 2015 11:16

We have carried out single-cell genomic analyses of colonoscopy biopsies at different adenoma stages (L. Huang, S. Ding & X.S. Xie, unpublished results). Some single cells in stage II adenoma have CNVs in the tumor suppresser gene APC (a reduction in copy number from two to one) as well as SNVs in numerous cancer-related genes. Interestingly, we found that although some single cells exhibited CNV reduction in APC without SNVs, all single cells with SNVs that have been reported as somatic mutations of colon and other cancers in the Catalogue of Somatic Mutations in Cancer (COSMIC) database showed the CNV reduction in APC. Moreover, we did not see any single cell that has colon cancer–related SNVs but no CNVs in an adenoma. Figure 12 shows such data for the COSMIC gene KCNB1, which encodes an ion channel. These data indicate that the CNV in APC precedes the SNV in colon cancer development, at least in the particular colony examined. Furthermore, we found that the single cells from the same adenoma exhibited CNV patterns reproducible among all the cells, indicating that these cells might be derived from a single stem or progenitor cell in which the CNVs first arose. Thus, we have established a correlation between the SNVs and CNVs and propose that the SNVs are generated as a consequence of abnormal CNVs in the genome, or arise after the abnormal CNVs are acquired in the original populating cell. If confirmed to be general for driving SNVs, this result could have significant implications for the genesis of cancer. Suffice it to say that this experiment underscores the importance of single-cell genomics in understanding cancer.

Circulating Tumor Cells Originating from primary tumors, CTCs enter peripheral blood and seed metastases, which ac- count for 90% of cancer-related deaths. The genome sequencing of CTCs could offer noninvasive prognosis or even diagnosis but has been hampered by low single-cell genome coverage of scarce CTCs. Ni et al. (49) applied MALBAC for WGA of single CTCs from lung cancer patients and ob- served characteristic cancer-associated SNVs and indels in exomes of CTCs. These mutations provided information needed for individualized therapy (53), such as drug resistance and pheno- typic transition, but were heterogeneous from cell to cell. Ni et al. (49) also discovered that, unlike the highly heterogeneous point mutations, the CNV patterns of CTCs are reproducible from CTC to CTC within a patient and even within different patients of the same cancer type, but are distinctly different among different cancer types

Access provided by Harvard University on 09/01/15. For personal use only. (Figure 13). Furthermore, the reproducible CNV patterns of CTCs are similar to those of the metastatic tumor. This result raised intriguing questions about the genesis of metastasis. It is evident that gains and losses in copy numbers of certain chromosome regions are selected for Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org ←−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−

Figure 12 Temporal sequence of copy-number variations (CNVs) and single-nucleotide variations (SNVs) of colon polyps. (a) Normal intestinal epithelium and adenoma. (b) CNVs detected in eight single cells (3,000-kb bin size). Some cells exhibit changes in chromosome 5 copy number from two to one. This is an ∼8-Mb region. (c) CNVs seen when zooming into chromosome 5 (1,000-kb bin size). A reduction of copy number from two to one is apparent in the 5q region containing the APC-coding gene (chromosomal region 5q21.3–23.1). The same region is seen for five different single cells, indicating that they came from the same stem cell. (d ) Four of eight single cells exhibit point mutations in the KCNB1 gene, which has been reported as a somatic mutation of colon cancer in the Catalogue of Somatic Mutations in Cancer (COSMIC) database. Serine (S), valine (V), and isoleucine (I) are the amino acids translated by the triplet codon. The KCNB1 gene is a protein-coding gene that encodes an ion channel protein. Interestingly, SNVs occur only in cells with abnormal CNVs.

www.annualreviews.org • Single-Cell Whole-Genome Amplification and Sequencing 97 GG16CH04-Xie ARI 21 July 2015 11:16

a Chromosome 12 3 4 5 6 7 8 91310 11 12 14151617 18 19 20 21 22 X Y P1 6 4 Primary 2 0 6 4 Metastasis 2 0 6 CTC 1 4 2 0 6 CTC 2 4 2 0 6 CTC 3 4 2 0 6 CTC 4 4 2 0 6 CTC 5 4 2 0 6 4 CTC 6 2 0 6 4 CTC 7 2 0 6 4 CTC 8 2 0

b Chromosome 1 2 3 4 5 6 7 8 91310 11 12 14151617 18 19 20 21 22 X Y 6 P1 4 2 (ADC SCLC) 0 6 P2 4 2 (ADC) 0 6 P3 4 2 (ADC) 0 6 4 P4 2 (ADC) 0 6 4

Access provided by Harvard University on 09/01/15. For personal use only. P5 2 (ADC) 0 6 P6 4 2

Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org (ADC) 0 6 P7 4 2 ADC+SCLC 0

c 500

200

0 P7_1 P7_2 P1_4 P1_6 P1_1 P1_2 P1_5 P1_7 P1_3 P1_8 P3_5 P3_2 P3_3 P3_1 P3_4 P2_3 P2_4 P2_5 P2_6 P2_1 P2_2 P6_1 P5_1 P5_2 P4_5 P4_4 P4_1 P4_2 P4_3

98 Huang et al. GG16CH04-Xie ARI 21 July 2015 11:16

←−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−

Figure 13 Reproducible copy-number variation (CNV) patterns of seven lung cancer patients. Patient 1 (P1) experienced a phenotypic transition from adenocarcinoma (ADC) in the lung to small-cell lung cancer (SCLC) in the metastatic liver, patients 2–6 (P2–6) have ADC, and patient 7 (P7) has a mix of ADC and SCLC. (a) Reproducible CNV patterns of eight circulating tumor cells (CTCs) in P1. The patterns are different from that of the primary tissue ( first row) but the same as that of the metastatic tissue (second row). (b)CNV patterns of CTCs from the seven patients. (c) Clustering analyses of CTCs based on the CNVs. CTCs from P1 and P7 (SCLC patients) were well separated from CTCs from P2–P6 (ADC patients). The y axis is the cluster distance constructed using Ward’s method based on Euclidean distances between the patients’ CNVs. Figure adapted from Reference 49.

metastases. The finding that the CNV patterns of CTCs are cancer or tissue dependent offers the potential for noninvasive cancer diagnosis based on the CNV patterns.

DISCLOSURE STATEMENT S.L. and X.S.X. are coauthors on a patent applied for by Harvard University for MALBAC tech- nology and are cofounders of Yikon Genomics.

ACKNOWLEDGMENTS We thank Chenghang Zong for his contribution to the invention of MALBAC. We thank Yaqiong Tang, Dandan Cao, and Hongshan Guo for technical assistance in data collection and Guangyu Zhou for help with data analysis. We are indebted to our collaborators Fuchou Tang, Jie Qiao, Xiaohui Ni, Fan Bai, Wei Fan, Liya Xu, Jie Wang, Ning Zhang, Youyong Lu, Liying Yan, Yu Hou, Zhe Su, Yan Gao, Shigang Ding, Hejun Zhang, Ruiqiang Li, Mingyu Yang, Jinsen Li, Jessica Sang, and Yanyi Huang for their contributions to the work reviewed in this article. We are grateful for funding from the National Natural Science Foundation of China (21327808), US National Institutes of Health (NIH) National Human Genome Research Institute grants (HG005097-1 and HG005613-01), and an NIH Director’s Pioneer Award (5DP1CA186693) to X.S.X. as well as a National Cancer Institute grant (5R33CA174560). The work at the Biodynamic Optical Imaging Center (BIOPIC) has been supported by Peking University 985 special funding for collaboration with hospitals and funding from Guangxi Wuzhou Zhongheng Group Co., Ltd.

LITERATURE CITED

Access provided by Harvard University on 09/01/15. For personal use only. 1. 1000 Genomes Proj. Consort. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–73 2. Bansal V, Tewhey R, Topol EJ, Schork NJ. 2011. The next phase in human genetics. Nat. Biotechnol. Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org 29:38–39 3. Barrett MT, Reid BJ, Joslyn G. 1995. Genotypic analysis of multiple loci in somatic cells by whole genome amplification. Nucleic Acids Res. 23:3488–92 4. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:53–59 5. Blanco L, Bernad A, Lazaro´ JM, Martin G, Garmendia C, Salas M. 1989. Highly efficient DNA synthesis by the phage φ29 DNA polymerase: symmetrical mode of DNA replication. J. Biol. Chem. 264:8935–40 6. Chapman AR, He Z, Lu S, Yong J, Tan L, et al. 2015. Single cell transcriptome amplification with MALBAC. PLOS ONE 10:e0120889 7. Cheung VG, Nelson SF. 1996. Whole genome amplification using a degenerate oligonucleotide primer allows hundreds of genotypes to be performed on less than one nanogram of genomic DNA. PNAS 93:14676–79

www.annualreviews.org • Single-Cell Whole-Genome Amplification and Sequencing 99 GG16CH04-Xie ARI 21 July 2015 11:16

8. Coop G, Przeworski M. 2007. An evolutionary view of human recombination. Nat. Rev. Genet. 8:23–34 9. Dago AE, Stepansky A, Carlsson A, Luttgen M, Kendall J, et al. 2014. Rapid phenotypic and genomic change in response to therapeutic pressure in prostate cancer inferred by high content analysis of single circulating tumor cells. PLOS ONE 9:e101777 10. Daina G, Ramos L, Obradors A, Rius M, Martinez-Pasarell O, et al. 2013. First successful double-factor PGD for Lynch syndrome: monogenic analysis and comprehensive aneuploidy screening. Clin. Genet. 84:70–73 11. de Bourcy CF, De Vlaminck I, Kanbar JN, Wang J, Gawad C, Quake SR. 2014. A quantitative comparison of single-cell whole genome amplification methods. PLOS ONE 9:e105585 12. Dean FB, Nelson JR, Giesler TL, Lasken RS. 2001. Rapid amplification of plasmid and phage DNA using phi29 DNA polymerase and multiply-primed rolling circle amplification. Genome Res. 11:1095–99 13. Epstein CJ. 2007. The Consequences of Chromosome Imbalance: Principles, Mechanisms, and Models. Cambridge, UK: Cambridge Univ. Press 14. Fan HC, Wang J, Potanina A, Quake SR. 2011. Whole-genome molecular haplotyping of single cells. Nat. Biotechnol. 29:51–57 15. Francis JM, Zhang CZ, Maire CL, Jung J, Manzo VE, et al. 2014. EGFR variant heterogeneity in glioblastoma resolved through single-nucleus sequencing. Cancer Discov. 4:956–971 16. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, et al. 2007. A second generation human haplotype map of over 3.1 million SNPs. Nature 449:851–61 17. Garmendia C, Bernad A, Esteban JA, Blanco L, Salas M. 1992. The bacteriophage φ29 DNA polymerase, a proofreading . J. Biol. Chem. 267:2594–99 18. Gawad C, Koh W, Quake SR. 2014. Dissecting the clonal origins of childhood acute lymphoblastic leukemia by single-cell genomics. PNAS 111:17947–52 19. Guo H, Zhu P, Wu X, Li X, Wen L, Tang F. 2013. Single-cell methylome landscapes of mouse embryonic stem cells and early embryos analyzed using reduced representation bisulfite sequencing. Genome Res. 23:2126–35 20. Harris TD, Buzby PR, Babcock H, Beer E, Bowers J, et al. 2008. Single-molecule DNA sequencing of a viral genome. Science 320:106–9 21. Hastings PJ, Lupski JR, Rosenberg SM, Ira G. 2009. Mechanisms of change in gene copy number. Nat. Rev. Genet. 10:551–64 22. Heger M. 2014. Peking University reports first success in trial of single-cell sequencing for PGD. GenomeWeb,Oct.1.http://www.genomeweb.com/sequencing/peking-university-reports-first- success-trial-single-cell-sequencing-pgd 23. Hinch AG, Tandon A, Patterson N, Song Y, Rohland N, et al. 2011. The landscape of recombination in African Americans. Nature 476:170–75 24. Hou Y, Fan W, Yan L, Li R, Lian Y, et al. 2013. Genome analyses of single human oocytes. Cell 155:1492– 506 25. Hou Y, Song L, Zhu P, Zhang B, Tao Y, et al. 2012. Single-cell exome sequencing and monoclonal Access provided by Harvard University on 09/01/15. For personal use only. evolution of a JAK2-negative myeloproliferative neoplasm. Cell 148:873–85 26. Huang J, Yan L, Fan W, Zhao N, Zhang Y, et al. 2014. Validation of multiple annealing and looping- based amplification cycle sequencing for 24-chromosome aneuploidy screening of cleavage-stage embryos. Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org Fertil. Steril. 102:1685–91 27. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, et al. 2004. Detection of large-scale variation in the human genome. Nat. Genet. 36:949–51 28. Ingles J, Doolan A, Chiu C, Seidman J, Seidman C, Semsarian C. 2005. Compound and double mutations in patients with hypertrophic cardiomyopathy: implications for genetic testing and counselling. J. Med. Genet. 42:e59 29. Int. HapMap Consort. 2005. A haplotype map of the human genome. Nature 437:1299–320 30. Jeffreys AJ, Neumann R. 2002. Reciprocal crossover asymmetry and meiotic drive in a human recombi- nation hot spot. Nat. Genet. 31:267–71 31. Kalisky T, Quake SR. 2011. Single-cell genomics. Nat. Methods 8:311–14 32. Kirkness EF, Grindberg RV, Yee-Greenbaum J, Marshall CR, Scherer SW, et al. 2013. Sequencing of isolated sperm cells for direct haplotyping of a human genome. Genome Res. 23:826–32

100 Huang et al. GG16CH04-Xie ARI 21 July 2015 11:16

33. Kitzman JO, MacKenzie AP, Adey A, Hiatt JB, Patwardhan RP, et al. 2011. Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nat. Biotechnol. 29:59–63 34. Kong A, Thorleifsson G, Gudbjartsson DF, Masson G, Sigurdsson A, et al. 2010. Fine-scale recombination rate differences between sexes, populations and individuals. Nature 467:1099–103 35. Lasken RS. 2012. Genomic sequencing of uncultured microorganisms from single cells. Nat. Rev. Microbiol. 10:631–40 36. Lasken RS. 2013. Single-cell sequencing in its prime. Nat. Biotechnol. 31:211–12 37. Lasken RS, Egholm M. 2003. Whole genome amplification: abundant supplies of DNA from precious samples or clinical specimens. Trends Biotechnol. 21:531–35 38. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, et al. 2007. The diploid genome sequence of an individual human. PLOS Biol. 5:e254 39. Lohr JG, Adalsteinsson VA, Cibulskis K, Choudhury AD, Rosenberg M, et al. 2014. Whole-exome sequencing of circulating tumor cells provides a window into metastatic prostate cancer. Nat. Biotechnol. 32:479–84 40. Lu S, Zong C, Fan W, Yang M, Li J, et al. 2012. Probing meiotic recombination and aneuploidy of single sperm cells by whole-genome sequencing. Science 338:1627–30 41. Ma L, Xiao Y, Huang H, Wang Q, Rao W, et al. 2010. Direct determination of molecular haplotypes by chromosome microdissection. Nat. Methods 7:299 42. MacArthur D. 2012. Methods: face up to false positives. Nature 487:427–28 43. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. 2005. Genome sequencing in microfab- ricated high-density picolitre reactors. Nature 437:376–80 44. Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, et al. 2004. Genetic analysis of genome-wide variation in human . Nature 430:743–47 45. Mueller E, Brueck C. 2015. Whole genome amplification for single cell biology. Tech. Doc., Sigma-Aldrich, St. Louis, MO. http://www.sigmaaldrich.com/technical-documents/articles/life-science-innovations/ whole-genome-amplification.html 46. Nat. Methods Eds. 2014. Method of the Year 2013. Nat. Methods 11:1 47. Navin NE. 2014. Cancer genomics: one cell at a time. Genome Biol. 15:452 48. Navin NE, Kendall J, Troge J, Andrews P, Rodgers L, et al. 2011. Tumour evolution inferred by single-cell sequencing. Nature 472:90–94 49. Ni X, Zhuo M, Su Z, Duan J, Gao Y, et al. 2013. Reproducible copy number variation patterns among single circulating tumor cells of lung cancer patients. PNAS 110:21083–88 50. Peters BA, Kermani BG, Sparks AB, Alferov O, Hong P, et al. 2012. Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature 487:190–95 51. Pugh TJ, Delaney AD, Farnoud N, Flibotte S, Griffith M, et al. 2008. Impact of whole genome amplifi- cation on analysis of copy number variants. Nucleic Acids Res. 36:e80 52. Rubio C, Rodrigo L, Mir P, Mateu E, Peinado V, et al. 2013. Use of array comparative genomic hy- bridization (array-CGH) for embryo assessment: clinical results. Fertil. Steril. 99:1044–48 Access provided by Harvard University on 09/01/15. For personal use only. 53. Ruiz C, Li J, Luttgen MS, Kolatkar A, Kendall JT, et al. 2015. Limited genomic heterogeneity of circu- lating melanoma cells in advanced stage patients. Phys. Biol. 12:016008 54. Saiki RK, Gelfand DH, Stoffel S, Scharf SJ, Higuchi R, et al. 1988. Primer-directed enzymatic amplifi- Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org cation of DNA with a thermostable DNA polymerase. Science 239:487–91 55. Sandberg R. 2014. Entering the era of single-cell transcriptomics in biology and medicine. Nat. Methods 11:22–24 56. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, et al. 2004. Large-scale copy number polymorphism in the human genome. Science 305:525–28 57. Shapiro E, Biezuner T, Linnarsson S. 2013. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat. Rev. Genet. 14:619–30 58. Shen J, Cram DS, Wu W, Cai L, Yang X, et al. 2013. Successful PGD for late infantile neuronal ceroid lipofuscinosis achieved by combined chromosome and TPP1 gene analysis. Reprod. BioMed. Online 27:176– 83 59. Suk EK, McEwen GK, Duitama J, Nowick K, Schulz S, et al. 2011. A comprehensively molecular haplotype-resolved genome of a European individual. Genome Res. 21:1672–85

www.annualreviews.org • Single-Cell Whole-Genome Amplification and Sequencing 101 GG16CH04-Xie ARI 21 July 2015 11:16

60. Tang F, Lao K, Surani MA. 2011. Development and applications of single-cell transcriptome analysis. Nat. Methods 8(Suppl.):S6–11 61. Telenius H, Carter NP, Bebb CE, Nordenskjo M, Ponder BA, Tunnacliffe A. 1992. Degenerate oligonucleotide-primed PCR: general amplification of target DNA by a single degenerate primer. Genomics 13:718–25 62. Tewhey R, Bansal V, Torkamani A, Topol EJ, Schork NJ. 2011. The importance of phase information for human genomics. Nat. Rev. Genet. 12:215–23 63. Tobler KJ, Brezina PR, Benner AT, Du L, Xu X, Kearns WG. 2014. Two different microarray tech- nologies for preimplantation genetic diagnosis and screening, due to reciprocal translocation imbalances, demonstrate equivalent euploidy and clinical pregnancy rates. J. Assist. Reprod. Genet. 31:843–50 64. Treff NR, Fedick A, Tao X, Devkota B, Taylor D, Scott RT. 2013. Evaluation of targeted next-generation sequencing–based preimplantation genetic diagnosis of monogenic disease. Fertil. Steril. 99:1377–84 65. Treff NR, Tao X, Ferry KM, Su J, Taylor D, Scott RT. 2012. Development and validation of an accu- rate quantitative real-time polymerase chain reaction–based assay for human blastocyst comprehensive chromosomal aneuploidy screening. Fertil. Steril. 97:819–24 66. Van Driest SL, Vasile VC, Ommen SR, Will ML, Tajik AJ, et al. 2004. Myosin binding protein C mutations and compound heterozygosity in hypertrophic cardiomyopathy. J. Am. Coll. Cardiol. 44:1903– 10 67. Vogelstein B, Kinzler KW. 2004. Cancer genes and the pathways they control. Nat. Med. 10:789–99 68. Wang J, Fan HC, Behr B, Quake SR. 2012. Genome-wide single-cell analysis of recombination activity and de novo mutation rates in human sperm. Cell 150:402–12 69. Wang J, Wang W, Li R, Li Y, Tian G, et al. 2008. The diploid genome sequence of an Asian individual. Nature 456:60–65 70. Wang Y, Waters J, Leung ML, Unruh A, Roh W, et al. 2014. Clonal evolution in breast cancer revealed by single nucleus genome sequencing. Nature 512:155–60 71. Wells D, Kaur K, Grifo J, Glassner M, Taylor JC, et al. 2014. Clinical utilisation of a rapid low-pass whole genome sequencing technique for the diagnosis of aneuploidy in human embryos prior to implantation. J. Med. Genet. 51:553–62 72. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, et al. 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452:872–76 73. Xu X, Hou Y, Yin X, Bao L, Tang A, et al. 2012. Single-cell exome sequencing reveals single-nucleotide mutation characteristics of a kidney tumor. Cell 148:886–95 74. Yang H, Chen X, Wong WH. 2011. Completely phased genome sequencing through chromosome sorting. PNAS 108:12–17 75. Yu C, Yu J, Yao X, Wu WK, Lu Y, et al. 2014. Discovery of biclonal origin and a novel oncogene SLC12A5 in colon cancer by single-cell sequencing. Cell Res. 24:701–12 76. Zhang K, Zhu J, Shendure J, Porreca GJ, Aach JD, et al. 2006. Long-range polony haplotyping of individual

Access provided by Harvard University on 09/01/15. For personal use only. human chromosome molecules. Nat. Genet. 38:382–87 77. Zhang L, Cui X, Schmitt K, Hubert R, Navidi W, Arnheim N. 1992. Whole genome amplification from a single cell: implications for genetic analysis. PNAS 89:5847–51

Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org 78. Zong C, Lu S, Chapman AR, Xie XS. 2012. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science 338:1622–26

102 Huang et al. GG16-FrontMatter ARI 22 June 2015 17:20

Annual Review of Genomics and Human Genetics Contents Volume 16, 2015

A Mathematician’s Odyssey Walter Bodmer ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp1 Lessons from modENCODE James B. Brown and Susan E. Celniker pppppppppppppppppppppppppppppppppppppppppppppppppppppp31 Non-CG Methylation in the Human Genome Yupeng He and Joseph R. Ecker pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp55 Single-Cell Whole-Genome Amplification and Sequencing: Methodology and Applications Lei Huang, Fei Ma, Alec Chapman, Sijia Lu, and Xiaoliang Sunney Xie pppppppppppppppp79 Unraveling the Tangled Skein: The Evolution of Transcriptional Regulatory Networks in Development Mark Rebeiz, Nipam H. Patel, and Veronica F. Hinman pppppppppppppppppppppppppppppppp103 Alignment of Next-Generation Sequencing Reads Knut Reinert, Ben Langmead, David Weese, and Dirk J. Evers ppppppppppppppppppppppppp133 The Theory and Practice of Genome Sequence Assembly Jared T. Simpson and Mihai Pop ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp153 Addressing the Genetics of Human Mental Health Disorders in Model Organisms Access provided by Harvard University on 09/01/15. For personal use only. Jasmine M. McCammon and Hazel Sive pppppppppppppppppppppppppppppppppppppppppppppppppp173 Advances in Skeletal Dysplasia Genetics Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org Krista A. Geister and Sally A. Camper pppppppppppppppppppppppppppppppppppppppppppppppppppp199 The Genetics of Soft Connective Tissue Disorders Olivier Vanakker, Bert Callewaert, Fransiska Malfait, and Paul Coucke ppppppppppppppp229 Neurodegeneration with Brain Iron Accumulation: Genetic Diversity and Pathophysiological Mechanisms Esther Meyer, Manju A. Kurian, and Susan J. Hayflick ppppppppppppppppppppppppppppppppp257 The Pathogenesis and Therapy of Muscular Dystrophies Simon Guiraud, Annemieke Aartsma-Rus, Natassia M. Vieira, Kay E. Davies, Gert-Jan B. van Ommen, and Louis M. Kunkel ppppppppppppppppppppppppppppppppppppppp281

v GG16-FrontMatter ARI 22 June 2015 17:20

Detection of Chromosomal Aberrations in Clinical Practice: From Karyotype to Genome Sequence Christa Lese Martin and Dorothy Warburton ppppppppppppppppppppppppppppppppppppppppppppp309 Mendelian Randomization: New Applications in the Coming Age of Hypothesis-Free Causality David M. Evans and George Davey Smith pppppppppppppppppppppppppppppppppppppppppppppppp327 Eugenics and Involuntary Sterilization: 1907–2015 Philip R. Reilly ppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp351 Noninvasive Prenatal Genetic Testing: Current and Emerging Ethical, Legal, and Social Issues Mollie A. Minear, Stephanie Alessi, Megan Allyse, Marsha Michie, and Subhashini Chandrasekharan ppppppppppppppppppppppppppppppppppppppppppppppppppppppp369

Errata

An online log of corrections to Annual Review of Genomics and Human Genetics articles may be found at http://www.annualreviews.org/errata/genom Access provided by Harvard University on 09/01/15. For personal use only. Annu. Rev. Genom. Human Genet. 2015.16:79-102. Downloaded from www.annualreviews.org

vi Contents