Genome-Wide Nucleosome Map and Cytosine Methylation Levels of an Ancient Human Genome
Total Page:16
File Type:pdf, Size:1020Kb
Genome-wide Nucleosome Map and Cytosine Methylation Levels of an Ancient Human Genome Jakob Skou Pedersen, Eivind Valen, Amhed M. Vargas Velazquez, Brian J. Parker, Morten Rasmussen, Stinus Lindgreen, Berit Lilje, Desmond J Tobin, Theresa K. Kelly, Søren Vang, Robin Andersson, Peter A. Jones, Cindi A. Hoover, Alexei Tikhonov, Egor Prokhortchouk, Edward M. Rubin, Albin Sandelin, M. Thomas P. Gilbert, Anders Krogh, Eske Willerslev, Ludovic Orlando. Table of Contents Section SI1. Sequence datasets 3-8 1.1. Saqqaq Palaeo-eskimo sequence dataset 3 1.2. Control genome dataset 4 1.3. Ancient Aborigine sequence dataset 4 1.4. Ancient horse sequence dataset 5 1.5. Ancient polar bear sequence dataset 6 1.6. Modern human hair sequence dataset 6 1.7. Size distributions 6 Figure S1.1 8 Section SI2. Nucleosomes 9-25 2.1. Read depth 9 Table S2.1 9 Table S2.2 9 2.2. Control of mapping and sequencing biases 10 2.3. Correlation with other nucleosome data and predictions 10 Figure S2.1 12 2.4. Nucleosome occupancy at anchor sites 13 Figure S2.2 13 2.5. Fourier transform 13 Figure S2.3 14 Figure S2.4 15 Figure S2.5 16 Figure S2.6 17 Figure S2.7 17 2.6. Phasograms 17 Figure S2.8 18 2.7. Genome-wide map of nucleosomes 19 Figure S2.9 19 2.8. Estimating false positives 20 Figure S2.10 21 2.9. Nucleotide and di-nucleotide distributions across nucleosomes 22 Figure S2.11 23 Figure S2.12 24 Figure S2.13 25 Section SI3. Methylation footprint of the Saqqaq genome 26-60 3.1. Background 26 Figure S3.1 26 3.2. Nucleotide mis-incorporation patterns 26 Figure S3.2 28 3.3. Quantifying ancient regional methylation levels 29 Figure S3.3 30 ! "! Figure S3.4 31 Figure S3.5 32 3.4. Methylation-based unsupervised hierarchical clustering 33 Table S3.1 34 Table S3.2 35 Figure S3.6 37 Figure S3.7 37 Figure S3.8 37 Figure S3.9 38 Figure S3.10 39 Figure S3.11 40 Figure S3.12 41 Figure S3.13 42 3.5. Promoter and first exon methylation levels 43 Table S3.3 44 Table S3.4 46 Table S3.5 49 Table S3.6 54 3.6. Predicting the Saqqaq age at death using age-dependent CpG 57 signatures Figure S3.14 57 Figure S3.15 58 3.7. Methylation profiles across nucleosomes 59 Table S3.7 60 Section SI4. Expression analyses 61-89 Table S4.1 64 Table S4.2 68 Table S4.3 79 Section SI5. References 90-93 ! #! Section SI1. Sequence datasets Most of the analyses were performed on previously released sequence data sets, consisting of: 1) the first complete ancient human genome [Rasmussen et al. 2010] (section SI1.1); 2) a modern control data set based on a panel of modern samples [Green et al. 2010, Reich et al. 2010] (section SI1.2). 3) the first ancient Aborigine genome [Rasmussen et al. 2011] (section SI1.3); and 4) genomic reads of a 110,000-130,000 year-old polar bear [Miller et al. 2012] (section SI1.5). In addition, we generated genomic reads from an ancient horse dating back to ca. 4.5 ka (section SI1.4) and from modern human hairs (section SI1.6). Those were analyzed to check for the specificity of the patterns observed in hairs and their possible extension to other types of ancient samples, such as calcified bones that represent the vast majority of fossil remains excavated. !"!"#$%&&%&#'%(%)*+),-./*#,)&0)12)#3%4%,)4# The sequence data have been originally generated as part of our characterization of the Saqqaq Palaeo-Eskimo genome [Rasmussen et al. 2010]. Overall, ancient DNA extracts from a 4,000-year-old permafrost-preserved hair were built into Illumina libraries and deep-sequenced at about a 20X-coverage on a GAIIx sequencing platform and Single End sequencing. The vast majority of the sequencing reads (242 lanes out of a total of 245, corresponding to ca. 3.5 billions of reads) were recovered following library amplification with Phusion DNA polymerase (Finnzymes, hereafter referred to as Phusion). A total of 3 GAIIx lanes, however, were generated following library amplification with a Taq platinum high-fidelity DNA polymerase (Life technologies, hereafter referred to as Hifi) and resulted in a total number of 50.1 millions of reads [Ginolhac et al. 2011]. Contrarily to Hifi, Phusion has been demonstrated to show poor activity at uracil residues (Figure S3.1) [Fogg et al. 2002]. The latter are generated following post-mortem deamination of cytosine residues, mainly at single-stranded overhanging ends. As chemical analogs of thymines, Uracils yield to the mis-incorporation of AT base pairs (instead of GC) during library preparation and amplification, resulting in an excess of GC!AT mismatches when ancient genomes are compared to modern reference genomes [Stiller et al. 2006; Brotherton et al. 2007; Gilbert et al. 2007; Briggs et al. 2007]. Thus, that Phusion, rather than Hifi, has been used for generating the first complete sequence of an ancient genome greatly contributed to the overall quality of the genome by limiting the impact of damage-driven misincorporations in downstream analyses [Rasmussen et al. 2010]. For the nucleosome analysis the original set of 999.3 millions of reads was used, as defined in [Rasmussen et al. 2010]. For the methylation analysis, a mapping procedure allowing for indels was followed as described below (this was performed because the alignments available from the original study were performed with SESAM and were indel-free; [Rasmussen et al. 2010]). Read ends were trimmed for potentially residual adapter sequences, and after the first base showing the poorest quality score (C) using the software AdapterRemoval available for download at http://code.google.com/p/adapterremoval/ [Lindgren 2012]. Trimmed reads were subsequently mapped against nuclear human genome reference (hg18) using bwa 0.5.9 [Li and Durbin 2010] and default values for optional parameters. Unmapped sequences, together with the high-quality reads previously identified, were remapped against the available mitochondrial genome sequence of this individual [Gilbert et al. 2008] using similar parameters; in addition, sequencing reads with a minimal length of 25 nucleotides, a minimal mapping quality of 30, and showing no alternative hits ! $! were further selected using a combination of the samtools suite and awk command lines. Finally, sequence duplicates were collapsed using the sort and rmdup commands available in SAMtools [Li and Durbin 2010], resulting in a final BAM file of unique mitochondrial reads that was used for some of the analyses presented below aiming at detecting cytosine methylation-driven nucleotide mis-incorporation patterns (section SI3). The final dataset consisted of a final number of 696.0 millions and 26.3 millions of non-clonal unique sequences originating from Phusion and Taq Platinum high fidelity (Hifi) library amplification, respectively. !"5"#6*147*(#8)1*/)#3%4%,)4# We constructed a data set with the same properties as the Saqqaq data set in terms of number of reads and read length distribution, except reads were randomly selected (and truncated to match the Saqqaq size distribution) from a panel of sequencing runs from modern genomes. This data set allowed us to (1) evaluate if the Saqqaq data set observations could be ascribed to mapping biases and also to (2) more generally represent a null set to evaluate the significance of the statistical measurements made during the analysis. To construct this set, we used the combined set of modern genomes sequenced at low read depth as part of the Neanderthal and Desinova projects, which were downloaded from UCSC [Green et al. 2010, Reich et al. 2010]. All samples originate from lymphoblastoid cell lines of the Human Genome Diversity Project (HGDP). A main criterion for the data sets used for the Control was that they were made using the same technology, but based on modern samples that were not degraded or fragmented. All reads, including unmapped, were extracted from the downloaded bam files and converted to fastq format. These were then shuffled to avoid any ordering relative to genomic coordinates. We then constructed a data set with the same length distribution as the Saqqaq, by truncating individual modern reads to the appropriate size, though we initially included about 1.2 times more reads than present in the Saqqaq data set. We then mapped each fragment size separately to hg18 using BWA [Li et al. 2010]. For each fragment-size set of unsorted mapped reads, we then extracted the exact same number of reads of the given size as present in the Saqqaq data set. Finally, all the reads were combined to make our final Control data set. !"9"#:12.)14#:;*7.8.1)#,)&0)12)#3%4%,)4# A total number of 311.0 millions of Illumina reads recovered from a ca. 100-years old hair sample originating an ancient Aborigine individual [Rasmussen et al. 2011]. These reads were generated following a procedure similar to the one used for characterizing the complete genome of the Saqqaq individual, except that DNA libraries were amplified using Taq Gold DNA polymerase. The initial unprocessed data set consisted of 2.1 billions of reads (100 nucleotides long), which was reduced by 85.2% following index filtering, adaptor trimming, and selection of uniquely mapping and non-clonal reads. These span 60% of the human genome with an average read depth of 11x (6.4x overall). The average length of endogenous trimmed and unique reads consisted of 64 nucleotides and mapping was done using BWA and standard parameters. ! %! # !"<"#:12.)14#=*7,)#,)&0)12)#3%4%,)4# The ancient horse sample was provided by one of the co-authors (Dr.