<<

Genomic sequencing and genealogical analysis of DNA from a prehistoric dog

Pontus Skoglund

Degree project in biology, Master of science (2 years), 2009 Examensarbete i biologi 45 hp till masterexamen, 2009 Biology Education Centre and Department of Evolutionary Biology, Uppsala University Supervisor: Dr. Anders Götherström

Cover illustration by Tobias Skoglund, after Henri Breuil’s depiction of a ~14 000 year old carving in the Font de Gaume (). In: Capitan L, Breuil H, Peyrony D, (1910) La Caverne de Font-de-Gaume aux Eyzies (Dordogne), Monaco, pl. XXXVII Abstract

Next generation sequencing stands as one of the most promising advances in the field of paleogenetics. In this study, high-throughput shotgun sequencing of amplified metagenomic libraries was used to generate low-coverage genome data from an approximately 5000 year old dog specimen from Ajvide (Gotland, Sweden). I show that a genealogical approach can be used to estimate the split time between an ancient population and two current populations for which medium- to high-coverage genome assemblies are available, without the need for assumptions on mutation rate. Applied to the presented data, the analysis indicates that there was extensive structure between the Scandinavian dog population and the ancestors of modern boxers and poodles. This observation is best explained either by an early Upper origin of domestic dogs, multiple events, or extensive backcrossing with their lupine progenitors. The study shows that as reference genomic data from extant populations accumulate, multilocus data generated from a single -preserved specimen analyzed using this approach can provide powerful insights into prehistoric demography and evolutionary history.

Abbreviations

SNP Single Nucleotide Polymorphism LD Linkage Disequilibrium kya Kilo years ago mya Million years ago BP Before Present DNA Deoxyribonucleic acid aDNA Ancient DNA mtDNA Mitochondrial DNA nuDNA Nuclear DNA rRNA Ribosomal RNA PWC Pitted Ware Culture TRB Funnel beaker culture MRCA Most Recent Common Ancestor TMRCA Time to Most Recent Common Ancestor PCR Polymerase Chain Reaction emPCR Emulsion PCR qPCR Quantitative PCR WGA Whole Genome Amplification SBS Sequencing by Synthesis cM Centi Morgan bp Basepairs kb Kilo basepairs BLAST Basic Local Alignment Search Tool Indel insertion/deletion polymorphism

2

Table of Contents

Introduction ...... 4 Origin and evolution of the domestic dog ...... 4 The archaeological record ...... 4 Genetic studies ...... 6 Canine genomics ...... 7 Dogs in Scandinavian ...... 7 Study sites ...... 8 Genetic analysis of ancient DNA ...... 9 Authenticity and retrieval ...... 9 Second generation sequencing and paleogenomics ...... 10 The 454 sequencing platform ...... 11 Illumina sequencing ...... 12 DNA diagenesis ...... 13 Amplification of ancient DNA ...... 14 Theoretical background ...... 15 The coalescent ...... 15 Materials and Methods ...... 18 Molecular methods ...... 18 Ancient DNA precautions ...... 19 DNA Extraction ...... 19 PCR amplification ...... 20 Contamination assay ...... 20 Metagenomic amplification ...... 21 454 and Illumina sequencing...... 23 Data analysis ...... 23 Sequence processing ...... 23 Metagenomic analysis ...... 24 Evolutionary analysis ...... 24 Results ...... 26 Optimization of the metagenomic amplification protocol ...... 26 Sequencing results ...... 28 The Neolithic dog ...... 28 The Sima de los Huesos ...... 28 Evolutionary analysis ...... 30 Discussion ...... 34 Metagenomic amplification of ancient DNA ...... 34 Authenticity of canine sequences ...... 36 Sequence analysis ...... 36 Evolutionary analysis ...... 38 Conclusions ...... 42 Acknowledgements ...... 43 References ...... 44

3

Introduction

Genetic data from multiple time points in the history of extant and extinct populations can provide information about demography and evolutionary history that may be hard to gain with modern samples alone (Hofreiter et al. 2001; Pääbo et al. 2004). However, the technical difficulties involved with obtaining 'ancient' DNA from archaeological samples have for long precluded large-scale analysis of genomic data from serial samples (Millar et al. 2008). Lately, advances in sequencing technology have together with the availability of genomic reference sequence from humans and other primates enabled population genetic analysis of multilocus sequence data from Neandertal humans (Noonan et al. 2006; Green et al. 2006; but see Wall & Kim 2007), which provided novel insights into hominid evolutionary history. As more genomic reference data becomes available, studies of this magnitude may also be extended to non-model organisms, but there is a lack of a clear paradigmatical approach for generating and analyzing paleogenomic data (Wall & Kim 2007; Hofreiter 2008; Millar et al. 2008).

The aim of this study was to evaluate and implement methods to obtain and analyze nuclear genetic data from prehistoric dogs (Canis lupus familiaris) in a manner that makes efficient use of the amount of template molecules in archaeological samples and allows for statistical testing of explicit demographic models which is insensitive to the unique problems associated with ancient DNA. Informative genetic data from a time before the bottlenecks associated with breed creation could potentially provide crucial insights into the early demographic history of the species, and help to resolve questions regarding their origin and domestication. For instance, while it is clear that the wild progenitors of domestic dogs were Eurasian grey wolves (Canis lupus lupus) (Vilà et al. 1997), the manner and date of domestication remains controversial (Wayne & Ostrander 2007), with archaeological (Germonpré et al. 2009), phylogenetic (Savolainen et al. 2002) and population genetic (Lindblad-Toh et al. 2005) estimates ranging between 8 000 - 40 000 years ago. As an emerging model species in genetic and biomedical research (Karlsson & Lindblad-Toh 2008), studies of dog evolution and demographic history are also becoming increasingly important (e.g. Gray et al. 2009), as a proper demographic null model is a prerequisite for conducting association mapping studies of genetic traits and diseases.

Origin and evolution of the domestic dog: the archaeological record

Though wolf remains have been found in association with archaic European hominids as far back as 400 000 BP (Clutton-Brock 1995) and in 300 000 BP (Olsen 1985), domestication is widely considered to have occurred at a much later date, and instigated by anatomically modern humans (Homo sapiens). Owing to the close relationship between wolves and prehistoric dogs, canid archaeological remains can be difficult to place taxonomically. Prehistoric dogs are generally considered to have been smaller than contemporary wolves but this is usually not sufficient to distinguish between domestic dogs, juvenile wolves or smaller subspecies of wolves (Clutton- Brock 1995). However, joint analysis of multiple morphological characters, such as the

4 observation of a snout that is shorter and wider at the base in domestic dogs than wolves, can allow taxonomic identification of even partial remains (Germonpré et al. 2009).

Until recently, the oldest remains of canids with clear dog-like features had been found in Siberia and dated to ~14 000 years ago (Sablin & Kchlopachev 2002, 2003). However, Germonpré and others (2009) classified a skull (Fig. 1) from the Goyet cave in France, with an estimated age of ~31 000 years, as hailing from a domestic dog, making it the oldest suggested canine so far. Their discovery implies that domestication was already underway in an Upper Paleolithic culture such as the , but the mysterious ~17 000 year gap in the record of dogs seems difficult to explain. While no remains from this era have been recovered, the gap is possibly interrupted by two curious trails of footprints from the Chauvet painted cave in France, which together with accompanying torch swipes suggest that a human child and a large canid seem to have navigated the tunnels side-by-side ~25 000 years ago (Garcia 2005).

Fig. 1 Dorsal view of the dog and wolf skulls described by Germonpré and coworkers (2009). To the left is the Goyet canid (a) dated to ~31 000 years ago. The other two skulls are from prehistoric gray wolves (b, c) (from Germonpré et al. 2009).

5

Upper Paleolithic finds of dog-like canids younger than 14 000 years have been reported from several sites in Western (Nobis 1979; Musil 1984 [reviewed by Clutton-Brock 1995]; Benecke 1987; Germonpré et al. 2009) which coincide with a cultural change in the hunting strategy of Paleolithic humans to using primitive bow and over stone (Clutton-Brock 1995). In Palestine, a ~12 000 year old grave containing remains from a human and a juvenile canine (Davis & Valla 1978) provides together with several younger finds from the pre-agricultural human culture known as the Natufian some of the strongest pieces of archaeological evidence for a Paleolithic domestication to date (Dayan 1994; Clutton-Brock 1995). In Scandinavia, the earliest documented presence of dogs date to the Mesolithic, 9000 years ago (Arnesson-Westerdahl 1983).

Outside of Eurasia, the earliest dog remains date to 6500 - 8500 BP, and are found in southern and Alaska (Olsen 1985). Together with genetic evidence that paleoindian dogs trace their ancestry to Eurasia rather than being the product of an independent domestication event (Leonard et al. 2002), these finds suggest that dogs were able to spread rapidly with humans. More importantly, the gathering evidence for a human colonization of the Americas earlier than 15 000 years ago (e.g. Gilbert et al. 2008) suggests that if dogs did indeed accompany the very first expansion (Fiedel 2005) domestication must have occurred prior to this date.

Genetic studies

A pioneering study in dog evolution by Vilà and colleagues (1997) not only identified the Eurasian grey wolf as the sole canid ancestor of the domestic dog, but also provided a framework for future studies in the form of a phylogeny of the mitochondrial control region from a diverse group of dog breeds and wolf populations. The dog mtDNA sequences in this dataset fell into four major clades, interspersed with wolf sequences (Vilà et al. 1997). Such a pattern can be accounted for by either post-domestication introgression from female wolves, or insufficient time since divergence for complete lineage sorting to occur. The major haplogroup showed levels of genetic variation that suggested to the authors a Middle divergence from wolves, an origin much older than the archaeological record indicated (Vilà et al 1997). A later phylogeographic study extended the number of phylogenetic clades to seven and observed a higher degree of mtDNA sequence diversity in East Asian dogs, but estimated a single origin of the global dog population just 15 000 years ago (Savolainen et al. 2002).

However, due to the inherent variability of genealogies evolving in a population, studies of single locus datasets have only limited power to infer demographic history (Nielsen & Beaumont 2009). Indeed, paleogenetic data have shown that even very recent human activities have changed the phylogeographic distribution of mtDNA variation in dog populations significantly (Leonard et al. 2002; Malmström et al. 2008; Deguilloux et al. 2009; Skoglund 2009; Castroviejo-Fisher S, Skoglund P, Vilà C & Leonard J.A, unpublished data), which casts further doubt on how accurate current phylogeographic patterns depict population history.

6

Canine genomics

In their analysis of genome-wide patterns of linkage disequilibrium across several breeds, Lindblad-Toh and others (2005) found that the best fitting model posited an ancestral dog population domesticated approximately 9000 generations ago, with an effective population size (Ne) of 13 000 and a modest degree of inbreeding (F = 0.12). They hypothesized that this pre-breed population had short haplotype blocks, extending roughly 10 kb (compared to ~20 kb in the biologically younger population of modern humans) with 4-5 distinct haplotypes in each region. Lindblad-Toh and others (2005) also noted that their model does not exclude a more complex population history with several domestication events (Vilà et al. 1997; Savolainen et al. 2002) and/or a low rate of continual gene flow from grey wolves (Vilà et al 2003; Vilà et al. 2005). Widespread breed creation might have been initialized as late as 30- 90 generations ago, and while imposing quite severe bottlenecks on the genetic variation of many of the resulting breeds, it did not cause rampant fixation of ancestral haplotypes (Lindblad-Toh et al. 2005). Most modern dog breeds show a biphasic decay of linkage disequilibrium that is likely due to the two major bottlenecks in their demographic history, with the long-range LD reflecting the prehistoric domestication process and the short-range pattern being a result of breed creation (Sutter et al. 2004; Lindblad-Toh et al. 2005; Karlsson et al. 2007).

Prior to the publication of the dog genome, a study using 96 microsatellites was able to identify genetic structure in 85 breeds (Parker et al. 2004). In both clustering and phylogenetic analyses, a group of putatively ancient breeds were identified which were distinct from all other breeds surveyed and displayed a closer evolutionary relationship with wolves (Sutter & Ostrander 2004). The breeds in this group were from surprisingly disparate geographical locations, with Chinese Spitz-type breeds having the closest genetic similarity to wolves followed by the African Basenji, Alaskan Spitz-type breeds and Central Asian Afghan hound and Saluki (Parker et al. 2004). Even though Scandinavian Spitz-type dogs were only represented by the Norwegian Elkhound in this analysis, the clustering of this breed within the genetic diversity of other European breeds is in stark contrast to the inferences made based on the phylogenetic distribution of mitochondrial haplogroup D (Vilà et al. 1997; Savolainen et al. 2002), leaving the possibility of an independent lupine origin of the population of modern Scandinavian breeds unlikely. Several other supposedly 'ancient' breeds with a deep archaeological record also displayed evidence of a relatively recent genetic origin, prompting Parker and colleagues (2004) to suggest that their morphologies have been recreated in more recent times (first suggested by Vilà et al. 1999). In an analysis using fewer breeds but with the larger set of markers on the 27k Affymetrix canine SNP array, only the Chinese Akita and Shiba Inu breeds retained their basal position on the intraspecific phylogeny, whilst all the other breeds formed a relatively homogenous cluster (Karlsson et al. 2007).

Dogs in Scandinavian prehistory

The observation that subfossil remains of wolves and dogs have been found in close association and the observation of a high frequency of a certain mitochondrial

7 haplogroup in Scandinavia (Savolainen et al. 2002) was for a while the basis of a hypothesized independent domestication event in the region. Recent analyses of mitochondrial (Malmström et al. 2008) and Y-chromosomal markers (Girdland-Flink 2008) in prehistoric and medieval dogs have refuted that argument, but while a bona fide domestication of Northern European wolves might be unlikely, backcrossing between domestic dogs and their wild ancestor population is expected to have occurred several times since their initial divergence (Vilà et al. 2005; Anderson et al. 2009) and still occurs sporadically in sympatric regions (Randi & Lucchini 2002; Vilà et al. 2003).

While widespread in continental Europe much earlier, Scandinavia was reached by the cultural transition from a nomadic hunting lifestyle to a settled economy with agriculture and animal husbandry that is known as the relatively late (~6000 BP). The purveyors of this lifestyle into the Baltic region are thought to have been the people giving rise to the Funnelbeaker culture (TRB, from German Trichterbecherkultur), whose characteristic ceramics are found in several agricultural settlements in Southern Sweden from between 6000 and 3800 years ago. However, the appearance of the contrasting Pitted Ware culture (PWC) marked a of the mesolithic hunter-gatherer lifestyle in the region, perhaps driven by increased salinity in the Baltic which led to a higher abundance of fish and seals (Martinsson- Wallin 2008). The PWC communities were supported by husbandry but found their main sustenance by hunting terrestrial and marine mammals. Recent mtDNA analysis of skeletal remains from the Baltic island of Öland belonging to both these contemporary human cultures have suggested considerable structure and perhaps different origins of these populations (Linderholm 2008), but the vast historical implications of such migrations in the Neolithic calls for more evidence. Since large- scale multilocus population genetic studies of ancient human remains are compromised by the presence of modern contaminants, domestic animals of ancient human cultures, such as dogs, might serve as a proxy for population structure and migration patterns at the time.

Study sites

The island of Gotland is thought to have been exposed for the first time in the Holocene about 9500 years ago. The earliest human settlement has been attributed to ~9000 year old Mesolithic remains from the islet Stora Karlsö off the western coast, but it is believed that the majority of Neolithic settlements arose on the main island about 6000 years ago (Österholm 1989). The major Neolithic excavation sites in Ajvide, Västerbjers and Visby all display PWC identity and economy. Several studies of dog specimens found on these sites have yielded reproducible mtDNA (Malmström et al. 2005, 2007, 2008) and Y-chromosomal sequences (Girdland-Flink 2008) previously. Many of the samples from Ajvide have also been subjected to quantitative PCR analysis (Malmström et al 2005), showing an exceptionally high level of DNA preservation which is likely due to the limestone bedrock of the island (A. Götherström pers. comm.).

Also, to investigate the potential of the amplification method described below for

8 increasing the yield of DNA by shotgun sequencing extremely degraded samples, two over 500 000 year old (Bishkoff et al. 2007) samples from the Sima de los Huesos site in Sierra de Atapuerca, , were used (Arsuaga et al. 1993; Garcia et al. 1997). Previously, mitochondrial SNP markers from a cave bear specimen from this site were some of the oldest ancient DNA molecules ever to be analyzed and independently replicated (Valdiosera et al. 2006). The phylogenetic relationship between the Middle Pleistocene cave bear (Ursus deningeri) and the later Ursus spelaeus is also controversial (Garcia et al. 1997), a question which would benefit from more genetic data. Additionally, a sample obtained from a fossil of the Sima de los Huesos hominids () (Arsuaga et al. 1997) was analyzed. The role of H. heidelbergensis in Pleistocene hominid evolution is unclear, controversial and of obvious interest for the emergence of later species such as Homo sapiens and Homo neanderthalensis, both of which represent possible descendants of H. heidelbergensis (Arsuaga et al. 1993, 1997; Rightmire 1998).

Genetic analysis of ancient DNA

The study of DNA from postmortem tissues was born during a time when bacterial cloning was the sole means of working with the molecule (Pääbo 1985) and the advent of polymerase chain reaction (PCR) coupled with commercially available Sanger sequencing was seen as a major breakthrough that would propel the ancient DNA research field into an age of seemingly endless possibilities (Wayne et al. 1999; Pääbo et al. 2004). However, the stochastic nature of PCR amplification was soon discovered to be fraught with difficulties, making it necessary to sequence multiple bacterial clones from each PCR product to rule out contamination and artifact substitutions. Because mitochondrial DNA is present in approximately 150-200 copies for each nuclear molecule in ancient subfossil bones (Noonan et al. 2006; Poinar et al. 2006; Green et al. 2008) and possibly even higher levels in preserved hair shafts (Gilbert et al. 2007) mitochondrial sequences have been used extensively in ancient DNA research and to this day comprise the majority of well authenticated studies (Gilbert et al. 2005; Millar et al. 2008). However, current and future technological advancements are expected to increase the amount of published data from nuclear DNA (Millar et al. 2008), which will allow functional and population genetic studies from archaeological material.

Authenticity and retrieval

The revelation that DNA can survive for prolonged periods in the subfossil tissue of many organisms led to an enthusiastic surge in genetic studies of extinct organisms and populations (Wayne et al. 1999; Hofreiter et al. 2001). Even in the optimistic youth of the field, it was clear that DNA was present only in minute amounts in postmortem tissues, making molecular cloning of authentic DNA compromised by the presence of exogenous contaminant molecules, especially with regard to human specimens. Many of the results from this time, including the inaugural study by Pääbo (1985), are therefore now widely considered to stem from contamination (Pääbo et al. 2004) The advent of polymerase chain reaction (PCR) ameliorated this in some regard, but the stochastic nature of PCR amplification still posed problems in the form

9 of artefactual results and lack of reproducibility, which called for the adoption of stringent criteria for authentication of results (Cooper & Poinar 2000; Hofreiter et al. 2001) which has formed the paradigm of the field since the turn of the millennium.

The eight criteria suggested as measures for avoiding false positives, mainly in the form of contamination, include (i) spatially isolated facilities (ii) general biochemical preservation of specimens and associated remains (iii) no drastically unexpected phylogenetic signature in the results (iv) independent replication of results in a separate lab (v) appropriate inclusion of controls in the form of mock extracts and PCR blanks at all stages (vi) molecular behavior expected of old DNA, such as short fragment length (vii) reproducibility by repeated extraction, amplification or sequencing of multiple clones of a studies fragment (viii) quantification of the number of starting DNA molecules (Cooper & Poinar 2000). However, most high-profile studies from leading research groups have employed but a subset of these criteria (e.g. Gilbert et al. 2008; Green et al. 2008; Poinar et al. 2008; Miller et al. 2008) and there have been calls for a more sensible approach to data authentication (Gilbert et al. 2005), where careful consideration of the methods used and the risk of contamination for the specific taxonomic group studied outweighs mindless ticking of a checklist. By this logic, human or hominid samples are to be treated with extra caution to meticulously supervise the risk of contamination while domestic animals or animals whose tissues can be used in laboratory reagents or materials are placed in the 'medium' risk category (Gilbert et al. 2005).

Second generation sequencing and paleogenomics

Second generation massively parallel sequencing such as the 454 (Roche, Basel, ), Solexa (Illumina) and SOLiD (Applied Biosystems) platforms have only been commercially available for five years but are garnering a huge interest from the scientific community due to the promise of conducting relatively cheap studies or non-model organisms at the genomic scale (Ellegren 2008; Mardis 2008). At their present state, the major weakness -- the short read length compared to traditional Sanger sequencing -- is only a minor burden in the ancient DNA context since the material is expected to be extensively fragmented anyway. The promises they hold for paleogenetic research are all the greater, and by and large depend on the advantage of the possibility to identify very similar clones that still differ in some respect. In addition, these technologies offer a dramatically increased throughput compared to capillary Sanger sequencing of PCR products or bacterial clones (up to two orders of magnitude more raw sequence per run) which ameliorates the fact that the proportion of authentic endogenous DNA rarely exceeds 5 % of the total DNA content in fossil bones (Noonan et al. 2005; Poinar et al. 2006; Green et al. 2006;) with preserved ancient hair shafts having higher yield in some cases (Gilbert et al. 2007).

Immediate applications that have been employed in paleogenetic studies include sequencing of PCR products to identify artifact substitutions due to miscoding lesion and level of contemporary contamination (Brotherton et al. 2007; Gilbert et al. 2007; Briggs et al. 2007), assembly of whole mitochondrial sequences from shotgun data

10

(Gilbert et al. 2007, 2008a; Miller et al. 2009), metagenomic studies of fossil microbial communities (e.g. Green et al. 2006; Miller et al. 2009) and whole genome sequencing projects (Poinar et al. 2006; Miller et al. 2008). Vital to the success of the latter metagenomic approaches -- a term used for shotgun sequencing of DNA from more than one source organism (Venter et al. 2003; Riesenfeld et al. 2004) -- is the availability of genomic reference sequences from the target species or a closely related species. For instance, at the time of Noonan and coworkers' (2005) sequencing of a Pleistocene cave bear (Ursus spelaeus) using the traditional Sanger approach, the closest available reference genome was that of the domestic dog (Lindblad-Toh et al. 2005), which left many sophisticated evolutionary analyses of nuclear loci unattempted. Now, three years later the complete genome of the giant panda has been obtained using Illumina technology (Giant Panda Genome Consortium, unpublished) which illustrates the current explosive growth of genomic research.

The 454 pyrosequencing platform

The first massively parallel sequencing technology to be made popular in both paleogenomics (Poinar et al. 2006; Green et al. 2006; Millar et al. 2008) and conventional genomics (Ellegren 2008; Mardis 2008) was the 454 GS20 (now a part of Roche). In contrast to the chain-terminating method of traditional Sanger sequencing (Sanger 1977), 454 utilizes a sequencing-by-synthesis (SBS) approach known as pyrosequencing (Ronaghi et al. 1996, 1998) in which the incorporation of added nucleotides complementary to a single-stranded template molecule is monitored in real-time by the release of pyrophosphate and emission of light by luciferase (Fig. 2).

This approach allows high fidelity genotyping of very small fragments and bases very close to the sequencing primer, where Sanger sequencing usually performs badly. Also, in implementations other than the 454 approach, the relative frequencies of heterozygotic alleles in a PCR product can be extrapolated from the luminescence signal and quantified, allowing a rough estimate of the original template proportions. These features together with general ease-of-use makes pyrosequencing well suited for ancient DNA, in particular with regards to single SNP-typing (e.g. Götherström et al. 2005; Svensson et al. 2007; Gilbert et al. 2008b).

While this technique has been used commercially for quite some time, the 454 GS20 and its successor the GS FLX use an emulsion PCR step which attaches a library of adaptor-ligated single-stranded DNA templates to microscopic agarose beads. Each bead is kept separate in tiny drops of oil and contains only a single template molecule which is then PCR-amplified to cover the entire bead (Nakano et al. 2003). The beads are then deposited into picolitre sized in which the pyrosequencing reaction commences (Marguiles et al. 2005), adding adenine, guanine, cytosine and thymine nucleosides one at the time, and recording which molecule is incorporated to the elongating strand.

11

Fig. 2 Comparison of 454 and Illumina sequencing-by-synthesis methods. In 454 pyrosequencing (A), the sequence of clonal DNA template molecules attached on microscopic beads is obtained by sequential addition of nucleotides and recording pyrophosphate release via luciferase action. In Illumina sequencing (B), all four nucleotides are instead added simultaneously, and the sequence is obtained by fluorescent labels (Adapted from 454 Roche and Illumina Inc. 2009).

A standing issue with pyrosequencing is the decreased accuracy when calling long homopolymers of the same base. If two or more nucleotides of the same type are repeated, the light emission signal that follows that addition of that type of dNTP will be stronger, but the saturation of the signal becomes significant for homopolymers >6 bases long, causing severe problems to determine the length of these repeats (Margulies et al. 2005). This is usually resolved by designing primers complementary to the surrounding sequences and determination of repeat length by Sanger sequencing (e.g. Green et al. 2008), but for whole genome sequencing projects that can be a major undertaking in itself. The specific usage of an array of picolitre reactors is the main cause for the sequence output of current 454 sequencers to be significantly lower than competing technologies using random amplification, but the 250 bp average read length of the GS FLX and 350 – 400 bp read length of the newly released titanium kit is still unmatched, and allows assembly of novel chromosomes (Mardis 2008) as well as aDNA applications such as determination of the fragment length distribution in a sample (Millar et al. 2008). The 454 protocol is currently being fine-tuned to accommodate the characteristics of ancient DNA samples (Meyer et al. 2007; Maricic & Pääbo 2009) and protocols that enable parallel sequencing of targeted PCR amplicons from multiple sources are also in development (Binladen et al. 2007).

Illumina sequencing

Contrary to the microreactor approach of 454, the Illumina Genome Analyzer (formerly Solexa) is based on parallel pyrosequencing of single-stranded DNA fragment clusters attached to a solid, surface. Similar to 454, the process begins with

12 the building of a fragment library ligated to method-specific adaptors. These adaptors facilitate binding of each fragment molecule to a sealed glass microfabricated flow cell, on which bridge amplification is initiated, forming clusters of approximately one million copies of each template. In contrast to the 454 method, the sequencing-by- synthesis is performed by adding all four nucleotides simultaneously and detecting which is incorporated with fluorescent label unique for each base. Importantly, the 3'- OH of the nucleotides is blocked, making sure that elongation is restricted to one base at the time. This allows for high-fidelity sequencing through homopolymeric regions, another advantage over the 454 technique (and others).

Presently, there are no published studies utilizing the Solexa/Illumina Genome Analyzer (Illumina) on subfossil material but several are underway, including the high- profile endeavor to sequence the Neandertal (Homo neanderthalensis) nuclear genome (Green et al. 2006), which was initially planning to use the 454 GS 20 and FLX machines, but ultimately resulted in at least two-thirds of the data being generated by Illumina machinery (Pennisi 2009). Its main advantage to current 454 implementations is the sheer amount of sequence data generated: while the FLX produces approximately 100 million basepairs of DNA sequence per run, the Solexa output is on average 3 billion bp. The major caveat is the short read length of 35-76 bp which is just above the threshold for what can be reliably assembled or mapped onto a reference genome. Still, this massive number of reads might allow detection of DNA templates so minute in concentration that they are difficult to amplify with standard PCR protocols, raising prospects of extending the time span from which authentic DNA can be obtained.

DNA diagenesis

DNA degradation ensues shortly after the death of an organism but in favorable conditions, the endogenous exonucleases of the cell are themselves inactivated before degradation of intact chromosomes into mononucleotides is complete (Hofreiter et al. 2001). What is left is short fragments, rarely exceeding 150-250 basepairs, which can remain for times upward 500 000 years in cold and dry environments (Willerslev & Cooper 2005; Valdiosera et a. 2006). However, DNA concentration is drastically reduced, causing samples of even moderate chronological age to sometimes not contain any amplifiable DNA from the target species at all (Hofreiter et al. 2001; Pääbo et al. 2004; Millar et al. 2008). Several chemical processes are believed to contribute to the degradation of DNA through time, of which hydrolysis is likely the most prominent, but oxidation of the phosphate backbone might also play a major role (Lindahl 1993). These processes lead to miscoding lesions, double-stranded breaks, single-strand nicks and crosslinks between adjacent strands (Hofreiter et al. 2001; Willerslev & Cooper 2005).

Even when a DNA fragment of sufficient length for analysis can be obtained, it has been documented that the sequence may differ from when the organism was alive, and even differ between repeated PCRs (Hofreiter et al. 2001). Prior to the high- throughput sequencing era, postmortem cytosine deamination was identified as the major contributing process, due to an observed excess of C to T (and G to A)

13 substitutions in ancient DNA sequences (Hofreiter et al. 2001). More recent analyses of sequence libraries generated by the 454 GS 20 and FLX systems (Briggs et al. 2007; Brotherton et al. 2007; Gilbert et al. 2007) have confirmed this damage source, but identified some previously unknown patterns as well. For instance, a significant proportion of strand breaks in DNA from a Neandertal could be explained by depurination of guanine and adenine bases 5' of the observed breakage (Briggs et al. 2007) which is known to lead to an increased susceptibility of hydrolysis of the sugar backbone.

Characterizing these patterns in detail is important not only for the sake of authenticity, but post mortem mutations can potentially inflate population genetic measures of diversity and skew the site-frequency spectrum (Axelsson et al. 2008; Rambaut et al. 2008), causing incorrect conclusions to be made on demographic history.

Amplification of ancient DNA

Several commercially available kits for whole genome amplification (WGA) have been available to the research community for some time. These have been shown to increase the success rate of amplification from low-quality samples (Short et al. 2005; Björnerfeldt & Vilà 2007) but the first application to truly archaeological samples was reported by Poulakakis and coworkers (2006), who reported the astounding news that a short DNA sequence had been obtained from an 800 000 year old elephant sample using the GenomiPhi WGA Kit (Amersham). However, intense scrutiny by the ancient DNA community soon revealed that the authenticity of the sequence could not be validated (Binladen et al. 2007; Orlando et al. 2007) and to my knowledge no further application of commercial kits such as GenomiPhi on samples greater than 500 years old has been reported. The main problem with the majority of commercial WGA kits with regards to severely degraded ancient DNA is the fact that they are not well suited for short fragmented molecules, which is the main reason for developing alternative methods of amplification, such as in this study or the emPCR approach taken by Blow and coworkers (2008).

One area to which protocols to amplify the DNA content of ancient samples is the fast rise of technology to screen a large amount of single nucleotide polymorphisms (SNPs) in parallel, driven by the promise of association mapping for identifying the genetic basis of disease in humans (Fan et al. 2006). Lately this has been expanded to meet the demands of the agricultural and manufacturers have now made available DNA genotyping arrays with genome-wide coverage for murine, equine, porcine, canine, ovine and bovine systems (Fan et al. 2006). These 'SNP chips' hold great promise for application on ancient material, since short genotype-specific probes are hybridized to the template DNA, thus avoiding one of the major issues in paleogenetic research -- the short fragment size of degraded DNA. However, the characteristically low concentration will likely require some sort of non-discriminant amplification of the template material prior to hybridization. The utility of whole- genome amplification routines for molecular analysis of canine DNA samples have been shown using existing commercial techniques (Thompson et al. 2005; Chang et al.

14

2007) and while most samples used were modern in origin, the successful typing of samples in which only 3-15 % of the total DNA content was canine is promising for the application on fossil material.

Theoretical background

With the majority of domestication studies, and particularly those including ancient DNA data, having employed phylogeographic reasoning based on the distribution of animal populations on the trees of mitochondrial and, to a lesser extent, Y- chromosomal haplotypes (Bruford et al. 2003), some basic theoretical population genetic results are important to emphasize. Since the entire mitochondrial genome and a vast region on the Y-chromosome do not undergo recombination i.e. is completely linked, it can be assumed that all genetic markers on a mitochondrion or Y-chromosome share a common genealogical history. This assumption is very helpful in a phylogenetic context because it allows us to apply relatively simple DNA substitution models to estimate the underlying genealogy of a sample of sequences with some accuracy. However, the history of a single genetic marker such as these does not give a conclusive account of the evolutionary history and relationships between the sampled individuals (Nordborg & Rosenberg 2002; Nielsen & Beaumont 2009). In the presence of recombination however, each genetic position can in principle be viewed as having its own unique genetic ancestry, with only linkage and linkage disequilibrium causing correlation between them. This allows for repeated sampling of different outcomes of the inherently stochastic process that is the transmission of genetic copies through time in finite populations (Hudson 1990; Rosenberg & Nordborg 2002; Wakeley 2008) and underlines the necessity of multiple nuclear loci for obtaining statistical power to infer demographic history.

The coalescent

This realization has been furthered by recent developments of a mathematical model of genealogical ancestry in populations known as the coalescent. Introduced by Kingman (1982a, 1982b), the n-coalescent model describes the stochastic merging of lineages in a population backwards in time. Modeling the ancestry of a sample of genetic markers (e.g. DNA sequences) in this way has allowed an increased number of useful numerical results to be reached (Hudson 1990; Wakeley 2008) and perhaps more importantly, has allowed efficient computation of complex population genetic models (Beaumont et al. 2002; Rosenberg & Nordborg 2002). While the results of original implementations of Kingman's coalescent were obtained for populations with an effective size in the limit as the effective population size Ne goes to infinity, and that the lineages on which the samples traced back their ancestry to the most recent common ancestor (MRCA) were exchangeable (i.e. having an equal probability of reproductive success), numerous theoretical advances have shown its robustness and ability to accommodate additional biologically relevant parameters (reviewed by Rosenberg & Nordborg 2002).

15

Fig. 3 The probability of discordant gene genealogies depends on the internode time between two divergence events.

The divergence and differentiation of populations has been one of the most intriguing areas of theoretical research utilizing the coalescent framework, and perhaps one of its most influential implementations. Methods to reconstruct allopatric and sympatric speciation incorporating migration (Wakeley & Hey 1997; Nielsen & Wakeley 2001) and fluctuations in effective population size (Hey & Nielsen 2004) that utilize population-wide data have been developed and implemented in powerful Bayesian statistical approaches (Beaumont et al. 2002; Thornton & Andolfatto 2006). While these allow model parameters to be approximated for datasets where frequentist approaches would be too computationally intensive, some of the essential properties of the coalescent model allow full-likelihood estimates to be obtained for samples of a more modest size (Wakeley 2008).

Fundamentally, the most recent common ancestor of lineages sampled in different populations always predates the split of the populations, in the absence of migration. In the coalescent backward-in-time terminology, lineages can coalesce only when they find themselves in the same population, but the rate at which they do so is dependent on the effective size of the population (Wakeley 2008). In the case of multiple successive divergence events, tracing the genealogies back in time will sometimes lead to a genealogical topology, or 'gene tree', which does not the evolutionary relationship of the populations. This is expected to remain long after the initial divergence two populations, with reciprocal monophyly or the absolute concordance of all gene trees not being expected to occur until after 9-12 Ne generations (Hudson & Coyne 2002).

Assuming a divergence model with immediate isolation followed by no migration between populations, we can think of a scenario were two divergence events follow one another, at some interval, resulting in three isolated populations. Following two lineages sampled from the populations produced by the more recent speciation event backwards in time, they can only coalesce once they find themselves in the same

16 population, in which case the probability of coalescence is 1/N each generation. The probability that they do not coalesce is then 1-(1/N), or rescaled by N: e-T where T is the time since divergence, measured on the coalescent time scale of the ancestral population. Interestingly, the probability of discordance then depends only on the time until the time of divergence of the third population is reached, and is simply

where is the probability that no coalescent event occurs during a given internode time T and 2/3 represent that, when the ancestral population is reached, two of the three possible combinations of lineages result in a discordant gene tree (Fig. 3) (Takahata 1989; Rosenberg 2002).

Importantly in the context of ancient DNA from historical time points, the model only assumes no migration between the earlier diverged population and the ancestral population of the other two, which is effectively satisfied when sampling from a specimen that died earlier in history. Moreover, while inferring the genealogy of a sample of sequences in practice requires that some mutation event happened on the lineages, there is no need for assumptions on mutation rate, which can be especially problematic when sequences are sampled from different points in time.

17

Materials and methods

Archaeological material from a wide range of contexts and dates were screened for DNA preservation and selected for amplification. Ten Neolithic dog specimens dated to 4500 - 5300 BP from excavation sites in Ajvide on the island of Gotland and Korsnäs (Sweden) were used, as well as two Medieval dogs from Skara and Stockholm (Sweden) (Malmström et al. 2005, 2008) (Fig. 4). In addition, a historical dog sample, dated to the 20th century was included for reference. Cattle samples were from Visby (~1000 BP), Marstrand (300-400 BP) and Lödöse (900-1000 BP) (Svensson et al. 2007). Cave bear and hominid samples were from Sima de los Huesos in Sierra de Atapuerca, Spain, and has been dated to ~600 000 years ago (Arsuaga et al. 1993; Garcia et al. 1997; Bishkoff et al. 2007).

Fig. 4 Map of sampling localities. Stockholm, Korsnäs, Skara and Ajvide in Sweden. Sierra de Atapuerca in Spain.

18

Ancient DNA precautions

All pre-PCR work on ancient samples were done in a spatially isolated facility, dedicated to molecular research on archaeological specimens and in compliance with accepted authentication criteria (Cooper & Poinar 2000; Gilbert et al. 2005). Routines in the laboratory include daily cleaning of work areas with bleach and UV-irradiation of the entire lab for 4 hours each night. A positive air pressure is constantly maintained with an isolated ventilation system. Access to the lab is possible only through an airlock and is restricted to authorized personnel only, all of whom must wear full body suits with a zip hood, face mask and double layers of gloves when inside the clean room. The lab is never entered by a person who has been in a post- PCR work area the same day. Extraction, PCR setup and WGA-reactions were conducted in dedicated fume hoods, which were cleaned with DNA AWAY (Molecular BioProducts, San Diego, USA) and irradiated with UV-light between sessions. All reagents except for DNA oligos were irradiated with 6 J/m2 in a UV-crosslinker (Techtum Lab, Umeå, Sweden) prior to use. Filter pipette tips were used exclusively.

Contamination was monitored with negative controls in the form of no-template blanks as well as samples from ancient cattle which were given the same treatment and processed alongside the canine, ursine and hominid samples during the extraction, PCR and WGA processes. During the extractions, at least one cattle sample was extracted for each 2 dog specimens. One blank was also kept for each round of extractions, which never totaled more than 8 reactions. During the WGA experiments 1-2 ancient bovine samples and 2-4 no-template blanks blank was processed alongside the ancient samples.

DNA extraction

The outer surface of each specimen was first decontaminated with an approximate dose of 1 J/m2 irradiation in a UV-crosslinker (Techtum Lab). A Dremmel automated drill was used to generate 100-200 mg bone powder from the specimen, which was subsequently incubated for 24 h at 38° C in 1 ml buffer containing 0.5 M EDTA (pH 8.0), 2 M urea and 100 µg/ml proteinase K. The sample was then briefly spun at 2000 rpm for 5 min before transferring the supernatant to an Amicon Ultra-15 centrifugal filter (Millipore, MA, USA) which was centrifuged at 4000g for 15 minutes, or until 50- 100 µl remained in the column. The extract was then purified using QIAquickTM silica- based spin columns (Qiagen, Hilden, ) based on the method of Yang and others (1997). First, the extract was added to a spin column together with approximately 5 times its volume of PB buffer. Following centrifugation at 13 000 rpm for 1 minute, 750 µl PE buffer was added and the column was centrifuged as before until there was no visible liquid left. DNA was then eluded by adding 50 µl EB buffer to the column membrane and centrifuging for 1 minute at 13 000 rpm.

19

Table 1. Oligonucleotides used.

Oligo name Sequence (5' - 3')

Dog-F CCATCAGCACCCAAAGCTG Dog_R_111_bp AGAAGGGTTTACCTGGAGATACTGACA mt16SWR-F1 AAGTTACCCTAGGGATAACAGCG mt16SWR-R1 biotin-CATCGAGGTCGTAAACCCTATT mt16SWR-S1 AACAGCGCAATCCTAT ad1_m13t GTAAAACGACGGCCAGTT ad2_revm13 ACTGGCCGTCGTTTTAC m13_universal GTAAAACGACGGCCAGT

PCR amplification

To verify that the extractions contained canine DNA, a 111 bp fragment of the mitochondrial control region was amplified as in Malmström et al. (2007) with the primers Dog-F and Dog_R_111_bp (Table 1). The PCR was in 55 µl volumes with 1x reaction buffer (Naxo, Estonia), 2,5 mM MgCl2, 0,2 mM of each dNTP, 0,18 µM of each primer, 1U Smart-Taq DNA polymerase (Naxo), 1-2 µl extract and ddH2O to a final volume of 55 µl. Contamination was monitored with 1 no-template controls for every 3 reactions. Reactions were run on a PTC-225 DNA engine tetrad (MJ Research, Waltham, USA) with an initial 7 minute denaturation step at 94° C, then 45-50 cycles of 94° C for 1 min, 54° C for 1 minute and 72° C for 1 minute. The program also included a final extension step at 72° C for 7 minutes. As positive control, a modern dog sample was used with the same conditions and reagents, but handled in a separate facility.

Contamination assay

To investigate the relative levels of carnivoran DNA to human contaminants in the extracts, a pyrosequencing primer system with conserved primers, amplifying a region between positions 2377 and 2424 in the canine mitochondrion (human positions 2946-2993) containing two discriminant SNPs in the 16S rRNA gene, was developed. The ability of the system to gauge the proportion of human to canine DNA was tested on a gradient of mixtures from modern humans and dogs. The specificity of the primer system was checked by MegaBlast alignment (Altschul et al. 1990) to all non- redundant nucleotide entries in GenBank, and while possible spurious alignments to carnivores not found in the Baltic region was observed, no alignments to domestic animals or other likely contaminants were found. A forward primer (mt16SWR-F1), a biotinylated reverse primer (mt16SWR-R1) and an internal sequencing primer (mt16SWR-S1) (Table 1) were designed using the PSQTM 96MA SNP software (Biotage, Uppsala, Sweden).

20

Amplification was conducted as above and PCR products were prepared by immobilizing 30-50 µl of the biotinylated product on streptavidin-coated Sepharose beads (Amersham Pharmacia Biotech, Uppsala, Sweden) by incubating at room temperature for 10 minutes in 1X PSQ binding buffer (5 mM Tris-HCl, 1 M NaCl, 0.5 mM EDTA, 0.05 % Tween 20 [pH 7.6]). The binding buffer was replaced with denaturation solution by removing all liquid with vacuum, the sample was then incubated for 1 min and washed twice with 150 µl Washing buffer. The sequencing primer was added in a 55 µl volume of 35 uM primer in 1X annealing buffer (20 mM Tris-Acetate, 5 mM MgAc2 [pH 7.6]) and heated for 2 minutes at 80° C. Pyrosequencing (Ronaghi et al. 1998) was carried out on a PSQTM 96MA pyrosequencer using the SNP software and SNP reagent kit (Biotage) according to the manufacturers instructions. The PSQTM 96MA SNP software (Biotage) was used to retrieve the nucleotide dispensation order as well as analyze the pyrograms and automatically assay SNP-genotype and surrounding sequence and, in the contamination assay, quantify the respective alleles.

Metagenomic amplification

Extracts that yielded mitochondrial d-loop amplicons of the expected size and showed an approximate DNA concentration >10 ng/ul were used in whole genome amplification (WGA) reactions alongside negative controls. The protocol was designed to avoid further degradation of DNA that is frequent in many commercial kits using restriction enzymes, and to increase the probability of amplification of short fragmented DNA, a that is expected to differ between authentic ancient DNA and modern contaminants (Fig. 5).

Between 5 and 38 µl extract, corresponding to a minimum of 5 to maximum 1000 ng of starting DNA, was blunt-end repaired with a quick blunting kit (New England Biolabs, Beverly, MA, USA) according to the manufacturers instructions. The reaction employs the combined 3' exonuclease and 5' polymerase activity of T4 DNA polymerase with the phosphorylating action of T4 polynucleotide kinase, enabling subsequent ligation reactions. The sample was incubated at room temperature for 15 minutes with 1 µl Blunting enzyme mix, 1X Blunting buffer (1 mM Tris-HCl, 10 mM KCl, 0.01 mM EDTA, 0.1 mM dithiotheitol, 0.01 % Triton X-100, 5 % glycerol) and 0.1 mM dNTP and then heated at 70° C for 10 minutes to inactivate the enzymes. The sample was then cleaned with a Qiaquick purification kit as above, but with an elution volume of 44 µl.

Sticky ends for adaptor ligation were generated by adding 3'-A overhangs to the blunt ended extracts using Klenow fragment (New England Biolabs), which is a truncated DNA polymerase lacking exonuclease activity. The reaction utilized 1-2 units Klenow in 50 uM dATP and 1X NE buffer (New England BioLabs), and was incubated for 30 minutes at room temperature and inactivated at 75° C for 20 minutes. The sample was then purified with a Qiaquick kit.

21

Fig. 5 Schematic view of template preparation prior to amplification.

Adaptors were prepared from the two complementary oligos ad1_m13t and ad2_revm13 (Table 1) in 1 M NaCl, heated to 95° C for 10 minutes and slowly brought back to ambient temperature. Molar proportions for ligation between the adaptors and the template were calculated by considering an average fragment length of 60 bp in the blunted extract and a DNA concentration on the order of ~10 ng/ul. Ligation reactions were prepared with 1 to 10 and 1 to 100 proportions between template and adaptor (10 uM and 100 uM, respectively) and mixed with 4.5 µl Quick T4 DNA ligase in 1X Quick Ligation Buffer (New England BioLabs). The reaction was activated for 5 minutes at room temperature, chilled on ice and cleaned with a Qiaquick purification kit as above.

Amplification was performed in 50 µl volumes with 0.4 uM m13 universal primer

(Table 1), 2 mM MgCl2, 0.2 uM dNTP, 1X SmartTaq PCR buffer and 1 U SmartTaq Hot DNA polymerase (Naxo). The amount of template DNA for the reaction was varied, but ranged between the equivalent of 1 and 19 µl extract The amplification program consisted of a hot start step of 95° C for 15 minutes and then 30 cycles with 30 second steps with denaturation at 95° C, annealing at 55° C and extension at 72° C. A final extension step was held for 15 minutes. Amplification reactions were cleaned with an MSB PCRapace kit in a work area free from high-copy PCR products. The WGA product was mixed with 250 µl binding buffer and spun for 3 minutes at 12 000 rpm. A volume Elution buffer equivalent to the starting amount of extract in the blunting reaction was used to collect the WGA product from the column. The resulting DNA concentration was measured on a ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA).

22

454 and Illumina sequencing

A whole genome amplification product from specimen Ajvide-19 was chosen for shotgun sequencing on the basis of its relatively low levels of human contamination, high rate of PCR-amplification success using dog-specific primers, high starting concentration and high WGA-product concentration. A 454 library from Ajvide-19 was constructed at the Centre for Ancient Genetics (University of Copenhagen, Denmark) as in Gilbert et al. (2007), with post adaptor-ligation heat-liberation of template molecules from the beads (Maricic & Pääbo 2009) and qPCR quantification of the library modified from Meyer et al. (2008). The library was sequenced with a LR70 FLX kit on 1/4 of a picolitre sequencing plate on a GS FLX sequencer (Margulies et al. 2005; Roche-454 Life Sciences, Basel, Switzerland) following the manufacturer's instructions.

Amplified extracts from a Sima de los Huesos cave bear (4 lanes from 2 reactions of the same extract) and hominid (3 lanes) were sequenced on an Illumina Genome Analyzer at Aarhus University and a partial 454 GS FLX plate according to the manufacturers' instructions. Read length for the Illumina sequencer was set to 76 bp

Sequence processing and alignment

Sequence reads obtained by 454 sequencing of Ajvide-19 were re-filtered to allow shorter sequences in Copenhagen and bases were called automatically using the 454 software. Sequence reads from Aarhus were treated similarly, but was not re-filtered. All sequence reads where compared to the boxer dog genome (Canfam 2.0; Lindblad- Toh et al. 2005) and the human reference genome (hg18) which were downloaded from GenBank and formatted using the NCBI software Makeblastdb. Sequences reads were aligned to database entries using MegaBlast (Zhang et al. 2000) with word size = 12 and masking of repetitive sequences to decrease computational expense. Reads for which the best hit (and second best hit in the case of multiple HSPs) was to dog or human sequences with alignment expect value < 0.001, identity > 80 % and bit scores > 10 where identified as putative authentic canine DNA and human contamination, respectively.

The word size parameter of the BlastN and MegaBlast algorithms determines the minimum width of an exact match between the query sequence and database entry for a local alignment to be extended. Considering the expected presence of adaptor sequences in the library as well as an expected increase in mismatches in the pyrosequencing approach and sequence damage in ancient DNA, word size and the required identity proportion was set to allow relatively short alignments. To prevent incorrect taxon assignment of reads, a more stringent expect value, which is the expected number of random matches in a database of similar size, was chosen following e.g. Noonan et al. (2005, 2006), Briggs et al. (2007) and Green et al. (2006).

Sequence reads which had significant hits to both the human and boxer genome and in which the bit scores for the best hit in each species was within 20 % of the best hit in the other species were assigned to taxa based on the number of bases aligning,

23 where the longest alignment was considered the best. Sequences which aligned identically to both the human and boxer genome were regarded as human contamination as a conservative measure.

Illumina GA raw output files were converted to fastA format using the fq_all2std Perl script from the Maq 0.7.1 package (Mapping and Assembly with Qualities, http://sourceforge.net/projects/maq/). Sequence reads were compared to reference mitochondrial genome sequences from Homo sapiens (NC_001807; Ingman et al. 2000), Homo neanderthalensis (Green et al. 2008), Ursus spelaeus (Krause et al. 2008) and Canis lupus familiaris (Kim et al. 1998) using MegaBlast with word size = 12 and expect value = 0.001. Sequences that aligned to any of the target mtDNAs with score > 10 were extracted and aligned to all 1790 vertebrate mitochondrial reference sequences currently available in GenBank (May 10, 2009) and the second published U. spelaeus mtDNA genome (Bon et al. 2008) using MegaBlast with the same parameters as above.

Mitochondrial genome assembly was performed with the MOSAIK package (softwares used were MosaikBuild, MosaikAligner, MosaikSort and MosaikAssembler), which uses a modified Smith-Waterman algorithm (Smith & Waterman 1981). Reads were required to align with at least 80% identity over the whole read to a target reference mitochondrial sequence with no more than 10 % mismatches over the aligned part of the sequence. Hash size was set to 11.

Metagenome analysis

The metagenomic profiles of the 454 libraries were further investigated with BlastX (Altschul et al. 1990) searches against the NCBI nr database, which contains all known and annotated protein sequences. BlastX compares all six reading frames of the input nucleotide sequence with the amino acid sequence entries in the specified database, and the default settings of word size 3 and expect value of 10 were used. The best 5 hits for each read were saved and taxa were identified using MEGAN 3.1.2 (Huson et al. 2007). Lowest Common Ancestor (LCA) algorithm parameters were set as follows: the minimum bit score for a read to be considered was 35 (min-score), hits were further required to have bit scores within 10 % of the best hit for the read (top- percent) and any taxon was required to have at least 2 observed presences in the library to be identified (min-support). Size distributions of reads were obtained using a modified Biopython script (Cock et al. 2009).

Adaptor sequences in the total 454 library as well as partitions of taxonomically assigned sequences were identified with the blastn-short algorithm with a word size of 8. Hits where required to have an expect value < 0.1, 80 % identity to the query match and a bit score > 10.

Evolutionary analysis

Sequences which were identified as canine above were selected and aligned to the

24 current releases of the 7.8X boxer (Canfam 2.0; Lindblad-Toh et al. 2005) 1.5X poodle (Kirkness et al. 2003) and 1.9X Abyssinian cat (felCat3.0; Pontius et al. 2007) genome assemblies with the BlastN algorithm (Altschul et al. 1990) using a word size of 8. To increase to amount of sequences for which the ancestral cat allele could be determined, a second 0.9X shotgun sequence library (id: ACBE; Mullikin et al. unpublished) from seven cat breeds were used to complement some alignments. Alignments that spanned > 60 % of the query sequence and at least 30 positions were kept in the analysis. Duplicate reads of the same template molecule were also identified and discarded.

Because of the limited coverage of the poodle genome, sequence divergence times between the cat, boxer and Ajvide-19 specimen were calculated using all BlastN alignments which spanned 30 bp or more for these three species. Sites containing insertions and deletions in a species were ignored because of uncertainty in their mutation rates compared to transitions and transversions, as well as the lack of accuracy in homopolymeric regions in 454 pyrosequencing. Polymorphisms specific to the Neolithic dog were excluded since a large proportion of them are expected to be due to post-mortem DNA damage. The time of the most recent common ancestor to Felines and Canids was assumed to be 55 million years ago (Pontius et al. 2007).

I set out to estimate the internode time Ta-bp between a theoretical Wright-Fisher population including the Neolithic dog (a) and a corresponding population ancestral to modern poodles (p) and boxers (b) using numerical results of the structured coalescent obtained by Takahata (1989) and expanded upon by Rosenberg (2002). The log-likelihood function of observed topologies under the above model was computed following Wakeley (2008)

where nconcordant is the number of genealogical trees for which the (a(bp)) topology was inferred, ndiscordant is the total number of gene genealogies with the other two topologies and T is the internode time between the two divergence events in the three-population model.

An approximate confidence interval was obtained using the profile likelihood method i.e. solving for the two values of T that lowers its maximum likelihood by 1.92. For comparison, randomly chosen shotgun reads from four globally distributed wolf populations, an Alaskan Malamute and a German shepherd (Lindblad-Toh et al. 2005) were downloaded from the EMBL database and aligned to the target genomes. Gene genealogies were inferred by manual inspection using the unweighted parsimony criterion, and subjected to the same computations as above.

25

Results

Two PCR assays were used to identify ancient canid samples that showed a high degree of DNA preservation. Using canid-specific primers amplifying 111 bp of the mitochondrial control region, 5 of 11 Neolithic dog samples (from Ajvide and Korsnäs) produced amplicons of the expected size (Table 2). None of two tested medieval samples (Skara; Stockholm) resulted in PCR products. No PCRs using dog-specific primers yielded products in control extractions of medieval cattle (Lödöse; Marstrand) which were processed alongside the dog samples (data not shown). A contamination assay contamination proportion in the obtained PCR products revealed a varying ratio of canid to human DNA in the samples, ranging between 50 - 0 % contamination.

Optimization of the metagenomic amplification protocol

The performance of the WGA method used was evaluated by assaying total (i) total DNA concentration with a Nanodrop spectrophotometer, (ii) allelic quantification of a human/canid discriminant SNP in the mitochondrial 16S gene (iii) by 454 and Illumina shotgun sequencing of different WGA products. The samples used ranged in chronological age between ~50 - 600 000 years and were taken from canine, bovine, ursine and hominid fossils.

Table 2 List of canine samples used.

Specimen Sample PCR product (extraction) Locality Element (mg) (success/failure)

2 Korsnäs Tibia 200 0/2 5 Korsnäs Cranium 120 2/0 13 Ajvide Radius 150 0/2 14 Ajvide Radius 200 0/2 16 Ajvide Radius 160 1/1 17 (a) Ajvide Radius 200 1/1 17 (b) " " 200 2/0 19 (a) Ajvide Humerus 200 2/0 19 (b) " " 200 2/0 20 Ajvide Cranium 300 0/2 21 Ajvide Incisor 50 0/2 23 Ajvide Jawbone 320 1/1 26 (a) Skara Mandible 90 0/2 26 (b) " Incisor 90 0/2 28 Stockholm Bone 160 0/2 MC3 Historical Bone 120 1/-

26

Total DNA concentration was found to increase 3 to 10-fold when averaged over different rounds of extractions (Fig 6). An increased level of light absorbance was also found for negative controls. The procedure was also performed on samples prepared with a different extraction method (Emma Svensson, unpublished), based on magnetic separation. These were found to have lower starting amount of material, but resulted in similar amount of amplified DNA. The effect of the amount of starting template on amplification was investigated on 4 samples handled in parallel with starting (extract) concentration between 2-13 ng/ul. Two different PCR-experiments with template corresponding to 1 and 9- µl of the original extraction resulted in only minor effect on post-amplification concentration. The amplification reaction was also optimized by varying the adaptor (2 and 4 mM) and dNTP (0.2 and 0.4 mM) concentrations, both separately and simultaneously, with no apparent increase in amplification.

C

Fig. 6 Effects of the metagenomic amplification on the DNA content of 5 ancient samples, handled in parallel. (A) Measured by absorption at 260 nm prior to- and after amplification (WGA). (B) Proportion of human to canine DNA measured by pyrosequencing allelic quantification at 3 stages of the amplification-procedure for 5 Neolithic dog samples and a pyrogram of sample Ajvide-19a (C) showing the allelic proportions of the canine (A) and human (T) haplotypes.

27

To investigate the effect of laboratory work on the reaction, the proportion of human contamination was monitored for a set of samples handled in parallel during three different stages of the protocol: post-extraction, post-ligation, and post-amplification. A non-significant tendency of increased contamination after the initial handling steps was observed, with a corresponding decrease in contamination proportion after amplification (Fig. 6).

Sequencing of the Neolithic dog

In total, 49809 sequence reads where obtained from FLX-sequencing of an amplified DNA extract prepared from the Ajvide-19 dog specimen. At the 80 % identity and expect value 0.001 level, 736 reads (1.5 %) had significant hits to the dog genome. After comparing the relative scores and alignment coverage of reads that mapped to regions of both the human and dog genome, 114 reads (0.23 %) could be confidently assigned as canine while 2100 reads (4.22 %) were assigned to Homo sapiens. Canine sequences ranged between 30 and 269 bp with an average length of 89 bp (Fig. 7), resulting in a total of 9981 bp aligned sequence. For comparison, the library of reads that were assigned as human contaminants had an average length of 96.2 bp with the distribution spanning 24 bp (limited by the identification pipeline) to 302 bp (limited by the maximum read length of the GS FLX system). An intact m13 adaptor in the 5' end was observed in 261 reads (0.5 %) but alignment of the m13 sequence with all reads using BlastN-short resulted in 2590 putative adaptor sequences (5.2 %).

Sequencing of the Sima de los Huesos fossils

Two independent WGA-products from the same extraction of the Middle Pleistocene cave bear Ursus deningeri (extraction performed as in Valdiosera et al. 2006) was sequenced on 1/64 of a 454 FLX plate and 4 lanes on an Illumina Genome analyzer. Among 5642 FLX sequence reads, 6.6 % were found to contain an m13-adaptor sequence, of which 4 % had a perfectly intact sequence in the 5' end. In an effort to screen for putative bear sequences, 31 reads (0.55 %) could be aligned to the dog genome while 91 reads (1.6 %) were the assigned to the human species.

A total of over 20 million Illumina sequence reads (20 049 825) were obtained from the two WGA samples, of which 2064 could initially be aligned to the Ursus spelaeus mitochondrial sequence and 3260 to the domestic dog mitochondrion. These reads together with reads that aligned to human mitochondria were extracted and aligned to all published complete vertebrate mtDNAs. This analysis revealed that a large proportion (2326 reads) were porcine contaminants, with their best hit to Sus mitochondria. No hit to bear mitochondria with an expect value > 0.001 had a better score than a hit to a porcine mitochondria in this analysis, but 27 reads were assigned as human mtDNA.

28

Fig. 7 Size distribution of (A) the total amount of reads obtained from Ajvide-19 and (B) sequences matching the human (blue) and dog (green) genome.

One WGA-product (Hh-c2) prepared from an extraction of the Sima de los Huesos hominid (H. heidelbergensis) was also sequenced on 1/16 of a 454 FLX plate and 3 lanes of an Illumina GA flow cell. A total of 21454 FLX sequences were obtained, of which 4.2 % had an intact m13-adaptor sequence in the 5' end. Using a blast algorithm which allows a certain amount of mismatches, an adaptor sequence was found in 6.6 % of the reads. The read library was subjected to the same analysis as the cave bear sample, with 0.02 % of the reads aligning to the dog genome (consistent with the found level of pig contamination) and 1.1 % to the human genome.

More than 1.8 million Illumina sequence reads (18 267 346) were obtained from the three lanes, from which the ~0.03 percent reads that initially aligned to one or more of the four animal mitochondria were extracted and aligned to a database containing all vertebrate mtDNAs. Again, a large portion was identified as porcine in origin (5473 reads or 0.01 %). In this analysis 50 reads were identified as human, but none of these had a better hit to H. neanderthalensis than H. sapiens.

To examine the origin of the porcine contaminants, 4408 sequence reads that had hits to Suidae sequences (pigs and relatives), were aligned to the reference mitochondrial sequence of Sus scrofa domestica (Yasue H, Yamamoto Y, Hayashi T, Honma D, Nishibori N, Nishibori M, Wada Y, uploaded to the NCBI 24 Feb 2009). After filtering, 2945 sequence reads were successfully aligned, creating a 13.2X assembly of a total of 222 045 bp sequence data, covering 75.6 % of the 16770 bp reference sequence. No reliable SNPs were identified, and thus there was no indication of contamination from more than one pig individual.

29

Fig. 8 The most populous taxonomic orders identified with BlastX searches in the Ajvide-19 metagenome.

Fossil metagenome contents

The metagenomic content of the three 454 sequencing libraries was profiled by comparison with the NCBI protein sequence database nr. In the sequences obtained from Ajvide-19, the proteobacterial order Rhizobiales was the most populous, followed by Burkholderiales and Flavobacteriales (Fig. 8). Rhizobiales was also the most populous order of the two libraries prepared from Middle Pleistocene fossils from Spain, with additional frequent hits in protein sequences from Actinomycetales, Flavobacteriales and Sphingomonadales (Fig 8). The most populous phylum was proteobacteria in all three libraries. There were also matches to both the Primate and Carnivore sequences (see discussion).

Evolutionary analysis of Neolithic dog sequences

After removing duplicate reads of the same template molecule, 30 reads were aligned to the boxer, poodle and cat genome assemblies. After further filtering to assure authenticity (see methods for details), 20 three-way alignments of the putative Neolithic dog, boxer and cat spanning 1488 bp of positions (excluding indels) were scrutinized for informative mutations. Following several previous studies (e.g. Green et al. 2006; Noonan et al. 2006; Wall & Kim 2007), I ignored substitutions specific to the Neolithic dog sequences since a large proportion of these are expected to be post-mortem damage, instead focusing on the amount of dog-specific substitutions which pre- and postdate the divergence of the Ajvide dog and the boxer.

30

Fig. 9 Metagenomic profiles of 454 sequence libraries prepared from amplified Sima de los Huesos samples. The most populous taxonomic orders identified by BlastX in A) Hh-1c and B) Ud1-b2. Size distribution of the total read library from Hh-1c (C) and Ud1 (D).

At the 1488 studied positions, 200 nucleotide transitions and transversions between the boxer and cat genome assemblies were found, corresponding to 13.4 % divergence. Assuming an equal mutation rate in the feline and canid lineages, 100 of the substitutions can be expected to have occurred on the lineage leading to dogs from the cat-dog ancestor. For 2 positions, the Ajvide-19 sequence was found to retain the ancestral cat allele for which the Boxer had a derived allele. Thus, 2 % of the substitutions on the lineage leading to the boxer occurred since its separation from the Ajvide canine. Calibrating the divergence of cats and dogs to 55 mya (Pontius et al. 2007), this would correspond to an average time to most recent common ancestor (TMRCA) between the Ajvide dog and the boxer of about 1.1 million years. However, I note that both of the derived boxer substitutions for which the Ajvide dog displayed the ancestral allele were found on the same 48 bp read (see discussion). If this sequence is excluded from the analysis, the observed cat-dog divergence is instead 13.6 % (198/1455), with no identified substitutions having occurred on the boxer lineage since its divergence from the Neolithic dog. While no statistical analysis of data such as this can be made, the result implies that the average TMRCA between Ajvide and boxer sequences is less than 550 000 years.

31

Fig. 10 Log-likelihood (Log(L(T))-log(L)max) for observing the number of concordant and discordant gene genealogies under different internode times T. The likelihood functions displayed are for data from Gray wolves (blue), Ajvide-19 with outgroup (red), Ajvide-19 without outgroup (green), an Alaskan malamute (purple) and a German Shepherd (yellow). The dashed line represents the 1.92 log-likelihood drop i.e. the 95 % confidence interval.

To estimate the time of split between the populations leading to the Ajvide dog and the population giving rise to modern boxer and poodle breeds, I instead took the approach applied by Wakeley (2008) on data from a previous study (Jennings and Edwards 2005). This approach takes advantage of the fact that the proportion of gene genealogies that show discordant topology to the population phylogeny is exponentially distributed with the internode time between the most recently diverged populations (in this case the poodle and boxer), and the earlier split with the third population i.e. the Neolithic dog population in Ajvide (Takahata 1989; Rosenberg 2002). If there was ancestral population structure between the Ajvide dog and the modern breeds, I hypothesized that the internode time T plus the expected time of creation of modern breeds 100 - 200 years ago, would be greater than 5000 years (the estimated age of the Ajvide-19 specimen (Malmström et al. 2008)) when transformed from the coalescent time scale to chronological years.

I created four-way alignments between the Ajvide-19 reads (a), boxer genome (b), poodle genome (p) and cat genome (c), and scored alignments that contained synapomorphic SNPs or indels to determine genealogical relationships. In a total of 7 alignments could the genealogy be inferred using the cat as an outgroup and in all except 1 (14.3 %) was the Ajvide dog found to retain the ancestral cat allele, yielding an estimated genealogy concordant with the (c)((a)((b)(p))) tree. The single discordant

32 genealogy saw the poodle and Ajvide dog share a derived allele, for which the boxer retained the ancestral variant. Using the approach of Wakeley (2008), I computed the full log-likelihood function of the observed data (Fig. 10) and arrived at a maximum likelihood point estimate for the internode divergence time between the populations at T = 1.54 (95 % confidence interval: 0.30 - 4.33) where T is measured in units of Ne generations. These sequences were different from the cat genome at 13.8 % of the studied positions.

In this approach, the cat outgroup is used to avoid the possible bias introduced by post-mortem damages in the ancient sequences, but is formally not required for the concordance-discordance analysis. If the requirement of an outgroup cat sequence is abolished, the genealogy can be determined for 11 alignments, of which 2 (18.2 %) show the discordant (c)((ap)(b)) grouping. The point estimate for T given this dataset is T = 1.30 (95 % CI: 0.366 - 3.00) and the relative log-likelihood function

−푇 −푇 퐿표푔(퐿(푇))=(6 log (1−2푒 /3)−1 log( 푒 /3)) - 퐿표푔(퐿)max is plotted in Figure 10.

To provide references and a possibility to evaluate the approach taken to gauge the time of split between populations, shotgun sequences from a modern day German shepherd, Alaskan Malamute and pooled data from 4 wolves from different worldwide populations (Lindblad-Toh et al. 2005) were subjected to the same analysis above, without the requirement of a feline outgroup sequence. Based on 48 alignments, the maximum likelihood estimate of the internode time for an assumed split between German shepherds and a population ancestral to Boxers and Poodles was T = 0.24 (CI: 0.013 - 0.556), while corresponding values for 26 Alaskan Malamute sequences was T= 0.33 (0,077-0,657). A putatively discordant genealogical history was found for only one of 37 wolf sequences, yielding a maximum likelihood estimate of T = 3.23 (CI: 1.79-6.09).

33

Discussion

To identify suitable candidates for sequencing, several samples from ancient material were screened for DNA preservation and a subset were selected for a metagenomic amplification procedure, together with samples of unknown preservation and contamination. The success rate of PCR amplification was in line with previous studies conducted on some of the specimens (Malmström et al. 2005, 2007, 2008; Girdland- Flink 2008), and more general observations from ancient DNA studies (Millar et al. 2008).

Metagenomic amplification for aDNA sequencing

Initially, the metagenomic amplification procedure was evaluated by using light absorbance as a measure of total DNA concentration. Varying increases in total DNA concentration was observed, with a marked signal in PCR controls as well. Generally, the signal from PCR controls was lower, and without a clear mode in absorbance at the expected wavelength 260 nm (data not shown). This would be indicative of a presence of molecules other than DNA, such as aminoacids and sugars in both controls and test samples. Alternative methods of post-amplification purification could potentially ameliorate these effects, but would not redeem the fact that spectrophotometer analysis can only provide a rough measurement of DNA content. This is possibly even more of a concern for measuring the molecular content of DNA extractions from ancient samples, since quantitative PCR experiments have shown that these almost unequivocally contain extremely low amounts of DNA (e.g. Hofreiter et al. 2001, Malmström et al. 2005, 2007).

The effect of the procedure was also investigated by amplifying a conserved region of the mitochondrial 16S gene and quantifying the relative proportion of a dog-specific SNP to the variant common in humans. Generally, the level of contamination seemed to increase after amplification compared to what was observed directly in the extracts. This is not completely unexpected since the extensive handling and amount of additional reagents required in the metagenomic amplification increase the chance of exposure to exogenous DNA from humans. However, when contamination was measured at an additional checkpoint prior to amplification but after all enzymatic steps involving blunt-ending and ligation, the majority of reactions was revealed to have an initial increase in contamination during the first handling stage, but a slight drop in contamination after amplification, with the net result being the same. This could be an effect of only DNA molecules that have two successfully ligated adaptors on each end being amplified, and since authentic molecules are expected to be more heavily fragmented than modern contaminants, the stochiometric abundance of ends in the ancient molecules would predict that a larger proportion of adaptors would ligate to these, compared to what would be expected from the molecular abundance measured per-nucleotide.

Increasing the yield of authentic sequences from shotgun libraries of ancient material is currently a main target of research (Noonan et al. 2006; Meyer et al. 2008; Maricic

34

& Pääbo 2009; Pennisi 2009), since genomes from now extinct taxa can provide important information for conservation and natural history. I investigated the efficiency of the metagenomic amplification procedure by yield of target DNA and the proportion of templates to which an adaptor had successfully been ligated. Three 454 shotgun libraries containing ~6000, ~10 000 and ~50 000 sequence reads were generated from metagenomic amplifications and examined to investigate the proportion of reads that contained a successfully ligated adaptor sequence. Only one of these libraries contained reads that could be assigned as authentic, namely the 114 sequences of the Ajvide-19 library which aligned to the dog genome better than any other sequence. This library also contained a large proportion of human contamination (~4 %) compared to authentic sequences (~0.2 %), largely consistent with the results of Malmström and coworkers' (2005) qPCR analysis the sample and associated remains, which also showed a large variance in contamination level between extractions of the same sample.

In the sequences obtained from Ajvide-19, an intact m13 adaptor in the 5' end was observed for only 0.5 %, but a search algorithm allowing a degree of mismatches resulted in adaptor sequences being observed in 5.2 %. In the libraries obtained from the significantly older Spanish samples, 6.6 % of reads from Ud1-b2 were found to contain an m13-adaptor sequence, of which 4 % had a perfectly intact sequence in the 5' end. In the Hh-c1 library, 4.2 % of the reads had an intact adaptor in the 5' end. These results imply that ligation and amplification is not so efficient as to preclude a large proportion of and non-amplified molecules in the sequence libraries. However, since the experiment did not include shotgun sequencing directly from the extracts, a clearer picture of the efficiency of the method can not be attained.

While no authentic sequences from the Sima de los Huesos cave bear and hominid were identified from the ~35 million reads in the Illumina library, a surprising proportion of the sequence reads matched the mitochondrion of domestic pigs. Assuming a 1:200 ratio between nuclear and mitochondrial DNA (Poinar et al. 2005; Noonan et al. 2006), it could be estimated that the total level of porcine contamination on the Illumina GA flow cell was on the order of 2 million reads, or 5 % of the total library sequenced from the cave bear fossil. This is to be contrasted with the level of human contamination, which was approximately 2 x 10-4 % mtDNA or 0.04 % of the total library, making the same assumption about the nuDNA:mtDNA ratio.

More taxonomic orders were also identified in the metagenomic profile of the two libraries prepared from Middle Pleistocene hominid and cave bear remains retrieved from an assembly in a cave environment in the Atapuerca Mountains compared to the shotgun sequences from the Neolithic dog fossil, which was excavated from a limestone substrate in the Baltic Sea area. However, this could also be an effect of the modifications to the original sequencing protocol made in Copenhagen, which are meant to optimize the retrieval of authentic sequences. For instance, the measures taken to retain shorter sequence reads in Copenhagen are readily visible when the size distributions of the three 454 libraries are compared. Modifications to suit the specific needs of ancient DNA which were implemented in Copenhagen might also explain the lack of animal contaminants such as what was found in the libraries prepared in Aarhus.

35

Authenticity of canine sequences

From the Ajvide-19 sample, 114 of 49809 454 sequence reads had significant hits to the dog genome and no hit to a human sequence with a bit score within 10 % of the hit to the dog genome. While a similar pattern was present in the cave bear and Neandertal libraries of Noonan and coworkers (2005, 2006), the observation of two modes in the size distribution of the canine sequences raised suspicion of the longer molecules being contaminants. However, after filtering duplicates and reads with matches with insufficient coverage, 30 sequence reads were informative in an evolutionary context, and several observations point to these being authentic sequences from the Neolithic dog.

First, while there were signs of extensive human contamination in the read library, it seems unlikely that in a pool of ~2000 human sequence reads, a number of sequences would show the observed level of homology to the dog genome compared to the human genome. For cases in which a read matched both genomes with bit scores within 20 % of each other, the alignment which spanned the most positions was considered the best. Several of the sequences that had identical alignments to both target genomes (these were conservatively assigned as contamination) had their hit in the boxer genome in the unassigned ('Un') category (data not shown), suggesting that they might be contaminants present in both this study and the shotgun libraries of the dog genome project. Secondly, if the sequences originated from some vertebrate other than human or dog, it seems unlikely that such a high proportion of derived dog mutations compared to the cat would be observed (i.e. 0-2 cat alleles compared to 198-200 canid-specific substitutions in a total of 1488 informative positions). Also, the apparent contamination of domestic pig sequences which was observed in Illumina sequences from the two Spanish samples was not found in the Ajvide-19 sequences. Finally, if the sequences were exogenous contaminants from a modern dog, it seems unlikely that their observed divergence time to the poodle and boxer would differ to such an extent to what was computed for randomly chosen German Shepherd and Alaskan Malamute shotgun sequences. The sequences used for computation of population divergence also differed from the homologous cat sequence at 13.8 % positions, similar to the 13.4 % average observed in the total sequences that were identified as canine.

Sequence analysis

The 20 informative sequence reads from the Neolithic dog could be aligned to the boxer and cat genome assemblies over a total of 1488 bp. When scrutinized for single nucleotide polymorphisms (SNPs) excluding indels, only two such polymorphisms were found. Interestingly, in both cases the sequence displayed the same variant as was found in the cat genome and both were located on the same sequence read. While the chance of this result might seem low, this read could be aligned to both target genomes over 33 bp, which is usually considered sufficient for correct species identification in metagenomic libraries from fossils (e.g. Briggs et al. 2007). Moreover, the sequence could be aligned to a homologous sequence in the poodle genome assembly, which displayed both the derived boxer alleles not found in the Ajvide-19

36 library. In their SNP-discovery effort in several dog breeds and canid species, Lindblad-Toh and coworkers (2005) observed a rate of 1 SNP per ~900 bp in most comparisons between the boxer and other breeds (including the poodle). The SNP- rates in the Alaskan malamute (1/787) and grey wolves (1/~580) were slightly higher and so it could be expected that due to the age of the Ajvide-19 specimen, its sequence similarity to the boxer would be intermediate to these values e.g. in the range of 1 SNP per 600-700 bp. If so, it would be expected to observe on the average ~2 SNPs among 1488 analyzed positions, but this expectation would include mutations specific to the Neolithic dog, which were excluded due to the increased presence of artifacts in ancient DNA (Hofreiter et al. 2001). While it would be unexpected to observe 2 SNPs on the same sequence read, mutation rate is known to vary across the dog genome, causing a non-uniform distribution in SNPs within and between chromosomes (Lindblad-Toh et al. 2005).

I first took the approach of previous studies on Neandertal humans (Noonan et al. 2006, Wall & Kim 2007) for estimating the average TMRCA between Neolithic dog sequences and the modern boxer. This approach ameliorates the potential effect of postmortem sequence damage in ancient DNA by considering the proportion of substitutions that postdate the divergence of an ancient population (i.e. Neandertals or Neolithic Scandinavian dogs) from the population leading to a modern counterpart (anatomically modern humans or boxers) by including an outgroup taxon (chimpanzee or domestic cat). Again, due to uncertainties in their mutation rate, indels were excluded. Depending on whether the sequence containing two SNPs described above is included in the analysis, either 2 or 0 substitutions are found to have occurred on the boxer lineage since its split with the Neolithic dog, of a total of 100 changes that occurred since its split with the cat. Assuming a split between the ancestors of cats and dogs 55 million years ago (Pontius et al. 2007), this would imply an average TMRCA of boxer and Ajvide-19 sequences of either 1.1 million or less than 550 000 years.

Clearly, two factors cause the analysis above to provide insufficient information on the relationship between the Ajvide dog and modern dogs. First, more data in the form of sequence positions would be required to obtain a well-resolved ratio of substitutions that pre- and postdate the split of the two dog lineages. However, this would not completely mitigate the fact that the split time between cats and dogs is two orders of magnitude more ancient (~106 years) than a reasonable Upper Paleolithic/Neolithic split time between the two dog lineages (~104 years). In this regard, a reference genome sequence from a more closely related outgroup such as the announced but unpublished genome of the Giant Panda (Illumina news release 9 Jan 2009) or more preferably a distantly related canid could provide a smaller evolutionary window and increased resolution to this type of analysis. Moreover, even if a good estimate of the average TMRCA could be obtained, the date obtained for the TMRCA of sequences from different populations is not the same as the date of their demographic divergence. In fact, this event is always more recent. For instance, using similar methods as presented above, Green et al. (2006) found an average TMRCA for two autosomal sequences in African humans of 459 000 years, but anatomically modern humans are considered to have split from other archaic hominid populations about 200 kya.

37

Evolutionary analysis

Due to the shortcomings of the TMRCA-based analysis discussed above, I instead took an approach based on expectations on the probability of genealogical discordance between three populations under a structured coalescent model. Since it relies only on inferring the underlying genealogy of sampled sequences, it is insensitive to assumptions on mutation rate, which allowed me to make use of parsimony- informative indels in the data.

Apart from the alignments to the boxer and cat genomes discussed above, the obtained 10-7X coverage Neolithic canine sequences were also aligned to shotgun sequences from a poodle. This analysis revealed a large a proportion of dog-specific mutations, common to both the poodle and boxer, but not found in the ancient Neolithic dog sequences. An approximated divergence between a population harboring the Ajvide dog and a population leading to the modern day breeds was computed based on expectations on genealogical discordance. The same analysis on publicly available shotgun sequences from other canids revealed a much earlier divergence of two modern dog breeds and an expected more ancient estimate for grey wolf sequences.

However, since the rate of coalescence between lineages in a population is proportional to its effective size, estimates are obtained on the coalescent time scale, in which time is measured in units of T = Ne generations. To convert this estimate to chronological years, assumptions on the generation time and the effective population size Ne of the ancestral population to boxers and poodles is required. Effective population size is usually estimated from the scaled mutation rate θ = 4Neµ (also known as the population mutation rate) when the mutation rate µ is known. The size of the dataset presented here is insufficient to provide a reasonably accurate estimate of these parameters and software developed to co-estimate the ancestral effective population size using the variance in TMRCA of the diverged sequences (Yang 2002; Yang and Rannala 2003) fail to accommodate samples taken from different points in time. However, previous attempts in the literature to estimate Ne in dog population history are Lindblad-Toh et al. (2005) and Gray et al. (2009). These authors used instead patterns of linkage disequilibrium to and the population recombination parameter ρ = 4Ner, from which the effective population size can Ne also can be obtained when the recombination rate r is known, for instance from gametic typing. While Gray et al. (2009) conditioned on the domestication event happening exactly 5000 generations ago in their simulations, Lindblad-Toh et al. (2005) co-estimated the date of the event and the effective size of the domesticated population, by fitting a model to the observed decay of linkage disequilibrium in different dog breeds. Although investigating only a rather sparse grid of parameters in their analysis, Lindblad-Toh et al. (2005) obtained most support for an ancestral domesticated population size of 13 000 chromosomes with inbreeding coefficient F = 0.12. While their analysis was not exhaustive of all alternative model of demographic history, it is reasonable to assume an effective population size Ne in the ancestral dog population to have been on the order of 103.

38

Assuming an effective population size of 13 000 chromosomes and a generation time of 2 years (3 years are sometimes used in the literature (Gray et al. 2009) but 2 years represent a more conservative assumption in this case), the maximum likelihood point estimate for the internode time between the creation of the boxer and poodle breeds and their divergence to the Ajvide population is 40 000 years (Confidence interval: 7800 - 112 580) or using the approach less robust to sequencing errors 33 800 years. Breed creation likely occurred some 100-300 years ago, and this value is to be added to the total time since divergence. The same analysis conducted on wolf sequences gives an estimate of 83 200 years (CI: 46 800 - 158 600) but this chronological conversion is probably less valid, since several studies point to a notable bottleneck during domestication (Lindblad-Toh et al. 2005; Gray et al. 2009), which would bring about an increased rate of coalescence, thus changing the coalescent time scale, and causing a fixed Ne = 13 000 to results in overestimation of the internode time. The dates obtained from subjecting shotgun data from a German Shepherd (point estimate: 5200 years, CI: 260 - 15 600) are as expected more recent than for the Ajvide dog, but nonetheless more ancient than the historic accounts of breed creation. However, the assumption that Boxers and Poodles share a more recent history than with German Shepherds is likely not valid, since European dog breeds show equal pairwise degrees of genetic differentiation (Karlsson et al. 2007) and are expected to derive from a common ancestral population less than 100 generations ago (Lindblad-Toh et al. 2005). The dates obtained for the Alaskan Malamute are more interesting since the assumption on the population 'tree' has some support from interbreed phylogenies and cluster analyses, which indicate that Alaskan Malamutes together with some Asian breeds occupy a basal position and an early divergence from other breeds (Parker et al. 2004). However, assuming an ancestral Ne of 13 000 chromosomes still gives estimates older than predicted by historical accounts (internode time: 7800 years, CI: 2600 - 15600). However, this may also be confounded by gene flow from North American gray wolves into the Malamute population, which has been claimed by breeders and is somewhat supported by genomic data (Lindblad-Toh et al. 2005), or a smaller effective population size in the population compared to more ancient times.

An aspect that could also possibly confound the estimates on the internode time of the Ajvide dog and the modern breeds is if there were considerable demographic fluctuations in the ancestral population of the poodle and boxer. For instance, if the poodle and boxer breeds were both created from the same subpopulation, this contraction in Ne would result in a faster rate of coalescence between sequences from the two breeds and an overestimation of the split time using the above approach.

Fluctuations in domestic dog Ne over time since divergence from wolves have is still not been completely investigated, but patterns in linkage disequilibrium point to two significant events in their history. The relationship between genomic patterns of linkage disequilibrium and demographic history can be understood in that ancient events are seen in gametic correlations (LD) within small windows, but as the investigated window increases, the accumulation of recombination events in the ancestral graph of the studied region will cause these signals of old events to fade and LD would tend to the neutral expectation for a constant population. Most modern breeds show a biphasic decay of linkage disequilibrium which is likely due to a first bottleneck during domestication and a second more recent one during breed

39 formation, which created short- and long-range LD in the genomes of modern breeds, respectively. Importantly however, LD between breeds do not conform to this pattern, instead it falls to ambient levels within 200 kb (Lindblad-Toh et al. 2005; Karlsson et al. 2007). Lack of short-range LD in interbreed datasets supports a scenario where the dog population held a fairly constant effective population size prior to breed creation, which lends further support to the poodle-boxer-Ajvide divergence estimation approach taken in this study.

Still, the fact that almost all current breeds save a few Asian populations show equidistant pairwise relationships (Karlsson et al. 2007), with little structure between previously identified breed clusters (Parker et al. 2004) could be interpreted as evidence for breed creation acting on a subset of the global dog population at the time, but this does not necessarily mean a concomitant contraction in Ne. Mitochondrial DNA phylogeography show a decrease in genetic variation with distance from East Asia, which has previously been proposed to be a result of the region being the geographic origin of domestication (Savolainen et al. 2002), but this could also be an effect of directional breeding being less emphasized in the region. As for the ancient population, Anderson and coworkers (2009) found support an expansion in Ne immediately after the domestication event, which would also tend to increase the rate of coalescent events, warping the coalescent time scale to shorter units of chronological time. However, this would only affect estimates on the internode time if the Ajvide dog population and the ancestral population to modern dogs diverged during the initial expansion of the newly domesticated dog population.

Although the divergence estimation is based on only 7 and 11 independent loci, Jennings and Edwards (2005) observed a steep drop-off in variance of the inferred divergence times when resampling datasets of 0 to 10 loci from their original 28. Also, genetic data from a previous study of the Ajvide dog population (Malmström et al. 2008) show a similar pattern to published boxer and poodle mitochondrial sequences. The mtDNA sequence of the Ajvide-19 specimen, as well as the majority of the other samples, was found to belong to the phylogenetic haplogroup "C" (Malmström et al. 2008). In contrast, while Lindblad-Toh et al. (2005) do not report the mitochondrial sequence of the boxer Tasha whose genome was sequenced, poodle and boxer sequences have previously been assigned to haplogroup "A" (Vilà et al. 1997), which is separated in the canid mtDNA phylogeny from haplogroup C by gray wolf sequences (Vilà et al. 1997, 1999). Previously obtained mtDNA (Malmström et al. 2005; 2007; 2008) and Y-chromosomal (Girdland-Flink 2008) data from the Neolithic Ajvide population was not included in the divergence analysis because these markers are expected to have a decreased effective population size compared to autosomal loci, and thus operate on a different (faster) time scale. Nevertheless, the striking pattern of a majority of ancient European dog lineages grouping outside the diversity of modern breeds (Verginelli et al. 2005; Malmström et al. 2008; Girdland-Flink 2008; Degouilloux et al. 2009; this study) seem to support a more ancient origin or closer relationship with gray wolves than is evident in data from major breeds.

Since most models of dog domestication show that the divergence from wolves probably conferred a relatively severe bottleneck in the newly founded domestic population, the pattern observed in the genomic data presented here is also

40 congruent with the Ajvide-19 dog tracing its ancestry to a wolf population other than boxers and poodles. The tight bottlenecks of two independent domestication events would confer an increased rate of coalescence in the respective populations, which could explain the observed pattern even if the domestication or backcrossing events happened relatively recently i.e. during the Neolithic.

The approach of using summary statistics based on observed number of loci with concordant and discordant genealogical trees provides a possible solution to a number of problems associated with evolutionary analysis of ancient DNA data. Perhaps most importantly, the possible impact of post-mortem sequence damage is far lesser than for summary statistics based on nucleotide diversity or the site- frequency spectrum. For a post-mortem substitution to affect the outcome of the analysis, it must confer a change from a derived variant to the allele present in the outgroup taxon, or vice versa. Furthermore, the approach is likely robust to the problem that sampled sequences are heterochroneous, i.e. sampled at different points in time, since the genealogical tree length is irrelevant as long as sufficient mutations are present to determine the topology. Furthermore, likelihood functions for split times computed from different individuals can be multiplied if they are assumed to be independent. This allows the power of direct high-throughput sequencing to be harnessed even though it is extremely unlikely that sequences from different runs overlap. The requirement of at least two reference genomes from closely related populations makes the method difficult to apply to non-model species, but there are great prospects to apply the approach for studies on ancient human populations. There are already reference genomes available for humans of both Asian and European ancestry, and the launch of a consortium to sequence 1000 human genomes (http://www.1000genomes.org/) further underlines the future possibilities of paleogenomics.

41

Conclusions

The results presented here provide new evidence to longstanding questions surrounding the origin and domestication of dogs. Observed differences between data from a Neolithic Scandinavian dog and the nuclear genomes of modern boxer and poodle dogs support an early Upper Paleolithic origin of dogs, or genetic contributions from more than one wolf population. The study also shows that even relatively small-scale paleogenomic shotgun sequencing efforts can be worthwhile if a reasonably well-preserved specimen is chosen, and provide statistical power equal or greater than that of single marker studies of ancient populations. Also, while the success rate of the molecular amplification method is not yet totally clear, it might be a feasible approach worth more investigation. The concordance/discordance-based coalescent analysis method might be an especially powerful method to estimate demographic parameters, which is also robust to typical problems with ancient DNA, such as the presence of postmortem sequence damage or uncertainties in how to treat sequence data from different time points.

42

Acknowledgments

Thanks to Anders Götherström for making this study possible as well as being a great supervisor and role model. I am very grateful to Emma Svensson, Eva Daskalaki, Tom Gilbert and Linus Girdland-Flink for their help with the technical part of the study; Mattias Jakobsson, Martin Lascoux, Lucie Gattepaille, Thomas Källman and John Wakeley for insightful comments on the analysis; and Axel Künstner, Benoit Nabholz and Tobias Skoglund for helpful computational suggestions. Eva Daskalaki is also credited for conceiving the metagenomic amplification rationale. I also want to thank Thomas Källman, Pádraic Corcoran and Michael Stocks for suggesting improvements on an earlier version of the report, as well as Tobias Skoglund and Don Hitchcock for the cover illustration.

Thanks to Jan Storå (Osteoarkeologiska forskningslaboratoriet, Stockholm), Maria Vretmark (Västergötlands Museum, Skara), Leena Drenzel (Historiska Museet, Stockholm) and (Centro Mixto UCM-ISCIII de evolución y comportamiento humanos, Madrid) for giving access to and describing the archaeological samples as well as Love Dalén, Christina Valdiosera and Emma Svensson for extracting DNA from hominid, ursine and bovine remains, respectively.

454 FLX library building and sequencing in Copenhagen was performed by Tom Gilbert, Morten Rasmussen and Eske Willerslev. Illumina GA and 454 FLX sequencing at Aarhus University was provided by Christian Bendixen.

43

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ, (1990) Basic local alignment search tool, Journal of Molecular Biology 215:403-410

Anderung C, Bouwman A, Persson P, Carretero JM, Ortega AI, Elburg R, Smith C, Arsuaga JL, Ellegren H, Götherström A, (2005) Prehistoric contacts over the Straits of indicated by genetic analysis of Iberian Bronze Age cattle, Proceedings of the National Academy of Science of the of America, 102:8431-8435

Anderson TM, vonHoldt BM, Candille SI, Musiani M, Greco C, Stahler DR, Smith DW, Padhukasahasram B, Randi E, Leonard JA, Bustamante CD, Ostrander EA, Tang H, Wayne RK, Barsh GS (2009) Molecular and Evolutionary History of Melanism in North American Gray Wolves, Science 1339-1343:323 Published online 5 February 2009

Arnesson-Westerdahl A, (1983) Mesolitiskt hundfynd vid Hornborgasjön, Värdefulla fauna- och domestikationshistoriska fynd, Flora och Fauna 78:211-216

Arsuaga JL, Martinez I, Gracia A, Carretero JM, Carbonell E (1993) Three new human skulls from the Sima de los Huesos Middle Pleistocene site in Sierra de Atapuerca, Spain, Nature 362:534 – 537

Arsuaga JL, Martinez I, Gracia A, Lorenzo C, (1997) The Sima de los Huesos crania (Sierra de Atapuerca, Spain): a comparative study, Journal of 33:219–281

Axelsson E, Willerslev E, Gilbert MTP, Nielsen R, (2008) The effect of ancient DNA damage on inferences of demographic histories, Molecular Biology and Evolution 25:2181-2187

Beaumont MA, Zhang W, Balding DJ, (2002) Approximate Bayesian computation in population genetics, Genetics 162:2025-2035.

Benecke N, (1987) Studies on early dog remains from Northern Europe, Journal of Archaeological Science 14:31-49

Binladen J, Gilbert MT, Bollback JP, Panitz F, Bendixen C, Nielsen R, Willerslev E, (2007) The use of coded PCR primers enables high-throughput sequencing of multiple homolog amplification products by 454 parallel sequencing. PLoS ONE 2:e197

Binladen J,Gilbert MTP, Willerslev E (2007) 800 000 year old mammoth DNA, modern elephant DNA or PCR artefact? Biology Letters 22:55–56.

Bischoff JL, Williams RW, Rosenbauer RJ, Aramburu A, Arsuaga JL, García N, Cuenca- Bescós G, (2007) High-resolution U-series dates from the Sima de los Huesos hominids yields 600 kya: implications for the evolution of the early lineage, Journal of Archaeological Science 34:763-770

Björnerfeldt S, Vilà C, (2007) Evaluation of methods for single hair DNA amplification, Conservation Genetics 8:977–981

Blow MJ, Zhang T, Woyke T, Speller CF, Krivoshapkin A, Yang DY, Derevianko A, Rubin EM, (2008) Identification of ancient remains through genomic sequencing, Genome Research 18:1347-1353

44

Bon C, Caudy N, de Dieuleveult M, Fosse P, Philippe M, Maksud F, Beraud-Colomb E, Bouzaid E, Kefi R, Laugier C, Rousseau B, Casane D, van der Plicht J, Elalouf J-M, (2008) Deciphering the complete mitochondrial genome and phylogeny of the extinct cave bear in the Paleolithic painted cave of Chauvet, Proceedings of the National Academy of Sciences of the United States of America 105:17447-17452; published online before print October 27, 2008

Briggs AW, Stenzel U, Johnson PL, Green RE, Kelso J, Prüfer K, Meyer M, Krause J, Ronan MT, Lachmann M, Pääbo S, (2007) Patterns of damage in genomic DNA sequences from a Neandertal, Proceedings of the National Academy of Sciences of the United States of America 104:14616–14621

Brotherton P, Endicott P, Sanchez JJ, Beaumont M, Barnett R, Austin J, Cooper A (2007) Novel high-resolution characterization of ancient DNA reveals C > U-type base modification events as the sole cause of post mortem miscoding lesions. Nucleic Acids Research 35:5717–5728

Bruford MW, Bradley DG, Luikart G, (2003) DNA markers reveal the complexity of livestock domestication, Nature Reviews Genetics 4:900-910

Chang ML, Lee Terrill R, Bautista MM, Carlson EJ, Dyer DJ, Overall KL, Hamilton SP, (2007) Large-Scale SNP genotyping with canine buccal swab DNA, Journal of Heredity 98:428-437

Chapman B, Chang J (2000) Biopython: python tools for computational biology, ACM SIGBIO Newsletter 20:15-19

Clutton-Brock J, (1995) Origins of the dog: domestication and early history, in The Domestic Dog: Its Evolution, Behaviour and Interactions with People, Serpell J, Ed. Cambridge Univ. Press 7–20.

Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics, Bioinformatics 25: 1422-1423

Cooper A, Poinar HN, (2001) Ancient DNA: do it right or not at all. Science 289:1139

Cooper A, Lalueza-Fox C, Anderson S, Rambaut A, Austin J, Ward R (2001) Complete mitochondrial genome sequences of two extinct moas clarify ratite evolution, Nature 409:704-707

Davis SJM, Valla FR (1978) Evidence for domestication of the dog 12,000 years ago in the Natufian of , Nature 276:608–610.

Dayan T, (1994) Early domesticated dogs in the Near East, Journal of Archaeological Science 21, 633–640.

Deguilloux MF, Moquel J, Pemonge MH, Colombeau G (2009) Ancient DNA supports lineage replacement in European dog gene pool: insight into Neolithic southeast France, Journal of Archaeological Science 36:513–519

Ellegren H, (2008) Sequencing goes 454 and takes large-scale genomics into the wild, Molecular Ecology 17:1629-1631

Fan J-B, Chee MS, Gunderson KL, (2006) Highly parallel genomic assays, Nature Reviews Genetics 7:632

45

Fiedel SJ, (2005) Man's best friend - mammoth's worst enemy? A speculative essay on the role of dogs in Paleoindian colonization and megafaunal extinction, World Archaeology 37:11-25

Garcia MA, (2005) Ichnologie générale de la grotte Chauvet, Bulletin de la Société préhistorique française 102:103–108

García N, Arsuaga JL, Torres T, (1997) The carnivore remains from the Sima de los Huesos Middle Pleistocene site (Sierra de Atapuerca, Spain), Journal of Human Evolution 33:155- 174

Germonpré M, Sablin MV, Stevens RE, Hedges REM, Hofreiter M, Stiller M, Despré VR, (2009) Fossil dogs and wolves from Palaeolithic sites in , the Ukraine and : osteometry, ancient DNA and stable isotopes, Journal of Archaeological Science 36:473– 490

Gilbert MTP, Bandelt HJ, Hofreiter M, Barnes I, (2005) Assessing ancient DNA studies, Trends in Ecology & Evolution 20:541-544

Gilbert MTP, Binladen J, Miller W, Wiuf C, Willerslev E, Poinar H, Carlson JE, Leebens-Mack JH, Schuster SC (2007) Recharacterization of ancient DNA miscoding lesions: insights in the era of sequencing-by-synthesis, Nucleic Acids Research 35:1–10

Gilbert MTP, Tomsho LP, Rendulic S,Packard M,2 Drautz DI, Sher A, Tikhonov A, Dalén L, Kuznetsova T, Kosintsev P, Campos PF, Higham T, Collins MJ, Wilson AS, Shidlovskiy F, Buigues B, Ericson PGP, Germonpré M, Götherström A, Iacumin P, Nikolaev V, Nowak- Kemp M, Willerslev E, Knight JR, Irzyk GP, Perbost CS, Fredrikson KM, Harkins TT, Sheridan S, Miller W, Schuster SC, (2007) Whole-genome shotgun sequencing of mitochondria from ancient hair shafts, Science 317:1927–1930

Gilbert MTP, Drautz DI, Leskc AM, Ho SYW, Qi J, Ratan A, Hsu C-H, Sher A, Dalén L, Götherström A, Tomsho LP, Rendulic S, Packard M, Campos PF, Kuznetsova TV, Shidlovskiy F, Tikhonov A, Willerslev E, Iacumin P, Buigues B, Ericson PGP, Germonpré M, Kosintsevo P, Nikolaev V, Nowak-Kemp M, Knight JR, Irzyk GP, Perbost CS, Fredrikson KM, Harkins TT, Sheridan S, Miller W, Schuster SC (2008a) Intraspecific phylogenetic analysis of Siberian woolly mammoths using complete mitochondrial genomes, Proceedings o the National Academy of Sciences of the United States of America 105:8327-8332

Gilbert MTP, Jenkins DL, Götherström A, Naveran N, Sanchez JJ, Hofreiter M, Thomsen PF, Binladen J, Higham TFG, Yohe RM, Parr R, Scott Cummings L, Willerslev E, (2008b) DNA from Pre-Clovis Human Coprolites in Oregon, North America, Science 320:786 - 789

Girdland-Flink L, (2008) Y-SNPs from Neolithic Scandinavian dogs, Project work in biology, Uppsala University, Supervised by A. Götherström

Goebel T, (1999) Pleistocene human colonization of the Americas: an ecological approach, Evolutionary Anthropology 8:208-227

Gray MM, Granka JM, Bustamante CD, Sutter NB, Boyko AR, Zhu L, Ostrander EA, Wayne RK, (2009) Linkage disequilibrium and demographic history of wild and domestic canids, Genetics 181:1493-1505

Green RE, Krause J, Ptak SE, Briggs AW, Ronan MT, Simons LF, Du L, Egholm M, Rothberg JM, Paunovic M, Pääbo S, (2006) Analysis of one million basepairs of Neandertal DNA,

46

Nature 444:330-336

Green RE, Malaspinas A-S, Krause J, Briggs AW, Johnson PLF, Uhler C, Meyer M, Good JM, Maricic T, Stenzel U, Prüfer K, Siebauer K, Burbano HA, Ronan M, Rothberg JM, Egholm M, Rudan P, Brajković D, Kućan Z, Gušić I, Wikström M, Laakkonen L, Kelso J, Slatkin M, Pääbo S (2008) A Complete Neandertal Mitochondrial Genome Sequence Determined by High- Throughput Sequencing, Cell 134:416-426

Gundry RL, Allard MW, Moretti TR, Honeycutt RL, Wilson MR, Monson KL, Foran DR, (2007) Mitochondrial DNA analysis of the domestic dog: control region variation within and among breeds, Journal of Forensic Science 52:1556-4029

Götherström A, Anderung C, Hellborg L, Elburg R, Smith C, Bradley DG, Ellegren H, (2005) Cattle domestication in the Near East was followed by hybridization with aurochs bulls in Europe, Proceedings of the Royal Society Biological Sciences 272:2345-2350

Hey J, Nielsen R, (2004) Multilocus methods for estimating population sizes, migration rates and divergence times, with applications to the divergence of Drosophila pseudoobscura and D. persimilis, Genetics 167:747-760.

Hofreiter M, Jaenicke V, Serre D, Haeseler Av A, Paabo S, (2001) DNA sequences from multiple amplifications reveal artifacts induced by cytosine deamination in ancient DNA. Nucleic Acids Research 29:4793–4799

Hofreiter M, Serre D, Poinar HN, Kuch M, Pääbo S, (2001) Ancient DNA, Nature Reviews Genetics 2:353-359

Hofreiter M, (2008) Palaeogenomics, Comptes Rendes Palevol, 7:113-124

Hubbard YJP, Aken NL, Beal K, Ballester B, Caccamo B, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Fitzgerald S, Fernandez-Banet J, Graf S, Haider S, Hammond M, Herrero J, Holland R, Howe K, Howe K, Johnson N, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Melsopp C, Megy K, Meidl P, Overduin B, Parker A, Prlic A, Rice S, Rios D, Schuster M, Sealy I, Severin J, Slater G, Smedley D, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wood M, Cox T, Curwen V, Durbin R, Fernandez- Suarez XM, Flicek P, Kasprzyk A, Proctor G, Searle S, Smith J, Ureta-Vidal A, Birney E (2007) Ensembl 2007, Nucleic Acids Res. 2007 Vol. 35, Database issue:D610-D617.

Hudson RR, (1990) Gene genealogies and the coalescent process, in Oxford Surveys in Evolutionary Biology, 7:1-44, Oxford University Press, Oxford.

Hudson RR, Coyne JA, (2002) Mathematical consequences of the genealogical species concept, Evolution 56:1557-1565.

Huson DH, Auch AF, Qi J, Schuster SC, (2007) MEGAN Analysis of metagenomic data, Genome Research, 17:377-386

Ingman M, Kaessmann H, Pääbo S, Gyllensten U (2000) Mitochondrial genome variation and the origin of modern humans, Nature 408:708-713

Karlsson EK, Baranowska I, Wade CM, Salmon-Hillbertz NHC, Zody MC, Anderson N, Biagi TM, Patterson N, Rosengren Pielberg G, Kulbokas EJ, Comstock KE, Keller ET, Mesirov JP, von Euler H, Kämpe O, Hedhammar Å, Lander ES, Andersson G, Andersson L, Lindblad-Toh K (2007) Efficient mapping of mendelian traits in dogs through genome-wide association, Nature Genetics 39:1321-1328

47

Kim KS, Lee SE, Jeong HW, Ha JH (1998) The complete nucleotide sequence of the domestic dog (Canis familiaris) mitochondrial genome, Molecular Phylogenetics and Evolution 10:210-220

Kingman JFC, (1982a) On the genealogy of large populations, Journal of Applied Probability 19:27–43

Kingman JFC, (1982b) The coalescent, Stochastic Processes and Applications 13:235–248

Kirkness EF, Bafna V, Halpern AL, Levy S, Remington K, Rusch DB, Delcher AL, Pop M, Wang W, Fraser CM, Venter JC (2003) The Dog Genome: Survey Sequencing and Comparative Analysis, Science 301:1898-1903

Krause J, Unger T, Noçon A, Malaspinas A-S, Kolokotronis S-O, Stiller M, Soibelzon L, Spriggs H, Dear PH, Briggs AW, Bray SCE, O'Brien SJ, Rabeder G, Matheus P, Cooper A, Slatkin M, Pääbo S, Hofreiter M, (2008) Mitochondrial genomes reveal an explosive radiation of extinct and extant bears near the Miocene-Pliocene boundary, BMC Evolutionary Biology 8:220.

Lalueza-Fox C, Sampietro ML, Caramelli D, Puder Y, Lari M, Calafell F, Martinez-Maza C, Bastir M, Fortea J, de la Rasilla M, Bertranpetit J, Rosas A (2005) Neandertal evolutionary genetics: mitochondrial DNA data from the , Molecular Biology and Evolution 22:1077-1081

Leonard JA, Wayne RK, Wheeler J, Valadez R, Guillén S, and Vilà C, (2002) Ancient DNA Evidence for Old World Origin of New World Dogs, Science 298:1613.

Leonard JA, Shanks O, Hofreiter M, Kreuz E, Hodges L, Ream W, Wayne RK, Fleischer RC, (2006) Animal DNA in PCR reagents plagues ancient DNA research, Journal of Archaeological Science 34:1361-1366

Lindahl T, (1993) Instability and decay of the primary structure of DNA, Nature 362:709– 715

Lindblad-Toh K, Wade CM, Mikkelsen TS, Karlsson EK, Jaffe DB, Kamal M, Clamp M, Chang JL, Kulbokas EJ, Zody MC, Mauceli E, Xie X, Breen M, Wayne RK, Ostrander EA, Pointing CP, Galibert F, Smith DR, deJong PJ, Kirkness E, Alvarez P, Biagi T, Brockman W, Butler J, Chin CW, Cook A, Cuff J, Daly MJ, DeCaprio D, Gnerre S, Grabherr M, Kellis M, Kleber M, Bardeleben C, Goodstadt L, Heger A, Hitte C, Kim L, Koepfli KP, Parker HG, Pollinger JP, Searle SMJ, Sutter NB, Thomas R, Webber C, Broad Institute Sequencing Platform members and Lander ES, (2005) Genome sequence, comparative analysis and haplotype structure of the domestic dog, Nature 438:803-819

Linderholm A, (2008) Migration in prehistory: DNA and stable isotope analysis of Swedish skeletal material Stockholm: Stockholm University, Faculty of Humanities, Department of Archaeology and Classical Studies, Theses and papers in scientific archaeology, ISSN 1400- 7835 ; 10

Malmström H, Storå J, Dalén L, Holmlund G, Götherström A, (2005) Extensive human DNA contamination in extracts from ancient dog bones and teeth, Molecular Biology and Evolution 22:2040-2047

Malmström H, Svensson EM, Gilbert MTP, Willerslev E, Götherström A, Holmlund G, (2007) More on contamination: The use of asymmetric molecular behaviour to identify

48 authentic ancient human DNA, Molecular Biology and Evolution 24:998-1004

Malmström H, Vilà C, Gilbert MTP, Storå J, Willerslev E, Holmlund G, Götherström A, (2008) Barking up the wrong tree: modern northern European dogs fail to explain their origin, BMC Evolutionary Biology 8:71

Mardis ER, (2008) Next generation DNA sequencing methods, Annual Review of Genomics and Human Genetics 9:387–402

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM, (2005) Genome sequencing in microfabricated high-density picolitre reactors, Nature 437:376–380

Maricic T, Pääbo S (2009) Optimization of 454 sequencing library library preparation from small amounts of DNA permits sequence determination of both DNA strands, BioTechniques 46:51-57

Martinsson-Wallin H (2008) Land and sea animal remains from Middle Neolithic Pitted Ware sites on Gotland island in the Baltic sea, Sweden In: Terra australis 29:171-173 eds: G. Clark, F. Leach & S. O'Connor

Meyer M, Briggs AW, Maricic T, Höber B, Höffner B, Krause J, Weihmann A, Pääbo S, Hofreiter M, (2008) From micrograms to picograms: quantitative PCR reduces the material demands of high-throughput sequencing, Nucleic Acids Research 36:1-6

Millar CD, Huynen L, Subramanian S, Mohandesan E, Lambert DM, (2008) New developments in ancient genomics, Trends in Ecology & Evolution 23: 386-393

Miller W, Drautz DI, Ratan A, Pusey B, Qi J, Lesk AM, Tomsho LP, Packard MD. Zhao F, Sher A, Tikhonov A, Raney B, Patterson N, Lindblad-Toh K, Lander ES, Knight JR. Irzyk GP. Fredrikson KM, Harkins TT. Sheridan S, Pringle T, Schuster SC, (2008) Sequencing the nuclear genome of the extinct woolly mammoth, Nature 456:387-390

Miller W, Drautz DI, Janecka JE, Lesk AM, Ratan A, Tomsho LP, Packard M, Zhang Y, McClellan LR, Qi J, Zhao F, Gilbert MTP, Dalén L, Arsuaga JL, Ericson PGP, Huson DH, Helgen KM, Murphy WJ, Götherström A, Schuster SC, (2009) The mitochondrial genome sequence of the Tasmanian tiger (Thylacinus cynocephalus), Genome Research 19:213-220

Mullikin JC, Brockman JA, Hansen NF, Young A, Cruz P, Shen L, Ebling H, Donahue WF, Tao W, Saranga DJ, Brand A, Rubenfield MJ, Al-Murrani SWK, Locniskar MF, Smith DR, Abrahamsen MS, NISC Comparative Sequencing Program (unpublished) Discovery of Three Million SNPs across Seven Breeds of Domestic Cat and One Wild Cat

Musil R, (1984) The first known domestication of wolves in central Europe, in Animals and archaeology, vol. 4, Husbandry in Europe, British Archaeological Reports International Series 227

Nakano M, Komatsu J, Matsuura S, Takashima K, Katsura S, Mizuno A (2003) Single- molecule PCR using water-in-oil emulsion. Journal of Biotechnology 102:117–124

49

Nielsen R & Beaumont MA, (2009) Statistical inferences in phylogeography, Molecular Ecology 18:1034-1047

Nobis G (1979) Der älteste haushund lebte vor 14 000 Jahre, Umschau 19:610

Noonan JP, Hofreiter M, Smith D, Priest JR, Rohland N, Rabeder G, Krause J, Detter JC, Pääbo S, Rubin EM (2005) Genomic sequencing of Pleistocene cave bears, Science 309:597-599

Noonan JP, Coop G, Kudaravalli S, Smith D, Krause J, Alessi J, Chen F, Platt D, Pääbo S, Pritchard JK, Rubin EM (2006) Sequencing and analysis of Neanderthal genomic DNA, Science 314:1113-1118

Olsen SJ, (1985) Origins of the domestic dog: the fossil record, University of Arizona press, Tuscon, AZ

Orlando L, Pagés M, Calvignac S, Hughes S, Hänni C (2007) Does the 43 bp sequence from an 800,000 year old cretan dwarf elephantid really rewrite the textbook on mammoths? Biology Letters 22:57-9

Ostrander EA, Wayne RK, (2005) The canine genome, Genome Research 15:1706-1716

Park JW, Beaty TH, Boyce P, Scott AF, McIntosh I (2005) Comparing whole-genome amplification methods and sources of biological samples for single-nucleotide polymorphism genotyping. Clin Chem Lab Med 51:1520–1523.

Parker HG, Kim LV, Sutter NB, Carlson S, Lorentzen TD, Malek TB, Johnson GS, DeFrance HB, Ostrander EA, Kruglyak L (2004) Genetic Structure of the Purebred Domestic Dog, Science 304:1160-116

Pennisi E, (2009) Neandertal genomics: Tales of a prehistoric human genome, Science 323:866-871

Poinar HN, Hoss M, Bada JL, Pääbo S, (1996) Amino acid racemization and the preservation of ancient DNA. Science 272:864-866

Poinar HN, Schwarz C, Qi J, Shapiro B, Macphee RD, Buigues B, Tikhonov A, Huson DH, Tomsho LP, Auch A, Rampp M, Miller W, Schuster SC, (2006) Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA. Science 311:392-394

Poulakakis N, Parmakelis A, Lymberakis P, Mylonas M, Zouros E, Reese DS, Glaberman S, Caccone A, (2006) Ancient DNA forces reconsideration of evolutionary history of Mediterranean pygmy elephantids. Biology letters 2:451–454

Pääbo S (1985) Molecular cloning of ancient Egyptian DNA. Nature 314:644-645

Pääbo S, Poinar H, Serre D, Jaenicke-Després V, Hebler J, Rohland N, Kuch M, Krause J, Vigilant L, Hofreiter M, (2004) Genetic analyses from ancient DNA. Annual Reviews Genetics 38:645-679

Rambaut A, Ho SYW, Drummond AJ, Shapiro B, (2008) Accommodating the effect of ancient DNA damage on inferences of demographic histories, Molecular Biology and Evolution 26:245-248.

Randi E, Lucchini V, (2002) Detecting rare introgression of domestic dog genes into wild

50 wolf (Canis lupus) populations by Bayesian admixture analyses of microsatellite variation, Conservation Genetics 3:29-43

Rannala B, Yang Z, (2003) Bayes Estimation of Species Divergence Times and Ancestral Population Sizes Using DNA Sequences From Multiple Loci, Genetics 164:1645-1656

Riesenfeld CS, Schloss PD, Handelsman J (2004) Metagenomics: genomic analysis of microbial communities. Annual Reviews Genetics 38:525-552

Rightmire GP, (1998) Human evolution in the Middle Pleistocene: the role of Homo heidelbergensis, Current Anthropology

Rosenberg NA, Nordborg M, (2002) Genealogical trees, coalescent theory, and the analysis of genetic polymorphisms, Nature Reviews Genetics 3: 380-390

Rosenberg NA, (2002) The probability of topological concordance of gene trees and species trees, Theoretical Population Biology 61:225-247

Ronaghi M, Karamohamed S, Pettersson B, Uhlén M, Nyrén P (1996) Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem 242:84-89

Ronaghi M, Uhlen M, Nyren P (1998) A sequencing method based on real-time pyrophosphate Science 281:363, 365

Sablin MV, Khlopachev GA, (2002) The earliest Ice Age dogs: evidence from Eliseevichi I, Current Anthropology 43:795–799

Sablin MV, Khlopachev GA, (2003) Die ältesten Hunde aus Eliseevici I (Russland), Archäologisches Korrespondenzblatt 33:309–316.

Sanger F, Nicklen S.Coulson AR (1977) DNA sequencing with chain-terminating inhibitors, Proceedings of the National Academy of Sciences of the United States of America 74:5463−5467

Savolainen P, Zhang J, Luo J, Lundeberg J, Leitner T, (2002) Genetic evidence for an East Asian origin of domestic dogs Science 298:1610-1613

Savolainen P, Leitner T, Wilton AN, Matisoo-Smioth E, Lundeberg J, (2004) A detailed picture of the Australian dingo, obtained from the study of mitochondrial DNA, Proceedings of the National Academy of Science of the United States of America 101:12387- 12390

Short AD, Kennedy LJ, Forman O, Barnes A, Fretwell N, Wiggall R, Thomson W, Ollier WER, (2005) Canine DNA subjected to whole genome amplification is suitable for a wide range of molecular applications. Journal of Heredity 96:829–835

Smith TF, Waterman MS, (1981) Identification of molecular subsequences, Journal of molecular Biology, 147:195-197

Skoglund P, (2009) On the tail of lost dogs: mitochondrial DNA sheds light on the evolutionary fate of pre-Columbian dogs, Project work in biology, Uppsala University, Supervised by S. Castroviejo-Fisher & J.A. Leonard.

Stiller M, Green RE, Ronan M, Simons JF, Du L, He W, Egholm M, Rothberg JM, Keates SG, Ovodov ND, Antipina EE, Baryshnikov GF, Kuzmin YV, Vasilevski AA, Wuenschell GE,

51

Termini JE, Hofreiter M, Jaenicke-Despres V, S. Pääbo (2006) Patterns of nucleotide misincorporations during enzymatic amplification and direct large-scale sequencing of ancient DNA, Proceedings of National Academy of Sciences of the United States of America 103:13578–13584

Sutter NB, Ostrander EA, (2004) Dog star rising: the canine genetic system, Nature Reviews Genetics 5:900-910

Sutter NB, Eberle MA, Parker HG, Pullar BJ, Kirkness EF, Kruglyak L, Ostrander EA, (2004) Extensive and breed-specific linkage disequilibrium in Canis familiaris, Genome Research 14: 2388-2396

Svensson EM, Anderung C, Baubliene J, Persson P, Malmström H, Smith C, Vretemark M, Daugnora L, Götherström A, (2007) Tracing genetic change over time using nuclear SNPs in ancient and modern cattle, Animal Genetics 38:378-383

Takahata N, (1989) Gene genealogy in three related populations: consistency probability between gene and population trees, Genetics 122:957-966

Thompson MD, Bowen RAR, Wong BYL, Antal J, Liu Z, Yu H, Siminovitch K, Kreiger N, Rohan TE, Cole DEC (2005) Whole genome amplification of buccal cell DNA: genotyping concordance before and after multiple displacement amplification. Clin Chem Lab Med 43:157–162

Thornton K, Andolfatto P, (2006) Approximate Bayesian Inference reveals evidence for a recent, severe, bottleneck in a Netherlands population of Drosophila melanogaster, Genetics 172:1607-1619

Thurston ME, (1996) The lost history of the canine race, Andrews & Mcmeel Kansas City

Valdiosera C, García N, Dalén L, Smith C, Kahlke RD, Lidén K, Angerbjörn A, Arsuaga JL, Götherström, (2006) Typing single polymorphic nucleotides in mitochondrial DNA as a way to access Middle Pleistocene DNA, Biology Letters 22:601–603

Wakeley J, Hey J, (1997) Estimating ancestral population parameters, Genetics 145:847- 855.

Wall JD, Kim SK, (2007) Inconsistencies in Neanderthal genomic DNA sequences. PLoS Genetics 3:1862–1866

Wayne RK, Leonard JA, Cooper A, (1999) Full of sound and fury: the recent history of ancient DNA, Annual Review in Ecology and Systematics 30:457-77

Wayne RK, Ostrander EA, (2007) Lessons learned from the dog genome, Trends in Genetics 23:557-567

Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO, (2004) Environmental genome shotgun sequencing of the Sargasso Sea, Science 304:66- 74.

Verginelli F, Capelli C, Coia V, Musiani M, Falchetti M, Ottini L, Palmirotta L, Tagliacozzo A, De Grossi Mazzorin I, Mariani-Costantini R, (2005) Mitochondrial DNA from Prehistoric Canids Highlights Relationships Between Dogs and South-East European Wolves,

52

Molecular Biology and Evolution 22, 2541-2551

Vilà C, Savolainen P, Maldonado JE, Amorim IR, Rice JE, Honeycutt RL, Crandall KA, Lundeberg J, Wayne RK, (1997) Multiple and ancient origins of the domestic dog, Science 276: 1687-1689

Vilà C, Maldonado JE, and Wayne RK (1999) Phylogenetic relationships, evolution, and genetic diversity of the domestic dog, Journal of Heredity 90:1

Vilà C, Walker C, Sundqvist A-K, Flagstad Ø, Andersone Z, Casulli A, Kojola I, Valdmann H, Halverson J, Ellegren H, (2003) Combined use of maternal, paternal and bi-parental genetic markers for the identification of wolf-dog hybrids, Heredity 90:70-24

Vilà C, Seddon J, Ellegren H, (2005) Genes of domestic mammals augmented by backcrossing with wild ancestors, Trends in Genetics 21:214-218

Wakeley J, (2008) Coalescent Theory: An Introduction. Roberts & Company Publishers, Greenwood Village, Colorado, ISBN: 0-9747077-5-9

Willerslev E, Cooper A, (2005) Ancient DNA, Proceedings of the Royal Society B 272:3-16

Willerslev E, Cappellini E, Boomsma W, Nielsen R, Hebsgaard MB, Brand TB, Hofreiter M, Bunce M, Poinar HN, Dahl-Jensen D, Johnsen S, Steffensen JP, Bennike O, Schwenninger J- L, Nathan R, Armitage S, de Hoog C-J, Alfimov V, Christl M, Beer J, Muscheler R, Barker J, Sharp M, Penkman KEH, Haile J, Taberlet P, Gilbert MTP, Casoli A, Campani E, Collins MJ,(2007) Ancient biomolecules from deep ice cores reveal a forested southern Greenland, Science 317:111-114

Yang DY, Eng B, Waye JS, Dudar JC, Saunders SR, (1997) Improved DNA extraction from ancient bones using silica-based spin columns, American Journal of Physical Anthropology 105:539-543

Zhang Z, Schwartz S, Wagner L, Miller W, (2000) A Greedy Algorithm for Aligning DNA Sequences, Journal of Computational Biology 7:203-214

Österholm I, (1989) Bosättningsmönster på Gotland under stenåldern. En analys av fysisk miljö, ekonomi och social struktur. Theses and Papers in Archaeology 3. Visby: Gotland University

53