
Genetic structure in Finland and Sweden: Aspects of Population History and Gene Mapping

Elina Salmela Department of Medical Genetics, Haartman Institute, Research Programs Unit, Molecular Medicine, and Institute for Molecular Medicine Finland FIMM University of Helsinki

Academic dissertation To be presented, with the permission of the Faculty of Medicine of the University of Helsinki, for public examination in Auditorium XII, Main Building, Fabianinkatu 33, Helsinki, on October 5th 2012, at 12 noon

Helsinki 2012 Supervisors Professor Juha Kere Department of Medical Genetics University of Helsinki Helsinki, Finland Folkhälsan Institute of Genetics Helsinki, Finland Department of Biosciences and Nutrition Karolinska Institutet Huddinge, Sweden

Docent Päivi Lahermo Institute for Molecular Medicine Finland FIMM University of Helsinki Helsinki, Finland

Reviewers Professor Jaakko Ignatius Department of Clinical Genetics University Hospital Turku, Finland

Professor Marjo-Riitta Järvelin School of Public Health Imperial College London London, UK

Opponent Docent Mattias Jakobsson Department of Evolutionary Biology Uppsala University Uppsala, Sweden

ISBN 978-952-10-8190-3 (paperback) ISBN 978-952-10-8191-0 (PDF) http://ethesis.helsinki.fi/

Layout: Jere Kasanen Cover: Ancestries by Admixture. Modified from Figure 14.

Unigrafia, Helsinki 2012

To my family

Houkat ja viisahat uurteet otsillaan että mihin ja miksi ja mistä, mikä johti? Aika ei paljasta arvoituksiaan, meidän mukanaan niitä vain käytävä on kohti.

Pauli Hanhiniemi: Äärelä (2006) In Hehkumo: Muistoja tulevaisuudesta

Fools and wise people wonder with furrowed brows: what, where does it lead to, why, and where from? Time does not tell us its mysteries – all we can do is wander with it to meet them.

Translated by Sara Norja

Contents

List of original publications

Abbreviations

Abstract

Tiivistelmä

Introduction

Review of the literature
  1. Basics of population genetics
    1.1. Inheritance
    1.2. Allele frequencies
      1.2.1. Mutation
      1.2.2. Genetic drift
      1.2.3. Selection
      1.2.4. Migration
    1.3. Genotypes
    1.4. Haplotypes, recombination, and linkage disequilibrium
  2. Human genome structure
    2.1. Types and patterns of variation
      2.1.1. Single nucleotide polymorphisms (SNPs)
      2.1.2. Microsatellites
      2.1.3. Structural variation
    2.2. Haplotype structure
  3. Gene mapping strategies
    3.1. Linkage analysis
      3.1.1. Suitability for different populations and phenotypes
    3.2. Association analysis
      3.2.1. Suitability for different populations
      3.2.2. Suitability for different phenotypes
    3.3. Complementary methods
  4. Genetic inference of population history
    4.1. General principles
    4.2. Genome-wide SNP datasets in studies of population history
    4.3. The multidisciplinary inference of human past
  5. Finland
    5.1. Population prehistory and history
    5.2. Language
    5.3. Genetic structure
      5.3.1. Finns compared to other European populations
      5.3.2. Eastern influence in the Finnish gene pool
      5.3.3. Finns and their Saami neighbors
      5.3.4. Genetic structure within Finland
      5.3.5. Phenotypic variation in Finland
      5.3.6. East-west difference in cardiovascular diseases in Finland
      5.3.7. Age estimates of Finnish genes
  6. Sweden
    6.1. Population prehistory and history
    6.2. Language
    6.3. Genetic structure
      6.3.1. Swedes compared to other European populations
      6.3.2. Genetic structure within Sweden

Aims of the study

Materials and methods
  1. Samples
    1.1. Finns (I, II, III, IV)
    1.2. Swedes (II, III, IV)
    1.3. Reference populations (III, IV)
    1.4. Simulated population data (I)
  2. Markers, genotyping, and quality control
  3. Analyses
    3.1. Model-free clustering of individuals: multidimensional scaling
    3.2. Model-based clustering of individuals: Structure, Admixture, and Geneland
    3.3. IBS distributions within and between populations
    3.4. SNP-based inspection of eastern admixture
    3.5. Distances between populations: FST and allele frequency differences
    3.6. Hierarchical population subdivision, AMOVA, and inbreeding
    3.7. Linkage disequilibrium
    3.8. Correlation of genetic and geographic distances
    3.9. Tests of the subpopulation difference scanning (SDS) approach
    3.10. Enrichment analyses

Results
  1. Population structure in Finland and Sweden
    1.1. Model-free clustering of individuals: multidimensional scaling
    1.2. Model-based clustering of individuals: Structure, Admixture, and Geneland
    1.3. IBS distributions between populations
    1.4. SNP-based inspection of eastern admixture
    1.5. Distances between populations: FST and allele frequency differences
    1.6. Hierarchical population subdivision, AMOVA, and inbreeding
    1.7. IBS distributions within populations
    1.8. Linkage disequilibrium
    1.9. Correlation of genetic and geographic distances
  2. Subpopulation difference scanning (SDS) approach
    2.1. Population simulations
    2.2. East-west differences in the genome-wide SNP dataset

Discussion
  1. Technical aspects
    1.1. Quality control of genome-wide SNP data for population genetics
    1.2. Comparison of model-based methods of individual clustering
  2. Population structure in Finland and Sweden
    2.1. Comparison of Finland and Sweden
    2.2. Finland
      2.2.1. Eastern influence
      2.2.2. East-west difference
      2.2.3. Genetic vs. archaeological view of Finnish population history
    2.3. Sweden
    2.4. Limitations of the results
  3. Subpopulation difference scanning (SDS) approach
    3.1. The idea
    3.2. Theoretical testing, advantages, and limitations
    3.3. Analyses of genome-wide population data

Concluding remarks and future prospects

Acknowledgements

References

List of original publications

This thesis is based on the following original publications. They are referred to in the text by their Roman numerals. In addition, some unpublished data are presented.

I Salmela E, Taskinen O, Seppänen JK, Sistonen P, Daly MJ, Lahermo P, Savontaus ML, Kere J. Subpopulation difference scanning: a strategy for exclusion mapping of susceptibility genes. J Med Genet. 2006; 43(7):590-7.

II Hannelius U, Salmela E, Lappalainen T, Guillot G, Lindgren CM, von Döbeln U, Lahermo P, Kere J. Population substructure in Finland and Sweden revealed by the use of spatial coordinates and a small number of unlinked autosomal SNPs. BMC Genet. 2008; 9:54.

III Salmela E*, Lappalainen T*, Fransson I, Andersen PM, Dahlman-Wright K, Fiebig A, Sistonen P, Savontaus ML, Schreiber S, Kere J, Lahermo P. Genome-wide analysis of single nucleotide polymorphisms uncovers population structure in Northern Europe. PLoS One. 2008; 3(10):e3519.

IV Salmela E, Lappalainen T, Liu J, Sistonen P, Andersen PM, Schreiber S, Savontaus ML, Czene K, Lahermo P, Hall P, Kere J. Swedish population substructure revealed by genome-wide single nucleotide polymorphism data. PLoS One. 2011; 6(2):e16747.

* Equal contribution.

Publications II and III have earlier appeared as part of the doctoral theses of Ulf Hannelius (2008) and Tuuli Lappalainen (Helsinki 2009), respectively. These publications have been reproduced with permission from their copyright holders. The open access publications II-IV are freely available online.

Abbreviations

AD      Anno Domini
aDNA    ancient DNA
AIM     ancestry-informative marker
AMI     acute myocardial infarction
AMOVA   analysis of molecular variance
BC      before Christ
BRCA1   breast cancer 1, early onset gene
BRCA2   breast cancer 2, early onset gene
CEU     Utah residents with ancestry from northern and western Europe
CHB     Han Chinese in Beijing, China
CI      confidence interval
cM      centimorgan
CVD     cardiovascular disease
CYP21   steroid 21-hydroxylase gene
DNA     deoxyribonucleic acid
DIP     deletion/insertion polymorphism
FDH     Finnish disease heritage
GWAS    genome-wide association study
HGDP    Human Genome Diversity Project
HLA     human leukocyte antigen
HWE     Hardy-Weinberg equilibrium
IBS     identity by state
JPT     Japanese in Tokyo, Japan
kb      kilobase
KEGG    Kyoto Encyclopedia of Genes and Genomes
LD      linkage disequilibrium
LOD     logarithm of odds
MAF     minor allele frequency
Mb      megabase
MCAD    medium-chain acyl-CoA dehydrogenase
MDS     multidimensional scaling
mtDNA   mitochondrial DNA
NHGRI   National Human Genome Research Institute
OR      odds ratio
PAR     pseudoautosomal region
PCA     principal component analysis
QC      quality control
RNA     ribonucleic acid
SDS     subpopulation difference scanning
SNP     single nucleotide polymorphism
STR     short tandem repeat
YRI     Yoruba in Ibadan, Nigeria

Abstract

The genetic structure of populations is of interest not only as a source of population history information, but also because of its importance to gene mapping studies. The main aim of this thesis was to study the genetic structure of the human populations in present-day Finland and Sweden. Although both populations have been studied with small numbers of genetic markers, the present analyses were the first to utilize genome-wide data from thousands of single nucleotide polymorphism (SNP) markers. Furthermore, this thesis introduced a novel gene mapping approach, subpopulation difference scanning (SDS), and tested its theoretical applicability to the Finnish population.

The study subjects included 280 Finnish and 1525 Swedish individuals. Of the Finns, 141 had grandparents born in western and 139 in eastern Finland. For the Swedes, the geographic information was based on their places of residence, scattered throughout the country roughly according to population density. The Finns were genotyped for 238,000 SNPs on the Affymetrix 250K StyI array and the Swedes for 550,000 SNPs on the Illumina HumanHap550 array. Most analyses were based on subsets of the 29,000 SNPs common to both platforms, and none used imputed genotypes. Genotypes from Russian, German, British and other populations served as reference data. The amount and patterns of genetic variation between populations and individuals were analyzed by standard population genetic methods using, for example, allele frequencies, identity-by-state (IBS) similarities, FST distances, and linkage disequilibrium (LD).

The results revealed that the genetic diversity within Sweden and Finland was lower than in central European reference populations, and was substantially reduced in eastern Finland. Finns also differed clearly from central Europeans, and a strong population structure existed within Finland: the genetic distance between eastern and western Finns was greater than between, for instance, the British and northern Germans. In fact, western Finns were genetically as close to Swedes as to eastern Finns. In Sweden, the overall population structure seemed clinal and lacked strong borders. The population in the southern parts of the country was relatively homogeneous and genetically close to the Germans and British, while the northern subpopulations differed from the south and also from each other. Somewhat surprisingly, however, genetic diversity in northern Sweden was not markedly reduced.

Subjects from the Swedish-speaking region in Finnish Ostrobothnia were genetically intermediate between Finns and Swedes and demonstrated clear signs of contacts with their Finnish-speaking neighbors. Notably, the geographically closest study areas in the north of Sweden (Norrbotten) and Finland (Northern Ostrobothnia) did not appear to be genetically close. Instead, the northern Swedes were genetically closer to the southwestern Finns. Interestingly, the Finns, especially eastern Finns, showed a higher genetic affinity to East Asian reference populations than did most other populations. Although a similar higher affinity was observable in Russians, they were not genetically close to eastern Finns, which suggests that the eastern influence in the Finnish population may predate the expansion of the Russian population to its current areas.

The genetic substructure within Finland and Sweden could cause problems in association studies that use geographically unmatched cases and controls, especially if the number of genotyped markers limits the possibilities for stratification correction. On the other hand, the substructure observed within Finland could also be utilized in mapping genes for diseases that show varying incidences within the country, for example cardiovascular diseases with their east-west difference. Because a gene underlying an incidence difference must itself harbor a frequency difference, the SDS mapping approach proposes that such genes could be mapped by comparing samples from high- and low-incidence subpopulations and excluding from further analyses those genome areas that do not show a (sufficient) difference between the samples. Simulations mimicking the Finnish population demonstrated that the SDS approach may work for the cardiovascular diseases, provided that a substantial portion of the incidence difference was caused by a single gene. On the other hand, analyses of the genome-wide SNP data suggested that the east-west difference in cardiovascular diseases may largely result from many genes’ combined contributions, each of which may remain too small for SDS to detect.

Overall, the patterns of population structure observed – the genetic distinctness and reduced diversity of the north European populations, the unique characteristics of northern Sweden, and the east-west difference and eastern influence in Finland – are congruent with results from earlier studies with smaller numbers of markers and from later genome-wide studies. The patterns are also generally explicable by known features of population history: The combination of mainly European but partly eastern elements in the Finnish gene pool agrees well with the contacts that the Finnish population has had with both east and west throughout its history, while the regionally varying strength of those contacts may serve to explain genetic differences between the eastern and western parts of the country. Furthermore, these east-west differences have been accentuated by the history of agricultural settlement in eastern Finland which has involved strong founder events and subsequent isolation of small breeding units. Although small size has undoubtedly characterized the population also in northern Sweden, there the most extreme drift-induced reductions in genetic diversity may have been alleviated by admixture with the neighboring populations.

In summary, the results of this thesis emphasize the capacity of genome-wide SNP data to detect patterns of population structure – also in populations that have often been assumed homogeneous, such as the Finns and Swedes. Obviously, knowledge of genome-wide population structure has immediate relevance also to studies focusing on diseases and other important phenotypes.

Tiivistelmä

The genetic structure of a population, for example a human population, reflects the history of that population. Genetic structure also affects which populations are the most favorable targets for different gene mapping methods. The main aim of this doctoral thesis was to examine the genetic structure of the population in Finland and Sweden. The structure of both populations has previously been investigated with a few markers at a time, but this work was the first to use genome-wide data on thousands of single nucleotide polymorphisms (SNPs). In addition, this work introduced a new gene mapping method (subpopulation difference scanning, SDS) and examined its theoretical suitability for use in the Finnish population.

The study material comprised 280 Finnish and 1525 Swedish subjects. For the Finns, the grandparents’ birthplaces were known: 141 of the subjects represented western Finland and 139 eastern Finland. For the Swedish subjects, the places of residence were known, and they covered the whole country roughly in proportion to population density. The Finns were studied for 238,000 SNP markers on Affymetrix 250K StyI arrays and the Swedes for 550,000 SNP markers on Illumina HumanHap550 arrays. Most of the analyses were based on the 29,000 SNP markers shared by the arrays; no computational genotype imputation was performed. Russians, Germans, and the British, among others, were used as reference data. Genetic variation between populations and individuals was surveyed with established population genetic methods, based on, for example, allele frequencies, FST distances, and linkage disequilibrium (LD).

The results showed that genetic diversity in Finland and Sweden is lower than in central Europe, and particularly low in eastern Finland. The Finns differed considerably from the central European reference populations, and a clear population structure was observed within Finland: the genetic distance between eastern and western Finns was greater than, for example, that between the British and northern Germans, and western Finns were in fact genetically as close to Swedes as to eastern Finns. Within Sweden, a north-south structure was observed, but no sharp genetic borders. Southern Swedes were genetically fairly homogeneous and closely resembled Germans and the British, whereas northern Swedes differed clearly both from southern Swedes and from each other. Genetic diversity in northern Sweden was nevertheless not lower than in the south, which was somewhat surprising.

The Finnish data included a small number of Swedish-speaking Finns from the Ostrobothnian coast. Genetically, these individuals fell between Finns and Swedes, which reflects contact between the population groups speaking different languages. On the other hand, in the northern border region of Finland and Sweden, geographically close Finnish and Swedish subjects were not genetically particularly close; instead, northern Swedes were genetically closer to southwestern Finns. When the studied populations were compared with East Asians, a small but clear eastern influence was seen in the Finns, especially in eastern Finns. A similar influence was also seen in Russians, but because Russians and eastern Finns were not close to each other, the eastern influence observed in Finns probably dates from before the Russians spread to their current areas of residence.

The genetic differences observed within both Finland and Sweden are large enough to hamper genetic association studies in which the geographic distributions of the case and control samples differ, especially if so few markers are studied that genome-wide correction methods cannot be used. On the other hand, the Finnish population structure could possibly be exploited in searching for genes that might underlie within-Finland incidence differences of certain diseases (for example the east-west difference in heart diseases). The basic idea is that a difference in disease incidence can be explained only by a genetic factor whose regional frequency distribution matches the incidence distribution of the disease. Thus, if samples drawn from the populations of a high-incidence and a low-incidence area are compared, those genetic factors that do not differ between the two areas can be excluded from further consideration. The preconditions for this SDS mapping method to work were tested in computer simulations mimicking Finnish population history, and the method was found to be able to work for heart diseases, provided that a sufficient portion of their incidence difference is caused by a single gene. However, the genome-wide analyses suggested that numerous genes underlie the incidence difference in heart diseases, each of which may on its own have so small an effect that it cannot necessarily be localized with the SDS method.

The population structures seen in this study – the reduced diversity of the north European populations and their accentuated genetic distance from central Europeans, the distinctiveness of northern Sweden, and the eastern influence and within-country east-west difference observed in Finland – largely correspond to results obtained in earlier studies based on smaller numbers of markers and also in genome-wide studies carried out since. They are also in line with known population history: the Finns’ mainly European-type genome with its eastern features is explained by the contacts that Finland has had in different directions throughout the ages, and the larger share of western contacts in western Finland may cause part of the difference within Finland. The difference has been further increased by the unusual population history of eastern Finland, in which the spread of agriculture involved strong founder effects and in which small breeding units have led to genetic drift later on as well. Although the population has been small in northern Sweden too, the most extensive genetic drift there has probably been curbed by admixture between neighboring populations.

Taken as a whole, the results of this doctoral study demonstrate the usefulness of genome-wide SNP data in studies of population structure. Although more limited datasets can also shed light on population history phenomena, in studies of diseases and other visible traits it is specifically knowledge of the genome-wide structure that is essential. These results also serve as a reminder that small cultural differences do not guarantee the genetic homogeneity of a population, as the sharp genetic differences observed in Finland well demonstrate.

Introduction

Population genetics studies the genetic structure of populations – groups of (potentially) interbreeding individuals – and the forces affecting that structure. Insights into a population’s structure can be of interest for several reasons. Because the structure results from various forces that have affected the population, it carries information about the population’s past; in gene mapping efforts, the genetic variation and structure of the study population determine the potential success of different mapping approaches; in forensics, information on population structure and variant frequencies are centrally important in assigning correct probabilities for random and observed genetic matches; in the conservation of endangered species, the maintenance of genetic variability requires knowledge of population structure, and so on. For many decades, the markers available to population structure studies were limited to blood groups and a small number of other proteins, until the development of deoxyribonucleic acid (DNA) methods enabled the use of molecular markers. Since then, the majority of population genetic studies have focused on molecular variation in the mitochondrial DNA (mtDNA) and the Y chromosome. Their uniparental mode of inheritance and lack of recombination make them the markers of choice for studies of branching and timing as well as for the detection of sex-specific phenomena. On the other hand, each is only a single locus, subject to random forces such as genetic drift, and may thus not reflect the full history of a population. Their contribution to the phenotypic variation of populations is also limited. Fortunately, it has recently become technically feasible to complement them by analyses of genome-wide sets of thousands of single nucleotide polymorphisms (SNPs). 
This thesis studies the population structure of humans (Homo sapiens Linnaeus 1758) in Finland and Sweden using autosomal markers, with an emphasis on the population history implications and gene mapping consequences. These northern European populations are of interest because their remote geographic location has led to a restricted gene flow into, within, and between them. Additionally, after a long history of small sizes and low densities – partly due to the relatively late introduction of agriculture – the populations have recently expanded. Whereas the genetic structure in both populations has previously been studied using protein, mtDNA and Y-chromosomal markers, studies III and IV of this thesis constitute the first published analyses of population structure within Finland and Sweden that were based on genome-wide SNP data.

With molecular markers, genetic structure in Finland has been studied more extensively than that in Sweden. This difference may relate to interest awakened by the obvious discrepancy between the eastern origin of the Finns’ language and their predominantly European gene pool. Furthermore, it may partly stem from research efforts directed toward diseases of Finns, the existence of which is intimately related to population history. Such studies have demonstrated the feasibility of Finns as a target population for standard gene mapping methods. Meanwhile, it may also be possible to utilize the typical features of the Finnish population structure by more specific methods. This is exemplified by Study I, which introduces a novel approach (subpopulation difference scanning, SDS) potentially useful in mapping disease genes that have drifted to differing frequencies in otherwise closely related subpopulations.

In humans, studies of population structure can complement population history information derived from various other sources including written historical records, archaeology, and linguistics. In theory, a population’s genetic structure provides a source of information independent of data and conclusions from the other disciplines. In practice, however, the genetic structure observed is usually compatible with several population history scenarios, and the confidence intervals of any timing estimates typically remain very wide, rendering it impossible to base sensible conclusions as to a population’s past on genetic data alone. Thus, the interpretation of population genetic results in terms of population history will rely heavily on knowledge and insights from other disciplines. Because of the interdisciplinary nature of the subject, this thesis attempts to review non-genetic literature on the history of its study populations in a wider fashion than is usually possible in standard genetic studies. Likewise, the Review of the Literature section aims at providing potential non-geneticist readers with some basic knowledge of population genetics essential for interpretation of this and other genetic studies.

Review of the literature

The genetic structure of a population results from a combination of various processes, the basic features of which are briefly described in the first section of this literature review. Their consequences for the human genome structure, gene mapping strategies, and population history inference are discussed in the three subsequent sections. The last two sections review the history, linguistics, and population structure of Finland and Sweden. Treatment of these in the wider context of Europe and the region can be found for example in Lappalainen (2009).

This thesis refers to many geographical areas by their names in the local language rather than in English (for example Skåne and Häme instead of Scania and Tavastia). For simplicity, these names are used also in discussing the past, when the current entities were still nonexistent. Analogously, the population history review concerns those moving into and residing in the current area of Finland and Sweden since ancient times, although the emergence of Finns and Swedes as nations is naturally a much later development. Additionally, this thesis uses the term "history" to refer to past events regardless of the existence of written sources, i.e., referring to both history and prehistory.

Unless otherwise indicated, information in the first section is based on Hamilton (2009), Hartl & Clark (2007), and Hedrick (2005), in sections 2 and 3 on Strachan & Read (2011) and Read & Donnai (2011), and in section 4.1 on Jobling et al. (2004).

1. Basics of population genetics

1.1. Inheritance

Diploid organisms have two alleles in each locus. In sexual reproduction, only one of the alleles is passed to an offspring, and the offspring inherits one allele from each parent. Which of the two alleles is inherited is determined randomly and independently for each offspring, as originally established by Mendel (Mendel’s first law). In human beings, the exceptions to this pattern are the mitochondrial DNA (mtDNA) and the sex chromosomes. The mtDNA is passed from mothers to all offspring; thus, it forms maternal lineages. The Y chromosome, in turn, is inherited along a paternal lineage from father to sons. Mothers pass an X chromosome to both daughters and sons, and daughters receive another X chromosome from their fathers (Figure 1).

Figure 1. Inheritance in a three-generation pedigree (paternal grandfather and grandmother, maternal grandfather and grandmother; father and mother; son and daughter). Mitochondrial DNA (circles) is inherited maternally and the Y chromosome (small bars) paternally without recombinations, whereas the autosomal chromosomes (pairs of long bars) recombine in each meiosis. X chromosomes not shown.

1.2. Allele frequencies

Four evolutionary factors can change the allele frequencies of a population: mutation, migration, selection, and genetic drift. In the absence of these factors, the allele frequencies of a population will remain constant from one generation to the next. Notably, whereas selection depends on phenotype, the other three factors are in principle independent of the phenotypic effects of the locus in question.

1.2.1. Mutation

Mutation is a random, molecular-level process that results in permanent differences between the ancestral and descendant copies of a DNA sequence. It is the ultimate source of all genetic variation. Mutations range from single-base DNA changes to large structural alterations, and their phenotypic or fitness effects range from advantageous through silent or neutral to highly deleterious. The frequency of mutations depends on the type of mutation and the organism in question (for humans, see section 2.1). In general, mutations are rare and will affect the allele frequencies of a population only in the long term.

1.2.2. Genetic drift

Genetic drift is a random process caused by sampling errors in the proportions at which individuals, gametes, and alleles from the parental generation contribute to the gene pool of the next generation. It results in random allele frequency fluctuations which, despite being independent in successive generations, tend to accumulate over time. In the absence of balancing effects of selection or new variants introduced to the population by mutation or migration, genetic drift will lead to the fixation of one allele in the population and loss of the other(s); the probability of an eventual fixation of an allele is equal to its initial frequency. Thus, genetic drift leads to a reduction in a population’s genetic diversity. Meanwhile, it causes divergence between populations, as they can become fixed for different alleles.

Since sampling errors are largest in small samples, the effect of genetic drift is strongest in small populations (Figure 2). Even short periods of small population size – population bottlenecks – can lead to substantial genetic drift and a marked reduction in genetic diversity. Another instance of considerable genetic drift is founder effect: when a small group of individuals emigrates to form a new population, the allele frequencies in that founder group – and subsequently in the established population – may differ strikingly from those of the ancestral population. Genetic drift can also have a stronger effect on the population than its census size would suggest, if the effective size of the population is small, for example due to an unequal breeding sex ratio or a large variation in family size. Notably, mtDNA and the Y chromosome have lower effective population sizes due to their uniparental inheritance, which makes them more prone to genetic drift than autosomal, biparentally inherited loci.

Figure 2. Allele frequency fluctuations caused by genetic drift in populations of size 50 (A), 300 (B), and 1500 (C). In each panel, the x-axis shows the generation (0 to 100) and the y-axis the allele frequency (0.0 to 1.0). Each line denotes allele frequency from one simulation of 100 generations. For example, in small population A, of the initial 20 loci, only 6 remain polymorphic after 100 generations, whereas in the larger populations, none of the alleles become fixed, but their final frequencies vary between ca. 0.2 and 0.85 in B and between 0.4 and 0.65 in C. Initial allele frequency in all simulations was 0.5; 20 simulations were done per population size. Calculated with a population simulator written by E.S. and described in Lappalainen et al. 2010.
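The kind of simulation summarized in Figure 2 can be illustrated with a minimal sketch. It is not the simulator used for the figure: it assumes a simple Wright-Fisher model in which a population of N diploid individuals carries 2N allele copies and each copy of the next generation is drawn at random from the current gene pool. The function names and parameters (`simulate_drift`, `count_polymorphic`, `pop_size`) are hypothetical and chosen only for this illustration.

```python
import random


def simulate_drift(pop_size, generations=100, p0=0.5, rng=None):
    """Return one allele-frequency trajectory under pure genetic drift.

    Assumption: a Wright-Fisher population of `pop_size` diploid individuals
    (2 * pop_size allele copies) with no mutation, selection, or migration.
    """
    if rng is None:
        rng = random.Random()
    copies = 2 * pop_size
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Each allele copy of the next generation is sampled at random from
        # the current gene pool, so the allele count is binomially distributed.
        count = sum(1 for _ in range(copies) if rng.random() < p)
        p = count / copies
        trajectory.append(p)
    return trajectory


def count_polymorphic(pop_size, n_loci=20, generations=100, seed=None):
    """Count independent loci that are still polymorphic (0 < p < 1) at the end."""
    rng = random.Random(seed)
    finals = [simulate_drift(pop_size, generations, 0.5, rng)[-1]
              for _ in range(n_loci)]
    return sum(1 for p in finals if 0.0 < p < 1.0)
```

Calling `count_polymorphic(50)` and `count_polymorphic(1500)` would be expected to leave fewer polymorphic loci in the smaller population, although the exact counts vary from run to run. Once an allele frequency reaches 0 or 1 it stays there, which corresponds to the loss or fixation described above.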

1.2.3. Selection If individuals with a certain phenotype leave more descendants than others, the alleles contributing to that phenotype (if any) rise in frequency. This process, called selection, is the mechanism producing evolutionary adaptation. Selection can act for or against any genotype(s) and affect any life stage of an individual, for example viability, mating success (in which context it is often called sexual selection) or fecundity. The effects of selection on a population’s allele frequencies depend on the strength of selection for or against a given genotype, on genotype frequencies, and on population size; in small populations, the effects of selection can easily be overridden by genetic drift. Selection can also affect loci that are themselves neutral: in a phenomenon called genetic hitch-hiking, alleles can change in frequency due to selection acting on a nearby locus. The general importance of selection in shaping and maintaining a population’s genetic variation has been highly debated.
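How the strength of selection and the genotype frequencies jointly determine the change in allele frequency can be sketched with the standard one-locus viability selection recursion. The sketch assumes an infinitely large, randomly mating population (so drift is ignored) and constant relative fitnesses; the function name and the fitness parameters are hypothetical.

```python
def next_freq_under_selection(p, w_aa, w_ab, w_bb):
    """Allele A frequency after one generation of viability selection.

    p is the current frequency of allele A; w_aa, w_ab, and w_bb are the
    relative fitnesses of genotypes AA, AB, and BB. Genotypes are assumed
    to be in Hardy-Weinberg proportions before selection.
    """
    q = 1.0 - p
    mean_fitness = p * p * w_aa + 2 * p * q * w_ab + q * q * w_bb
    # Allele A is carried by all AA survivors and half of the AB survivors.
    return (p * p * w_aa + p * q * w_ab) / mean_fitness
```

With equal fitnesses the frequency stays unchanged, whereas with fitnesses 1.1, 1.0, and 0.9 an allele at frequency 0.5 rises to 0.525 in one generation.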

1.2.4. Migration The migration of individuals between populations (or the movement of their gametes, as with plant pollen) can lead to gene flow, i.e., the inclusion of the individuals’ alleles in the new population’s gene pool. Gene flow can change allele frequencies in either or both populations. These changes can efficiently oppose the effects of genetic drift: they reduce the divergence between the populations by harmonizing allele frequencies across them, and maintain the variation within the populations by replacing alleles that could otherwise become lost. Within a population, migration patterns can lead to deviations from random mating, i.e., to population substructure. One such spatial pattern that is frequently observed is isolation by distance, where geographically close individuals are most likely to mate, and the probability of mating will decrease with distance.
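The harmonizing effect of gene flow can be sketched with a simple two-population model. The model is an assumption made for illustration: migration is symmetric, a fraction m of each gene pool is replaced per generation, and drift, selection, and mutation are ignored. The function name is hypothetical.

```python
def symmetric_gene_flow(p1, p2, m, generations):
    """Iterate allele frequencies of two populations exchanging migrants.

    Each generation a fraction m of each population's gene pool is
    replaced by alleles from the other population.
    """
    history = [(p1, p2)]
    for _ in range(generations):
        p1, p2 = (1 - m) * p1 + m * p2, (1 - m) * p2 + m * p1
        history.append((p1, p2))
    return history
```

Under these assumptions the frequency difference between the populations shrinks by the factor (1 - 2m) every generation, so both frequencies converge toward their common mean.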

1.3. Genotypes

Under a set of simplifying assumptions, the genotype frequencies of a population follow the Hardy-Weinberg equation $p^2 + 2pq + q^2 = 1$. In this equation, $p$ is the frequency of allele A, $q = 1 - p$ is the frequency of allele B, with $p^2$, $2pq$, and $q^2$ being the frequencies of the genotypes AA, AB, and BB. This equation will hold for a biallelic locus in a population of a diploid, sexually reproducing species with non-overlapping generations. Furthermore, the population must mate randomly and have equal allele frequencies for males and females; selection, mutation, migration, and genetic drift need to be absent; and the union of gametes in fertilization must be random. However, even when some of these conditions go unmet, the equation will often hold approximately, for example in finite populations (i.e., in the presence of genetic drift) with overlapping generations. A population whose genotype frequencies match those predicted by the above equation is in Hardy-Weinberg equilibrium (HWE). Deviations from HWE can signal deviations from the equation’s basic assumptions. For example, relative to the equilibrium frequencies, positive assortative mating leads to an excess of homozygotes, but negative assortative mating to an excess of heterozygotes. Likewise, the presence of diverged subpopulations within a population will lead to a deficiency of heterozygotes compared to a situation in which the same population is mating randomly (Figure 3). Such heterozygote deficiency is measured by the F statistics (for example FST) that are used to detect population structure. Note that nonrandom mating patterns – unlike the factors listed in section 1.2 – do not change the allele frequencies of the population, only the distribution of the alleles into genotypes: if the population becomes randomly mating (and otherwise fulfills the criteria above), its genotype frequencies will reach HWE in the following generation.
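As a worked illustration of the Hardy-Weinberg expectation, the sketch below derives expected genotype counts from observed counts and summarizes the deviation with a Pearson chi-square statistic. The function names are hypothetical, and the chi-square comparison is one common choice of test, not a method taken from the text.

```python
def hwe_expected_counts(n_aa, n_ab, n_bb):
    """Expected AA, AB, BB counts under Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    return n * p * p, 2 * n * p * q, n * q * q


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square statistic of observed versus expected counts."""
    observed = (n_aa, n_ab, n_bb)
    expected = hwe_expected_counts(n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

For example, counts of 25, 50, and 25 match the expectation exactly, whereas counts of 40, 20, and 40 (a heterozygote deficiency at the same allele frequency) give a statistic of 36.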

[Figure 3, panel A: allele frequency in total population 1/2; subpopulation allele frequencies 1/6 and 5/6; heterozygote frequency 20/72 = 28%. Panel B: allele frequency in total population 1/2; subpopulation allele frequencies 1/2 and 1/2; heterozygote frequency 36/72 = 50%.]

Figure 3. Genotype frequencies in a subdivided population (A) and a randomly mating population (B). Each subpopulation is panmictic and has genotype frequencies determined by the Hardy-Weinberg equilibrium, but the population consisting of two diverged subpopulations (A) has fewer heterozygotes than the population with the same overall allele frequency but no subdivision (B). Black and white circles represent two types of alleles, and ovals demarcate alleles of one individual.
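The heterozygote deficiency in Figure 3 can be reproduced with a few lines of arithmetic. The sketch assumes equally sized, panmictic subpopulations and uses the common definition FST = 1 - H_S / H_T, where H_S is the mean expected heterozygosity within subpopulations and H_T the expected heterozygosity of the pooled population; the function name is hypothetical.

```python
from fractions import Fraction


def wahlund_summary(subpop_freqs):
    """Return (H_S, H_T, F_ST) for equally sized, panmictic subpopulations.

    Raises ZeroDivisionError if the pooled population is monomorphic.
    """
    n = len(subpop_freqs)
    h_s = sum(2 * p * (1 - p) for p in subpop_freqs) / n
    p_total = sum(subpop_freqs) / n
    h_t = 2 * p_total * (1 - p_total)
    f_st = 1 - h_s / h_t
    return h_s, h_t, f_st
```

With the panel A frequencies 1/6 and 5/6, H_S is 5/18 (about 28%, as in the figure), H_T is 1/2, and FST is 4/9; with the panel B frequencies 1/2 and 1/2, FST is 0.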

1.4. Haplotypes, recombination, and linkage disequilibrium

According to Mendel’s second law, the alleles at two loci are inherited independently. An exception to this rule are the alleles in adjacent markers on the same chromosome. These alleles form a haplotype, and are inherited together unless a recombination occurs between them (Figure 1). Recombinations result when the homologous chromosomes (one inherited from the mother and another from the father) exchange segments in meiosis in a process called crossing over. Recombinations may create haplotypes different from those present in the parental chromosomes, thus leading to increased genetic variation in populations. Like mutation, however, recombination is a slow process, and its effect on a population’s genetic variation is therefore relatively small.

The further apart two loci are on a chromosome, the more probable it is that a recombination will occur between them. When two loci have a 1% probability of recombination, their genetic distance is defined as 1 centimorgan (cM). In humans, 1 cM roughly corresponds to a physical distance of 1 megabase (Mb), but variation in both directions is at least threefold (Strachan & Read 2011). On an extremely fine scale, recombination rates show even greater variation, producing recombination hotspots (Myers et al. 2005, Coop et al. 2008). When alleles on adjacent markers co-occur on a chromosome more often (or more rarely) than would be expected at random based on their respective frequencies, they are said to be in gametic disequilibrium or linkage disequilibrium (LD). LD commonly arises because a mutation initially occurs on a certain haplotype, and recombinations will only gradually produce other haplotypes carrying the mutation. Additionally, the level of LD can depend on many other population processes: selection for or against certain allele combinations can change haplotype frequencies relative to LD. In positive assortative mating, the decay of LD will become slower. Small populations maintain higher levels of LD than do larger populations, and LD decays faster in expanding than in constant populations (Slatkin 1994). Admixture can create strong LD, especially when highly diverged populations mix in equal proportions. Notably, processes like selection or admixture can create LD even between loci situated on different chromosomes or otherwise completely unlinked. Genomic regions with exceptional recombination patterns include the mtDNA, which does not recombine; the Y chromosome, which does not recombine apart from small pseudoautosomal regions (PARs); and the X chromosome, which, apart from PARs, recombines only in female meiosis.
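The notion of alleles co-occurring more often than expected from their frequencies can be made concrete with two standard LD measures, the coefficient D and the squared correlation r^2, together with the expected decay of D under random mating. The sketch assumes two biallelic loci and known haplotype frequencies; the function and parameter names are hypothetical, and the decay formula ignores drift, selection, and admixture.

```python
def ld_measures(f_AB, f_Ab, f_aB, f_ab):
    """Return (D, r_squared) for two biallelic loci from haplotype frequencies.

    f_ab completes the set of four haplotype frequencies (they should sum
    to 1) but is not needed for the calculation itself.
    """
    p_A = f_AB + f_Ab
    p_B = f_AB + f_aB
    d = f_AB - p_A * p_B  # observed minus expected frequency of haplotype AB
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


def ld_after(d0, recomb_fraction, generations):
    """Expected D after random mating: D_t = D_0 * (1 - c)^t."""
    return d0 * (1 - recomb_fraction) ** generations
```

For instance, with a recombination fraction of 0.01 (about 1 cM), roughly a third of the initial D is still expected to remain after 100 generations, whereas for unlinked loci (recombination fraction 0.5) D halves every generation.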

2. Human genome structure

The human genome contains ca. 3200 Mb of DNA. It consists of the nuclear genome – 22 pairs of autosomes and one pair of sex chromosomes (XX in females and XY in males) that vary in length between 48 and 249 Mb – and the small mitochondrial genome which is 16.6 kilobases (kb) in size. The total genetic length of the nuclear genome (excluding the Y chromosome) is 3615 cM (Kong et al. 2002). While the whole mitochondrial DNA was sequenced relatively early (Anderson et al. 1981), a detailed insight into the nuclear genome was first provided by the initial sequences produced by the International Human Genome Sequencing Consortium (Lander et al. 2001) and a company, Celera Genomics (Venter et al. 2001). These sequences covered about 90% and were later refined to cover more than 99% (International Human Genome Sequencing Consortium 2004) of the euchromatic portion of the nuclear genome. (The remaining 200 Mb, or 6.5% of the genome, is highly repetitive heterochromatin which is very difficult to sequence.) Since then, the International HapMap Project (see section 2.2), the 1000 Genomes Project (1000 Genomes Project Consortium 2010), and many other studies have complemented and updated these views (Lander 2011). The majority of the human genome consists of different types of repetitive sequences; transposable elements alone constitute at least 40% (Strachan & Read 2011, de Koning et al. 2011). Meanwhile, only 1.1% of the genome is protein coding, and another 2 to 4% shows evolutionary conservation across species that points to important regulatory or other functional roles (Dermitzakis et al. 2005). On the other hand, it appears that almost all of the human genome is transcribed (ENCODE Project Consortium 2007), which suggests previously unknown functions and mechanisms for the noncoding DNA. The human genome is estimated to harbor ca. 20,500 protein-coding genes (Clamp et al. 2007). Compared to other animals, this is an unexceptional number: for instance, the roundworm Caenorhabditis elegans has more than 19,000 genes in its 97-Mb genome (C. elegans Sequencing Consortium 1998, Hillier et al. 2005), and the water flea Daphnia pulex has more than 30,000 (Colbourne et al. 2011). Due to the abundance of alternative splicing (Pan et al. 2008), however, the human genome codes for a much higher number of different proteins than the number of protein-coding genes might suggest. Additionally, the human genome contains at least 6000 genes for various types of noncoding ribonucleic acids (RNAs) (Amaral et al. 2008, Read & Donnai 2011).

2.1. Types and patterns of variation

The DNA sequences of any two human beings are on average about 99.9% identical (Kidd et al. 2004). Compared to that of many other species, this degree of variation is small, and has been attributed to a recent population bottleneck in the human species (Jorde et al. 2001 and references therein). The global distribution of these differences is also relatively uniform: the differences between continents account for 3 to 10% of the genetic variation and the differences between populations within a continent for only 2 to 5%, while 84 to 95% of the genetic variation appears between individuals within populations (Barbujani et al. 1997, Rosenberg et al. 2002). Furthermore, the geographic patterns of variation between populations tend to be mostly clinal (as in Europe; Novembre et al. 2008). All genetic variants have originally been produced by mutation (see section 1.2.1). Other factors that produce variation are recombination and sexual reproduction that reshuffle these variants into new combinations in individuals. Variants whose frequency in a population exceeds an arbitrary threshold, often 1%, are termed polymorphisms. Variants rarer than this are often referred to as mutations. This may easily bring unsupported connotations of pathogenicity and recent origin, although a rare variant can be equally old and equally neutral as a common one – indeed, the same variant can obviously be common in one population and rare in another.

2.1.1. Single nucleotide polymorphisms (SNPs) In terms of numbers, the most abundant type of variation in the human genome involves single nucleotide polymorphisms (SNPs): locations of the genome where individuals may frequently differ by one DNA base (Figure 4A). Common SNPs with minor allele frequency (MAF) above 0.05 number about 9 to 10 million (International HapMap Consortium et al. 2007), i.e., on average roughly one per 300 bases, although their density along the genome varies. The rate of single-base mutations is very low, approximately $1.0 \times 10^{-8}$ to $2.5 \times 10^{-8}$ mutations per nucleotide per generation (Nachman & Crowell 2000, 1000 Genomes Project Consortium 2010). Consequently, most SNPs have likely arisen in a single mutation event in the past, which makes them attractive markers for association studies (see section 3.2).

[Figure 4, panel A (SNP): upper allele ACGAGCTGCCACG / TGCTCGACGGTGC; lower allele ACGAGCCGCCACG / TGCTCGGCGGTGC. Panel B (microsatellite): upper allele ACGAGCCACACAGCCACG / TGCTCGGTGTGTCGGTGC; lower allele ACGAGCCACACACACAGCCACG / TGCTCGGTGTGTGTGTCGGTGC.]

Figure 4. Two possible alleles of a single nucleotide polymorphism (SNP) (A) and a microsatellite (B). The SNP alleles differ by one basepair (bold) of a DNA stretch. In the microsatellite, the upper allele contains three dinucleotide (CA/GT) repeats, and the lower allele contains five.

2.1.2. Microsatellites Microsatellites, or short tandem repeats (STRs), are runs of short sequence units (1-6 nucleotides), repeated tandemly in the genome. The number of repeats can vary between individuals (Figure 4B). The human genome contains about 150,000 polymorphic microsatellites. Microsatellite mutations are typically additions or deletions of one to two repeat units, caused by polymerase errors during DNA replication. The mutation rate of microsatellites is much higher than that of SNPs, in the range of $1.5 \times 10^{-3}$ (Butler 2006). Due to their high mutation rate and multiple alleles, many microsatellites are highly polymorphic (Weber & May 1989, Litt & Luty 1989), which has made them widely useful in many genetic applications since the early 1990s. Whereas microsatellites are still in use in forensics, technical advances in high-throughput SNP genotyping in the early 2000s have made SNPs the markers of choice for gene mapping and much of population genetics; relative to microsatellites, their higher genomic density and lower mutation rate amply compensate for their lower diversity.

2.1.3. Structural variation Structural variants can be defined as genomic alterations that are larger than 1 kb in size. They include copy number variants such as insertions, deletions, and duplications, and structural alterations that do not change the copy number, such as inversions and translocations. Copy number changes involving segments less than 1 kb in size are often called indels or deletion/insertion polymorphisms (DIPs) (Feuk et al. 2006). Structural variants appear to be much more common in the genome than previously believed (Sebat et al. 2004, Iafrate et al. 2004, Tuzun et al. 2005, Redon et al. 2006), even in the nonrepetitive sequences of phenotypically normal individuals. Although structural variants are fewer in number than are SNPs, they comprise the majority of differing nucleotides between individuals (Levy et al. 2007).

2.2. Haplotype structure

The level of LD varies along the human genome. These patterns of long-range LD are usually similar between populations (De La Vega et al. 2005, Service et al. 2006, Conrad et al. 2006), although the overall levels of LD can differ as a result of population history (see section 1.4). At small scale, the LD pattern is characterized by discrete haplotype blocks of high LD that exhibit low diversity and are separated by points of recurrent historical recombinations, while the long-range LD arises from haplotype correlations between these blocks (Daly et al. 2001). The block structure, resulting from the effects of recombination hotspots and human population history, has been studied in detail by the International HapMap Project (International HapMap Consortium 2003, 2005, International HapMap Consortium et al. 2007, International HapMap 3 Consortium et al. 2010), initially focusing on four populations from three continents. Depending on the block inference method, the blocks appear 5 to 16 kb in length and carry 3.6 to 5.6 haplotypes on average. The overall block structure is similar among populations, although in the African HapMap population, the blocks are slightly shorter and harbor more diversity than in the European or Asian HapMap populations. Because of the low haplotype diversity within blocks, knowing block structure can greatly simplify the pursuit of association mapping (see section 3.2): when most of the variation within a block can be captured with a few tagging SNPs, in a genome-wide association study it suffices to genotype a few hundred thousand tagging SNPs, instead of the several million common SNPs present in the genome.
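The idea of capturing most variation with a few tagging SNPs can be sketched with a simple greedy selection on pairwise r^2 values. This is a hypothetical illustration, not the procedure used by the HapMap Project: the function name, the r^2 threshold of 0.8, and the restriction of tag candidates to SNPs that are still untagged are assumptions made here.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Pick tag SNPs greedily from a symmetric matrix of pairwise r^2 values.

    A SNP counts as captured when its r^2 with a chosen tag is at least
    the threshold. Returns the indices of the chosen tags in selection order.
    """
    untagged = set(range(len(r2)))
    tags = []
    while untagged:
        # Choose the untagged SNP that captures most of the remaining SNPs.
        best = max(
            sorted(untagged),
            key=lambda i: sum(1 for j in untagged if r2[i][j] >= threshold),
        )
        tags.append(best)
        untagged -= {j for j in untagged if r2[best][j] >= threshold} | {best}
    return tags
```

In a toy matrix where three SNPs are in high mutual LD and a fourth is independent of them, two tags suffice: one for the block and one for the lone SNP.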

3. Gene mapping strategies

Like any genetic variants, disease alleles follow the basic mechanisms of inheritance described in section 1.1: once created by mutation, they are passed from parents to offspring along with the alleles in their adjacent markers, unless separated by a recombination; over the generations, the recombinations narrow down the segments of the original haplotype that the disease chromosomes share. These mechanisms form the basis of the two main approaches of gene mapping, linkage analysis and association analysis, which are detailed in sections 3.1 and 3.2. These approaches monitor the flow and sharing of chromosome segments in pedigrees or populations by use of sets of molecular markers. None of these markers needs to be directly causal of the disease but some of them may show a pattern similar to the one expected for the disease gene. Such markers are likely to reside close to the causal locus. Neither of these approaches leads directly to the identification of a gene underlying the phenotype but only indicates its probable position in the genome – the actual gene has to be recognized by other means. In the old days, those involved the tedious process of cloning, but with the current availability of the human genome sequence, genes in the relevant area can just be sought in the databases and prioritized for further analyses like mutation screening based on (predicted) information on their function. The new sequencing methods also enable a direct search for mutations causing Mendelian diseases from whole-genome or whole-exome data. However, distinguishing the pathogenic mutation(s) from among the thousands of benign variants discovered requires intensive computational efforts of careful filtering (Bamshad et al. 2011, Gilissen et al. 2012).

3.1. Linkage analysis

Linkage analysis, in its most basic form, monitors the cosegregation of a phenotype and the alleles of a set of markers when they are inherited in a pedigree; the markers whose segregation pattern is compatible with the pattern of the phenotype are likely to reside close to the causal gene (Figure 5). Extensions of this basic principle exist, for example for quantitative phenotypes. In humans, the pedigrees available are usually not maximally informative (as specifically designed test crosses would be), and linkage mapping has to rely on computational methods. Parametric linkage analysis, applicable to phenotypes in which the inheritance pattern is known and Mendelian, produces logarithm of the odds (LOD) scores that quantify how much more likely the observed pedigree data are if the disease gene and the marker in question are linked than if they are unlinked. Nonparametric linkage analysis, suitable also for phenotypes with less clear patterns of inheritance, detects markers in which affected relatives share alleles more often than expected by chance.
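The LOD score of parametric linkage analysis can be illustrated in its simplest textbook setting: a set of phase-known, fully informative meioses in which recombinants between the disease locus and the marker can be counted directly. Real pedigrees rarely meet these assumptions and require dedicated software; the function name and parameters below are hypothetical.

```python
import math


def lod_score(n_meioses, n_recombinants, theta):
    """LOD score for phase-known, fully informative meioses.

    Compares the likelihood of the data at recombination fraction theta
    with the likelihood under no linkage (theta = 0.5).
    """
    k = n_recombinants
    n = n_meioses
    likelihood_linked = theta ** k * (1 - theta) ** (n - k)
    likelihood_unlinked = 0.5 ** n
    if likelihood_linked == 0:
        # The data are impossible at this theta (e.g. recombinants at theta = 0).
        return float("-inf")
    return math.log10(likelihood_linked / likelihood_unlinked)
```

Ten meioses without recombinants give a LOD score of about 3.0 at a recombination fraction of 0, and with one recombinant among ten meioses the score is highest at a recombination fraction of 0.1.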

Figure 5. Principle of linkage mapping. In a three-generation family, an autosomal dominant disease (black circles and squares) segregates with a short haplotype of three markers (colored bars and horizontal stripes). The disease allele appears to be inherited along with the red haplotype, and the recombinations in the two individuals on the lower right suggest that the disease gene resides close to the uppermost marker (arrow).

Naturally, in a single meiosis, many alleles throughout the genome will show an inheritance pattern compatible with that of the phenotype. Therefore, linkage analysis requires information from many meioses, and typically it is necessary to collect several pedigrees. Still, the number of informative meioses and recombinations crucially limits the resolution of linkage analysis. In some cases, the resolution can improve in subsequent analyses of ancestral haplotypes or LD or in homozygosity mapping (Lander & Botstein 1987). The inherently low resolution, on the other hand, makes linkage analyses with relatively low-density marker sets feasible: the linkage studies of the late 1990s and early 2000s routinely used sets of 300 to 500 microsatellites. Although the microsatellites were informative for linkage due to their high polymorphism, they have by now been largely replaced by SNP sets that compensate for the lower marker variability with substantially higher density and allow high-throughput genotyping.

3.1.1. Suitability for different populations and phenotypes Linkage analysis may benefit from the use of specific populations. In populations characterized by strong founder effects or other forms of extreme genetic drift, some diseases that are rare elsewhere may have become enriched. This will facilitate the collection of a sufficient number of families affected with the disease. Furthermore, genetic drift has likely reduced allelic heterogeneity and thus simplified haplotype-based fine-mapping efforts, as well as the locus heterogeneity, so that most of the affected families will presumably carry a mutation in the same gene. Linkage analysis has proved effective in mapping genes causal of monogenic diseases, but it is very sensitive to locus heterogeneity and has low power to detect loci with weak effects. Its successes have therefore been modest in locating genes for complex diseases, which are thought to arise from the contributions of several small-effect susceptibility alleles from multiple genes. Indeed, replicable detection of such genes with linkage analysis would require extensive if not impossibly large sample sets, whereas the same power is achievable with considerably smaller sample sets in association analyses (Risch & Merikangas 1996). The latter have thus won popularity in complex disease studies, once technological developments made them feasible.

3.2. Association analysis

Rather than explicitly inferring recombination events during meioses, association analysis relies on the cumulative effect of those recombinations in a population; individuals sharing a disease mutation through common descent are also likely to share adjacent short stretches of the haplotype in which the mutation originally occurred (or in which the mutation entered the population). This leads to LD between the disease allele and the surrounding alleles and can be detected as co-occurrence of the surrounding alleles with the disease more often than expected (Figure 6). In its simplest form, association analysis compares a set of cases with the disease of interest to a set of control individuals without the disease but otherwise resembling the cases as closely as possible; any genetic factors that differ between the groups are assumed to relate to the disease. Extensions of this principle to quantitative phenotypes exist, but in this section, association analysis is mainly discussed from the point of view of case-control analysis and disease traits.

[Figure 6: panels "cases" and "controls". Frequency of allele T: 8/20 = 40% in cases, 4/20 = 20% in controls.]

Figure 6. Principle of association mapping. A genetic variant (marked X) that contributes to risk for a complex disease is in linkage disequilibrium (LD) with allele T of a nearby SNP. Despite incomplete LD (risk haplotypes with C and non-risk haplotypes with T allele), reduced penetrance (healthy individuals with the risk variant) and phenocopies (cases without the risk variant), the T allele is more common among cases than among controls, i.e., it associates with the disease.
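The case-control comparison of Figure 6 can be expressed as a 2 x 2 table of allele counts. The sketch below computes an allelic odds ratio and a Pearson chi-square statistic with one degree of freedom; this is one common way to test allelic association, chosen here for illustration, and the function name is hypothetical.

```python
def allelic_association(case_alt, case_total, control_alt, control_total):
    """Odds ratio and Pearson chi-square (1 df) for a 2 x 2 allele table.

    case_alt and control_alt are counts of the tested allele; the totals
    are numbers of chromosomes. Raises ZeroDivisionError if a cell needed
    in a denominator is empty.
    """
    a, b = case_alt, case_total - case_alt
    c, d = control_alt, control_total - control_alt
    odds_ratio = (a * d) / (b * c)
    n = a + b + c + d
    chi_square = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    return odds_ratio, chi_square
```

With the counts of Figure 6 (8 of 20 T alleles in cases, 4 of 20 in controls), the odds ratio is about 2.7 but the chi-square statistic is only about 1.9, below the 5% critical value of 3.84 for one degree of freedom: a toy sample of this size is far too small to establish the association.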

Because many meioses and recombinations have taken place in a population since the original appearance of a disease mutation, LD usually extends a much shorter distance on the chromosome than does linkage, and association analyses typically require a genotyping density in the range of a few kilobases. This obviously translates to a relatively high resolution. On the other hand, the prior unavailability of sufficiently dense marker-sets restricted association analyses mainly to candidate gene studies and to the fine mapping of linkage analysis signals (in which context association analysis is often called linkage disequilibrium analysis). Since the early 2000s, commercial SNP genotyping arrays have enabled genome-wide association studies (GWASes). These are based on the “common disease - common variant” hypothesis, in which susceptibility to complex diseases results from genetic variants that are common in the population and also largely shared between populations. By 2012, the nearly 1200 published GWASes had detected associations ($p < 1 \times 10^{-8}$) of more than 2000 SNPs with dozens of phenotypes ranging from melanoma, migraine, and malaria to height and hair color (National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies, http://www.genome.gov/gwastudies, accessed 03/06/2012).

Compared to linkage analysis, association analysis is less sensitive to reduced penetrance of the causal variant and to locus heterogeneity. On the other hand, unlike linkage analysis, it can be seriously affected by allelic heterogeneity or population stratification. Population stratification can create spurious associations, if the sampling patterns of cases and controls across subpopulations differ, because an allele whose frequencies vary between subpopulations may differ between cases and controls even if the allele is unrelated to the phenotype. Obviously, when genome-wide data are available, they can serve in the genetic matching of cases and controls prior to association analysis or in a computational stratification correction, typically based on principal component analysis (PCA) (Price et al. 2006).
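How uneven sampling across subpopulations produces a spurious association can be shown with simple arithmetic. The sketch assumes an allele that has no effect on the phenotype, so that within each subpopulation cases and controls share the same allele frequency; the function name and the numbers in the usage note are hypothetical.

```python
def pooled_allele_frequency(sampling_fractions, subpop_frequencies):
    """Allele frequency in a sample drawn from several subpopulations.

    sampling_fractions gives the share of the sample taken from each
    subpopulation (summing to 1); subpop_frequencies gives the allele
    frequency in each subpopulation.
    """
    return sum(w * p for w, p in zip(sampling_fractions, subpop_frequencies))
```

If the allele frequency is 0.2 in one subpopulation and 0.6 in another, cases drawn 80% from the first subpopulation have a pooled frequency of 0.28, whereas controls drawn 80% from the second have 0.52. The difference arises purely from the sampling pattern and disappears when cases and controls are sampled in the same proportions.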

3.2.1. Suitability for different populations Association analysis may benefit from the use of populations that are characterized by founder effects, small size, or isolation, because their reduced genetic diversity can lead to lower locus and allelic heterogeneity. However, the reduction in diversity is likely smaller for common variants than for the rare ones targeted by linkage analysis. The obvious downside of such study populations is that some of the variants contributing to the disease in other populations will be missing or rare and therefore impossible to find, and, conversely, the enriched variants may elsewhere have lower frequencies and thus be of less importance (see Martin 2006). The LD utilized by association analysis can vary in strength between populations. When LD extends more widely on the chromosome, an association analysis needs a less dense set of markers. It can therefore be advantageous to conduct association studies in populations that exhibit high LD, be it from founder effects, small population size (Terwilliger et al. 1998), or other factors resulting in genetic drift. This involves a trade-off, however; while stronger LD allows association scanning with smaller sets of markers, it also leads to lower resolution. This disadvantage may be partly circumvented by replicating and refining association findings from high-LD populations in populations that are more outbred. Increased LD can result from a recent population admixture and, as stated above, reduce the number of markers needed in an association scan. Additionally, if a disease differs in frequency between the parental populations, its causal variant is likely to lie in a genome region where cases show ancestry from the high-frequency parental population either more often than controls do or more often than elsewhere in their genome (Stephens et al. 1994, Smith & O’Brien 2005, Zhu et al. 2008); the ancestry inference can be based on a set of ancestry-informative markers (AIMs) that differ in frequency between the parental populations. This approach, admixture mapping (or mapping by admixture linkage disequilibrium), has successfully located genes for several phenotypes such as hypertension and multiple sclerosis (Zhu et al. 2005, Reich et al. 2005; see Winkler et al. 2010 for a review). The usual target population has been African-Americans, in whom the admixture between Africans and Europeans is sufficiently recent to enable mapping with a hundredfold fewer markers than for nonadmixed populations (Smith et al. 2004). Other AIM panels exist, for example for Latinos (Tian et al. 2007, Price et al. 2007, Mao et al. 2007) and East Asian Uyghurs (Xu & Lin 2008). A further advantage of admixture mapping is that it can locate relatively low-risk loci by use of fairly small sets of cases and controls (Hoggart et al. 2004, Patterson et al. 2004; cf. section 3.2.2).

3.2.2. Suitability for different phenotypes The commercial SNP genotyping arrays that are used in GWASes feature common SNPs, typically with MAFs exceeding 5%. Since the LD between a common and a rare variant will be low (Wray 2005, Eberle et al. 2006), GWASes are underpowered to find rare disease alleles. Common alleles, in turn, are unlikely to have large effects on disease risk, due to the purifying effects of selection – perhaps with the exception of late-onset diseases and variants that would not have been disadvantageous, for example, for a pre-industrial life style. Thus, GWASes were predicted to find mainly low-effect variants, and, indeed, most SNPs indicated by GWASes have small effect sizes: the 1,234 associated SNPs (with $p < 1 \times 10^{-8}$) reported by the NHGRI Catalog of Published Genome-Wide Association Studies (accessed 01/01/2012) had a median odds ratio (OR) below 1.28. Finding small-effect alleles requires large sample sizes, especially in genome-wide studies where the significances must survive a stringent multiple testing correction. Accordingly, many of the current successful GWASes are meta-analyses or efforts of large international consortia, and feature combined datasets with thousands or tens of thousands of individuals. The need to collect large numbers of cases can obviously limit the usefulness of GWASes in small populations or for relatively rare diseases. Furthermore, the power of prediction of the disease risk of individuals based even on ample sets of such markers tends to be low, often much lower than that based on traditional (non-genetic) risk factors (see Meigs et al. 2009 for type 2 diabetes, Aulchenko et al. 2009 for height). Even when association analyses have revealed numerous variants that contribute to a complex disease or phenotype, the variants commonly explain only a modest portion – usually 10% or less – of the genetic variation in the phenotype (although when non-significantly associated loci are also considered, the proportion may rise to 50%) (Visscher et al. 2012 and their references). This phenomenon, known as missing heritability, may result from many factors: 1) epistatic interactions computationally unfeasible to test in a hypothesis-free manner; 2) structural variations, which are currently understudied in GWASes, substantially contributing to disease susceptibility; 3) complex inheritance, for instance epigenetic mechanisms or parent-of-origin effects; 4) inflated estimates of total heritability; 5) common variants not reaching genome-wide significance in the GWASes due to their low penetrance but that might be found by further enlarging sample sizes; 6) actual causal variants in the GWAS-associated regions having larger effects than the genotyped SNPs; and 7) rare alleles that confer large effects and that may be found by sequencing (Eichler et al. 2010, Manolio et al. 2009). These factors, however, have not yet been sufficiently studied to enable estimation of their relative contributions to genetic variation in human complex diseases, and the subject remains debated.

3.3. Complementary methods

As described, linkage and association analysis are applicable to a wide range of diseases and other phenotypes. Indeed, each approach has mapped dozens of genes successfully. Both approaches nevertheless have their disadvantages and limitations, which creates a demand for complementary methods that may be suitable for a specific subset of phenotypes, populations, or variants. This thesis introduces and theoretically tests one such method, the subpopulation difference scanning (SDS).

4. Genetic inference of population history

4.1. General principles

As described in section 1, various demographic processes will influence the genetic structure of a population. Conversely, population structure can be useful in inferring processes that have affected the population in the past and thus in estimating earlier population structure. Another, more direct means of assessing the structure of past populations would be studying ancient DNA (aDNA), but such studies often suffer from the limited availability of suitable samples and from technical challenges such as DNA degradation and contamination (Hofreiter et al. 2001, Mitchell et al. 2005, Willerslev & Cooper 2005). Consequently, the genetic inference of population histories mostly relies on studies of extant populations.

To complement direct observations of contemporary population structure and diversity, population history inference often utilizes population simulations. These can model the processes that may have played a role in the population’s past. Two main types of simulations exist. Forward simulations start from a set of initial parameters, proceed forwards in time according to defined demographic scenarios, and produce simulated datasets that can then be compared to the real data. Backward or coalescent simulations start from the observed data and proceed backwards in time to infer what kind of processes may have produced them. However, relative to the multitude of factors that affect real population demographies, both forward and backward simulations are often highly simplified. Additionally, compatibility of the observed data with (forward) simulations does not prove that the simulated scenario would be the one that has actually taken place. Equally or more compatible results could result from other, untested simulation scenarios. Thus, instead of proving a given scenario, simulations can serve to exclude, from various explanations for the observed data, obviously improbable scenarios.
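The idea of a forward simulation can be illustrated with a minimal sketch. The code below is not a method used in this thesis; it assumes a textbook Wright-Fisher model of pure genetic drift at one biallelic locus (constant population size, no selection, mutation, or migration), and the function names and parameter values are hypothetical choices made only for illustration.

```python
import random


def wright_fisher(pop_size, start_freq, generations, seed=None):
    """Allele-frequency trajectory of one biallelic locus under pure drift.

    Assumes a diploid population of constant size (2 * pop_size gene copies)
    with non-overlapping generations; all other forces are ignored.
    """
    rng = random.Random(seed)
    n_copies = 2 * pop_size
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Every gene copy of the next generation is drawn at random from
        # the current generation, i.e. binomial sampling of the allele.
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        freq = count / n_copies
        trajectory.append(freq)
    return trajectory


def mean_heterozygosity(pop_size, start_freq, generations, replicates, seed=0):
    """Average 2p(1-p) at the end of the simulation over replicate runs."""
    total = 0.0
    for i in range(replicates):
        p = wright_fisher(pop_size, start_freq, generations, seed=seed + i)[-1]
        total += 2 * p * (1 - p)
    return total / replicates
```

Calling, for example, `mean_heterozygosity(10, 0.5, 50, 200)` and `mean_heterozygosity(100, 0.5, 50, 200)` produces simulated diversity values for a small and a larger population that could then be compared with observed data. As noted above, agreement between such a simple scenario and real data would not show that the scenario actually took place.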
Genetic variation within a population may result from very different population-genetic processes, typically from a combination of many factors. For example, factors such as selection, absence of migration, or a small population size that has caused strong genetic drift can all result in low diversity of a locus. The presence of selection can often be explicitly tested, or simply ignored if the proportion of loci under selection is presumably low. The effects of selection can, however, extend to neighboring loci through hitch-hiking phenomena (see section 1.2.3). Mutation, in turn, usually has negligible effects on population diversity when the time intervals of interest are short, and the mutation rates for the marker type in question are low, as for example for SNPs.

In addition to overall level of diversity, geographical structuring of that diversity within a population may vary and thus be of interest for population-history inference. Indeed, low diversity does not always equal within-population homogeneity (Figure 7), because factors such as low population density that reduce general diversity (through genetic drift) will likely cause increased substructuring (through small breeding units). The existence of population substructure is typically detected as deviations in genotype frequencies from HWE (see section 1.3), and is quantified by FST-like measures, several versions of which exist for different marker types. Furthermore, information on the past can emerge from the level and patterns of LD: for example, admixture events create increased LD that in subsequent generations will gradually decay.
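As an illustration of how an FST-like measure summarizes substructure, the following sketch computes Wright's FST = (HT - HS) / HT for a single biallelic locus from subpopulation allele frequencies. It is a simplified, parameter-level version that assumes HWE within each subpopulation and ignores sampling error; it is not one of the specific estimators referred to in the text, and the function names are hypothetical.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1-p) of a biallelic locus under HWE."""
    return 2.0 * p * (1.0 - p)


def fst_biallelic(freqs, sizes=None):
    """Wright's FST = (HT - HS) / HT for one biallelic locus.

    freqs: allele frequency in each subpopulation.
    sizes: optional relative subpopulation sizes (equal weights if omitted).
    """
    if sizes is None:
        sizes = [1.0] * len(freqs)
    total = float(sum(sizes))
    weights = [s / total for s in sizes]
    # HS: mean heterozygosity expected within the subpopulations.
    h_s = sum(w * expected_heterozygosity(p) for w, p in zip(weights, freqs))
    # HT: heterozygosity expected if the pooled population mated at random.
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_t = expected_heterozygosity(p_bar)
    if h_t == 0.0:
        return 0.0  # locus is monomorphic in the total population
    return (h_t - h_s) / h_t
```

With two equally sized subpopulations at allele frequencies 0.2 and 0.8, HS is 0.32 and HT is 0.5, giving an FST of 0.36. The shortfall of HS relative to HT is the heterozygote deficit that shows up as a deviation from HWE when a substructured population is analyzed as a single unit.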

Figure 7. Relation between genetic diversity and homogeneity or heterogeneity. Even the populations (large circles) with relatively low diversity can be either homogeneous or heterogeneous, depending on the geographic distribution of genetic variants (colored circles) they contain. [Figure panels: heterogeneous and homogeneous populations, at high and at low diversity.]

Genetic data from extant populations also enable the timing of population history events. When populations diverge through genetic drift, their genetic distances can serve in inferring their relative divergence times. This approach, however, presumes the absence of subsequent migration between the diverging populations and constant (or at least known) effective population sizes – both criteria that the human population history rarely fulfills. Other timing methods rely on the gradual accumulation of mutations (for example STR diversity within Y-chromosomal haplogroups) or the decay of haplotypes through recombination (Figure 8). These estimates can be calibrated into years or generations by use of mutation and recombination rates, but because these rates are usually imprecise, the confidence intervals of the resulting time estimates typically remain very wide. Importantly, these approaches measure the molecular, rather than population, history. For example, the most recent common ancestor of a set of lineages in a population may greatly antedate the migration that brought those lineages into the population. The two time-points will be close to each other only in the cases of extreme founder effects or bottlenecks.

Figure 8. Principle of genetic timing. Mutations (A) or recombinations (B) accumulate over time on the descendant copies of an original haplotype. The larger the number of mutational differences (thin stripes in A) or the shorter the shared stretch of the original haplotype (white in B), the longer the time passing since the split between the two lineages. [Figure panels: A and B, with time as the axis.]
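The sensitivity of such timings to the assumed rates can be sketched with the standard two-locus expectation that LD decays as D_t = D_0 (1 - r)^t, where r is the recombination fraction per generation. Solving for t gives an age in generations. The example is only illustrative: the LD values, the range of r, and the generation length of 25 years are hypothetical inputs, and the function names are not taken from any particular software.

```python
import math


def generations_from_ld_decay(d_now, d_initial, recomb_fraction):
    """Solve D_t = D_0 * (1 - r)**t for t (in generations)."""
    if not (0 < d_now <= d_initial):
        raise ValueError("need 0 < d_now <= d_initial")
    if not (0 < recomb_fraction < 1):
        raise ValueError("recombination fraction must be between 0 and 1")
    return math.log(d_now / d_initial) / math.log(1.0 - recomb_fraction)


def time_interval(d_now, d_initial, r_low, r_high, years_per_generation=25.0):
    """Range of ages (in years) implied by uncertainty in r.

    A lower recombination fraction means slower decay, hence an older
    estimate; the generation length is an assumed calibration constant.
    """
    t_old = generations_from_ld_decay(d_now, d_initial, r_low)
    t_young = generations_from_ld_decay(d_now, d_initial, r_high)
    return t_young * years_per_generation, t_old * years_per_generation
```

For instance, `time_interval(0.05, 0.2, 0.005, 0.02)` returns a lower and an upper bound that differ roughly fourfold, which mirrors the fourfold uncertainty assumed for r and shows why imprecise rates translate into wide intervals for the time estimates.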

4.2. Genome-wide SNP datasets in studies of population history

It is generally preferable to base population history inference on data from several loci, because the patterns of single loci are subject to different levels of selection and genetic drift (Jobling et al. 2004). Notably, mtDNA and the Y chromosome, which have been widely used in population genetics, comprise only two loci because of their nonrecombining nature, and their smaller effective population size makes them specifically prone to genetic drift. Whereas the latter may even be advantageous when studying closely related populations, the two loci may not yield a full and representative picture of a population’s history, and may need to be complemented by autosomal markers.

In recent years, an important source for such autosomal data has been provided by the genome-wide SNP datasets produced by GWASes. Furthermore, GWASes have awakened increased interest – also among non-population geneticists – into population structure because it is a confounding factor in the association studies. Together, these developments have resulted in numerous studies of population structure on a global and continental as well as on more local scales. They have also spawned methodological advances in population genetics, for example evaluations of the interpretation and limitations of the traditionally used method of PCA (Patterson et al. 2006, Novembre & Stephens 2008, McVean 2009, François et al. 2010).

Population genetic analyses of the genome-wide SNP datasets can use many of the existing analysis methods, provided that those are computationally sufficiently fast to effectively utilize the vast amount of data. Obviously, the presence of recombinations restricts the range of possible analysis methods relative to those applicable on mtDNA- and Y-chromosomal data, and the biallelic nature of the SNPs on commercial genotyping arrays limits the informativeness of single markers. Use of short haplotypes of adjacent SNPs may, however, enhance the power of analyses at least in some situations (Jakobsson et al. 2008, Morin et al. 2009, Gattepaille & Jakobsson 2012). Additionally, the large number of markers available enables individual-based analyses, which have the potential to produce results independent of the researcher’s preconceptions of the underlying population structure. This is an important capacity, especially when the geographical sampling distribution is relatively even and suggests no natural division into units of analysis. Several novel analysis methods have also been developed specifically for genome-wide SNP data (see Moorjani et al. 2011, Johnson et al. 2011, Pickrell & Pritchard 2012). However, many such methods may be more useful for study of global or intercontinental phenomena than of the small differences between closely related (sub)populations.
Some disadvantages may limit the use of genome-wide SNP datasets from GWASes in population history inference. Although the size and geographic density of sampling in these datasets may be unprecedented, from a population history perspective, the populations targeted by association analyses may not be maximally interesting. The geographical and ancestry information available can also be much less detailed than if the individuals had been recruited specifically for population genetic studies. Even geographically dense sampling may not always be a blessing, because some of the existing analysis methods are not well suited to inferring continuous and clinal population patterns. Furthermore, ascertainment of SNPs on commercial genotyping arrays is biased: those SNPs are common rather than rare variants. Moreover, if their selection has been based on data from one particular population, variation in other populations may not be represented equally well (Wakeley et al. 2001, Clark et al. 2005, Rosenblum & Novembre 2007). While correction methods for ascertainment do exist (Nielsen et al. 2004, Guillot & Foll 2009), a more balanced picture of population genetic variation will eventually emerge from sequencing studies.

4.3. The multidisciplinary inference of human past

The historical inferences genetics provides can be compared to those provided by other disciplines. Archaeology studies the remnants of past human cultures to infer their customs, contacts and sources of livelihood. Linguistics can expose the divergence of related languages through time, presumably reflecting their spatial spread, and discover loanwords signaling contacts between speakers of different languages. Additionally, the words’ meanings can provide information on the circumstances in which they were used or acquired – for example, names for inventions are often loaned along with the inventions themselves – and toponyms (place names) of loaned origin can even anchor these language forms to a specific geographical location. Physical anthropology can use bone findings to draw inferences as to the anatomy and health of ancient individuals and relationships between peoples. Other material from the past can undergo study by tools from sciences like botany, zoology, and ecology to yield information on past environments, living conditions, and means of human subsistence. History, which relies on written documents, can often provide more detailed and direct information than other disciplines, but is applicable only to periods with written records available.

Because these disciplines study different aspects of the human past, their conclusions may also differ – our genes determine neither the language we speak nor the type of pottery we use. Admittedly, the three often correlate because of being typically inherited in a similar manner, but the correlation may be broken in various circumstances. From the archaeological record it is often difficult to tell whether inventions or cultures have spread into an area along with major migrational movements or mainly as cultural loans; the two options could obviously leave very different traces in a population’s language or gene pool.
Conversely, there are major, historically known linguistic influences invisible in the archaeological record (see Mallory 1989 p. 166). Meanwhile, relatively few immigrants per generation can significantly shape a population’s gene pool over time, without necessarily causing noticeable changes in the material culture or leaving more than an occasional loan word in the linguistic record. On the other hand, factors like the social dominance of a small group of immigrants can lead to language replacement in the whole population within a few generations, while the influence on the gene pool may remain subtle.

In theory, historical inferences based on genetic data are independent of results from other disciplines. In practice, however, it is fairly difficult to form an accurate picture of a population’s past based on genetic evidence alone, since many different phenomena can result in similar genetic patterns, and the confidence intervals of timings of genetic events are often wide (see section 4.1). Therefore, insights into a population’s history may be crucial in the sensible interpretation of genetic data. For instance, agricultural populations are typically much denser and more sedentary than hunter-gatherer populations, which immediately influences many population genetic processes. Obviously, an up-to-date view of the timing, direction and extent of contacts with neighboring populations is equally important.

5. Finland

5.1. Population prehistory and history

The earliest – albeit highly disputed – signs of habitation within the current area of Finland may date back to the Eemian interglacial more than 40,000 years ago (Pettitt & Niskanen 2005, Schulz 2010). During the last glacial period, however, the Fennoscandian area was covered by a thick continental ice sheet that both made the area unsuitable for life and wiped out most signs of possible earlier inhabitants. While certain species like may have survived in some ice-free locations of (Parducci et al. 2012), the distribution of humans and most other species was limited to refugia located further south, for example in Iberia and the . The first postglacial settlers arrived in the area of present-day Finland soon after the glacial ice started retreating. These were hunter-gatherers who probably followed the spread of deer and other prey species into the newly emerged areas (Huurre 2005). The earliest settlements in Finland date back to 8000-8500 BC, with those immigrants coming from the south (Takala 2004). Other early settlers arrived from eastern and southeastern directions (Siiriäinen 2003a). In northern , including northern Finland, the early settlers arrived via a route along the Norwegian coast (Siiriäinen 2003a). Recently, objects connected to the southern culture and dating back to 8300-8200 BC have also been found in northernmost Finnish (Rankama & Kankaanpää 2008). From the period between 7000 and 6000 BC, multiple settlement sites are known throughout Finland (Huurre 2005). The population size was probably not more than a few thousand (Pitkänen 2007). Pottery came into use around 4200 BC along with the arrival of the Comb Ceramic culture (Figure 9A) that prevailed in large areas east and southeast of Finland. Several subtypes of Comb Ceramics can be recognized, for example the Typical Comb Ceramic culture that spread from in 3400 BC.
Although such subtypes indicate the presence of regional (and temporal) differences, the era was apparently characterized by frequent contacts both within the country and with outside areas. These are demonstrated for instance by archaeological findings that include objects made of Siberian , originating from the Ural mountains, and Baltic amber (Siiriäinen 2003a, 2003b, Huurre 2005).


Figure 9. Geographic distribution of the Comb Ceramic (CC), Funnel Beaker (FB), and Pitted Ware (PW) cultures in Fennoscandia in ca. 3000 BC (A), and the ca. 2500-2000 BC (B), according to Huurre (2005) and Carpelan (1999). Note that other cultures existed elsewhere in the area.

The area of Finland first split into western and eastern cultural domains upon the arrival of the Corded Ware culture, part of a wider movement that spread over the whole of , southern Norway and southern Sweden in 2300 BC, and reached Finland through the eastern Baltic, possibly a few centuries earlier. In Finland, its influence was limited to the southwestern and coastal parts of the country (Figure 9B). Contacts between the Corded Ware area and the rest of Finland seem to have been very rare. In the eastern areas, the Comb Ceramic culture with its predominantly eastward connections persisted. Even in the western area, Corded Ware and the local Comb Ceramic culture seemed to remain mostly separate, until towards the end of Stone Age they fused to form the culture (Carpelan 1999, Siiriäinen 2003a, 2003b, Huurre 2005). With the advent of metal in the beginning of the Bronze Age (1500/1300 BC - 500 BC), the division between eastern and western Finland continued: both regions acquired the metal via their main direction of contact. In the west, the main direction of contact shifted from the Baltics to Scandinavian areas, as Scandinavian traders extended their trips to the Finnish coasts to buy fur. In the east, southeastern contact brought textile pottery from central into southeastern and eastern Finland. From there, ceramics spread also to the previously aceramic north (Siiriäinen 2003a, 2003b, Huurre 2005).

The introduction of agriculture to Finland has often been connected to the arrival of the Corded Ware people who were agriculturalists elsewhere, although direct signs of this have not been found in Finland (Siiriäinen 2003a, 2003b). On the other hand, experiments with agriculture may have been made even earlier, during the Typical Comb Ceramic period, as in related cultures in and in the Onega region in Russia (Mökkönen 2011). One theory is that agriculture arrived during the Bronze Age both from the east along with textile pottery and from the west along with Scandinavian contact (Siiriäinen 2003a). Pollen studies have detected signs of agriculture from more than 4000 years ago in southwestern Finland (Vuorela 1999). Regardless of the time of arrival, (especially fishing) long continued to be an important means of subsistence even in the agricultural areas (Siiriäinen 2003a).

Earlier, archaeological findings dating to the early were scarce. This led archaeologists to postulate a discontinuity in Finnish settlement and a large-scale migration of the ancestors of the present-day Finns from the south in the first centuries AD. This theory is no longer valid, however, since the archaeological record has become more plentiful for this period. Currently, settlement is thought to have been continuous, although climate factors may have caused a population decrease in 500 BC to 0 AD which was compensated by a population increase from 0 to 200 AD. The latter part of the Iron Age was characterized by extension of agricultural areas, increase in population density, Baltic and Scandinavian contact and trade with central Russia (Siiriäinen 2003b, Huurre 2005, Tallavaara et al. 2010). Based on the archaeological record it is often difficult to say which of the observed influences have resulted from a mere cultural diffusion and which may have involved immigration, and the archaeologists are hardly unanimous in their interpretations.
The initial introduction of ceramics may have been mainly a cultural loan, but a population movement has often been associated with the spread of the Typical Comb Ceramic culture, although the increase in settlement finds may also stem from a local economic boom prompted by the warming climate rather than from a population influx. The arrival of the Corded Ware culture was likely accompanied by some degree of immigration, but the actual number of settlers may have been very small; furthermore, they eventually fused with the former population rather than replaced it. The Scandinavian contact during the Bronze Age, though mainly trade-related, may have involved a small amount of colonization. For the Iron-Age influence from the south, estimates of immigration range from some to none (Carpelan 1999, Siiriäinen 2003a, 2003b, Huurre 2005, Tallavaara et al. 2010).

The influence of began gradually to diffuse into Finland in the 11th century, although pagan traditions persisted widely for centuries. Nevertheless, this development attached Finland as a part of the Swedish kingdom in the second half of the 12th and first half of the 13th century; in this process, the efforts of the church were in fact more important than the efforts of the royal authorities (Lindkvist 2003a, 2003b, Mäntylä 2003a, Sawyer & Sawyer 2003, Siiriäinen 2003b, Vahtola 2003a). In the early 12th century, agricultural settlement was limited to small areas of coastal , Åland, southern , Häme, Southern Savo, and the northern and western shores of . The early medieval centuries witnessed an increasing density of these settlements as well as the spread of agriculture to new areas in Southwest Finland, Satakunta, Karelia, Häme, Savo, Southern Ostrobothnia and the and river valleys. Consequently, a population estimated as less than 50,000 in the early 12th century may have at least doubled within a few centuries. Population growth in Sweden led to Swedish immigration to the coasts of Ostrobothnia and southern Finland between the mid-13th and mid-14th centuries, and some Estonian immigrants may also have arrived in Southwest Finland. Numerous German merchants settled in , and many were later assimilated (Orrman 1999, Dahlbäck 2003, Vahtola 2003a, 2003b, Pitkänen 2007).

Despite these developments, agriculture remained limited south of 62 degrees of latitude except on the west coast, mostly because of climate and soil. The areas inland were not uninhabited, but consisted of hunting grounds of the agricultural areas and of the nonagricultural Saami population. From the 16th century onwards, however, agriculture extended into these areas with the spread of settlers predominantly originating from the of Southern Savo (themselves mostly of Karelian ancestry). This spread was enabled by their farming methods (slash-and-burn) and a new variety of winter-tolerant rye. The movement started spontaneously, probably propelled by the earlier population growth, but since the latter half of the 16th century it was also legislationally encouraged by the Swedish King.
By the 17th century, the migrants had gradually spread to Northern Savo, to the previously uncultivated areas of Häme, Satakunta, Southern and Northern Ostrobothnia, and to , Karelia, the inland areas of central and northern Sweden, and even to Norway. The Saami inhabitants of the areas probably in part retreated from and in part assimilated with the newcomers. Notably, many of these areas were on the Russian side of the 1323 border, but the newly-settled areas of northern Finland were ceded to Sweden in 1595. Other consequences of the process were the doubling of the agricultural area of Finland and growth of the population to 250,000 to 350,000 (Orrman 1999, Keränen 2003, Vahtola 2003a, 2003b, Pitkänen 2007).

After the population expansion and increase of the 16th century, the population growth of the following century was very modest, mainly due to cold summer temperatures leading to numerous years of low crop yields throughout the century. The famine of the winter of 1696-1697 was especially severe: in combination with related epidemics, it may have killed one-third of the population. Consequently, though the population size was about 290,000 in 1635 and 400,000 before the famine, in 1721 it was 306,000. In subsequent years, population growth resumed, leading to a population of 405,000 in 1749 and nearly 900,000 in 1800. Despite a famine that killed 100,000 people as late as in the winter of 1867-1868, the population reached two million by 1880 (Mäntylä 2003a, 2003b, Nygård & 2003, Olkkonen 2003, Pitkänen 2007). Finland was ceded to Russia in 1809 and gained independence in 1917.

Since the late 19th century, migration within the country has been directed to the and towards the south. For example Helsinki and the surrounding province, , contained 10% of the Finnish population in 1880 but 25% in 1990 (Pitkänen 2007). Nevertheless, urbanization occurred relatively late: more than half of the population lived in rural areas until 1970 (Vihavainen 2003, Pitkänen 2007). Another instance of internal migration took place after World War Two, when people from the areas ceded to the – 12% of the total population – were resettled elsewhere in Finland (Haataja 2003). The 20th century also featured two major emigration waves; in the first decades of the century, 150,000 Finns left, mainly to America, and in the 1960s, almost 300,000 went to Sweden (Nygård & Kallio 2003, Vihavainen 2003, Pitkänen 2007). A smaller population movement occurred during World War Two, when 70,000 children were sent for safety to Sweden. While most of them returned after the war, thousands stayed in their foster families (Laine 2003). At present (July 2012), the Finnish population size is 5.4 million (Official Statistics of Finland 2012).

In summary, the Finnish population history demonstrates a general continuity, possibly since the first postglacial settlers.
While the overall number of incoming settlers has undoubtedly been small, immigration has nevertheless been characterized by relatively frequent contact with various neighboring populations throughout the ages, rather than a handful of distinct immigration waves. The population remained nonagricultural for a long time, especially in eastern Finland, and small in numbers until a few centuries ago.

5.2. Language

The Finnish language, with its 5 million speakers, is the second largest language of the Uralic (Finno-Ugric) language family (Nanovfszky 2004). It is closely related to other , including Estonian and several small languages spoken around the Baltic Sea, for example Karelian and Vepsian, and to the Saami languages spoken in the northern parts of Finland, Sweden, and Norway, and in the Russian . Other related languages include for example Komi, Udmurt, and Mordvin, spoken west of the Urals in Russia, and the more distant relatives, Hungarian and several languages of (Laakso 1991). Within Finnish, there is a major division into eastern and western dialects (Leskinen 1999). The first written documents of Finnish date from the 16th century (Häkkinen 1996).

The Finnish language shows signs, in the form of loanwords, of contacts with ancient Indo-European language forms, and more recent strata of loanwords from Baltic, Slavic, and various forms of . Extensive analysis of the typical meanings of these words is available (Häkkinen 1990), but interesting in the current context is that several Finnish words for female relatives are Baltic loans, and the word for “mother” is Germanic in origin. During later times, loanwords have come from Swedish as well as Greek and (often via Swedish), and recently from English (Häkkinen 1990). Studies of Finnish place names have recognized Germanic names, names of Saami origin even in southern Finland, and possibly names originating from some unknown language (Häkkinen 1996).

The first arrival of the Uralic language in Finland has often been connected to the Typical Comb Ceramic culture (see Meinander 1984), the spread of which would coincide with the traditional estimates of splitting of the Proto-Uralic language ca. 6000 years ago. Furthermore, the arrival of the Corded Ware people has been thought to have led to the initial divergence between Saami and Finnic languages (Sammallahti 1984) and to explain some of the Baltic loan words found in Finnish (Carpelan 1999). More recently, however, an apparent archaeological continuity has led to suggestions that it was the early settlers of Finland who originally introduced the Uralic language into the area (Wiik 2002), but this view is strongly opposed by most linguists. Some linguists, in turn, have proposed markedly shorter chronologies for the divergence of the Uralic language family: an initial split of the Proto-Uralic language as late as 2000 BC, the arrival of a Uralic language in the Baltic Sea area ca.
1900 BC, and the split between Finnish and Saami no earlier than a few centuries BC (Kallio 2006, Häkkinen 2009). On the other hand, a Bayesian phylogenetic analysis has suggested a Proto-Uralic divergence time largely in line with the traditional estimates (Honkola et al. unpublished). In 2011, Finnish was spoken by 4.9 million people in Finland (some 90% of the population). The other official language, Swedish, was spoken by 270,000 (4.9%), and Saami by 1,900 (0.03%). Other minority languages in Finland include Russian with 58,000 speakers (1.1%) and Estonian with 33,000 (0.6%) ().

5.3. Genetic structure

5.3.1. Finns compared to other European populations

The Finns’ genetic structure has been studied extensively, starting with blood group markers in the first half of the 20th century (see Norio 2003b p. 461 for a review). A study of nuclear genes found Finns as outliers within Europe (along with for example Saami, Basques, and Icelanders), with Belgians, Germans, Austrians, and Swedes as their closest genetic neighbors (Cavalli-Sforza et al. 1994). Finns also differ from their Scandinavian neighbors in their human leukocyte antigen (HLA) frequencies (Siren et al. 1996), and the frequencies of one blood group reveal a Baltic influence in Finland (Sistonen et al. 1999). The mtDNA diversity and haplogroup composition of Finns are similar to those of other Europeans (Sajantila et al. 1995, 1996; Lahermo et al. 1996, Torroni et al. 1996, Kittles et al. 1999, Finnilä et al. 2001, Hedman et al. 2007), whereas Finns’ Y-chromosomal diversity is substantially reduced (Sajantila et al. 1996, Kittles et al. 1998, 1999; Lahermo et al. 1999, Zerjal et al. 2001, Hedman et al. 2004, Lappalainen et al. 2006). Y-chromosomal haplogroups also reveal Baltic and Scandinavian contact (Lappalainen et al. 2006, 2008).

Recent genome-wide SNP studies have shown that Finns differ from other Europeans even more than their geographic location would indicate. Depending on the combination of populations included, Finns appear to be genetically closest to Swedes, , Germans, and Poles, among others (Seldin et al. 2006, Bauchet et al. 2007, Lao et al. 2008, Novembre et al. 2008, McEvoy et al. 2009, Nelis et al. 2009). In a study looking at genetic matches between individuals from 23 European locations, for 83% of the Finnish samples the closest match was another Finn, which illustrates their genetic distinctness (Lu et al. 2009). Furthermore, genome-wide data have allowed study of signs of recent positive selection among Finns and other European populations (Lappalainen et al. 2010). Finns are also likely to be among the first populations to be extensively studied by the new sequencing methods, not least because they are included in the 1000 Genomes Project (1000 Genomes Project Consortium 2010).

5.3.2. Eastern influence in the Finnish gene pool

Although the Finnish gene pool is mainly European, it also harbors distinct eastern elements. Estimates of the relative contributions of these sources to the nuclear gene pool have been 75% European and 25% non-European (Nevanlinna 1984) or 90% European and 10% Uralic (Guglielmino et al. 1990). HLA haplotype frequencies in Finland are consistent with a predominantly European origin of the population but also show traces of eastern elements (Ikäheimo et al. 1996). Stronger eastern influence is apparent in the Y chromosome (Zerjal et al. 1997, Lahermo et al. 1999), exemplified by the most common haplogroup of Finns, N3, which is of eastern origin (Rootsi et al. 2007) and has a frequency of about 60% (Lappalainen et al. 2006). Some eastern mtDNA haplogroups have been detected in Finns (Torroni et al. 1996, Meinilä et al. 2001), and an increased eastern affinity has also been suggested by some genome-wide SNP studies (Bauchet et al. 2007).

5.3.3. Finns and their Saami neighbors

The Finns show signs of admixture with the Saami population that lives in the north of Finland (as well as in northern Sweden, Norway, and Russia) (Sajantila et al. 1995, Lahermo et al. 1999). The admixture is stronger in the northern and northeastern Finnish than in the central and southwestern ones (Meinilä et al. 2001, Hedman et al. 2007). While the Saami speak languages related to Finnish, and are relatively similar to Finns with respect to the Y chromosome (Lahermo et al. 1999 and Raitio et al. 2001), they differ from the Finns and other Europeans by their mitochondrial and nuclear DNA (Barbujani & Sokal 1990, Cavalli-Sforza et al. 1994, Sajantila et al. 1995, Lahermo et al. 1996, Huyghe et al. 2011). The Saami show mtDNA patterns and increased LD indicative of constant population size (Lahermo et al. 1996, Laan & Pääbo 1997, Kaessmann et al. 2002, et al. 2005, Huyghe et al. 2010). An apparent discrepancy between the variation patterns of genes (Saami vs. other populations) and of languages (Finno-Ugric vs. Indo-European) in the Baltic Sea area has sometimes been attributed to language replacements (Sajantila & Pääbo 1995, Laitinen et al. 2002). Interestingly, the Saami gene pool harbors eastern – Uralic and non-European – contributions of up to 50% (Guglielmino et al. 1990, Meinilä et al. 2001, Tambets et al. 2004, Ingman & Gyllensten 2007, Johansson et al. 2008, Huyghe et al. 2011).

5.3.4. Genetic structure within Finland

Within Finland, mtDNA distribution is relatively homogeneous (Vilkki et al. 1988, Hedman et al. 2007), although small regional differences have been detectable (Palo et al. 2009). The Y chromosome, however, shows striking differences between eastern and western Finland (Kittles et al. 1998, Raitio et al. 2001, Hedman et al. 2004, Lappalainen et al. 2006, Palo et al. 2007, 2008). These contrasting patterns have been suggested to result from a male-biased Scandinavian gene flow into the western areas (Palo et al. 2009). East-west differences also exist in classical blood group markers (see Norio 2003b p. 464 for a review), HLA (Siren et al. 1996), and nuclear markers (Barbujani & Sokal 1990). These differences are unsurprising in view of Finnish population history (see section 5.1): the eastern and western parts of the country have been subject to different external influences at least since the Corded Ware culture and Bronze Age; they were separated by the medieval border between Sweden and Russia; and their demographic histories, for example population densities, have differed up until present times. Similar east-west differences are evident in many non-genetic phenomena (see Norio 2003b Figure 2 for maps), e.g., eastern and western dialects of Finnish (Leskinen 1999), in mammalian fauna (Heikinheimo et al. 2007),

and even in the predominance of major vs. minor key in tunes (Toiviainen & Eerola 2005). Between northern and southern Finland genetic differences are even larger than between east and west; additionally, the northern areas exhibit a strong substructure (Jakkula et al. 2008). Eastern and northern areas also show reduced genetic diversity, both in the Y chromosome (Raitio et al. 2001, Hedman et al. 2004, Lappalainen et al. 2006) and autosomally in terms of increased LD and extended regions of homozygosity (Varilo et al. 2003, Uimari et al. 2005, Service et al. 2006, Jakkula et al. 2008). The mtDNA variation and LD patterns are compatible with a population expansion (Lahermo et al. 1996, Laan & Pääbo 1997, Kaessmann et al. 2002). Furthermore, the percentage of eastern elements (see section 5.3.2) is higher in eastern than in western Finland (Lappalainen et al. 2006), and eastern Finns are genetically close to (Lappalainen et al. 2008). The Swedish-speaking minority of Finland lives mainly in the coastal regions of Ostrobothnia and southern and southwestern Finland, descends predominantly from 12th to 14th century immigrants, and is genetically intermediate between Finns and Swedes. Estimates of the extent of admixture with the Finnish-speaking population range from 60% (Virtaranta-Knowles et al. 1991) to 75% (Nevanlinna 1984). The admixture has likely taken place at or since an early stage of the population’s history because it appears to be geographically homogeneous (Virtaranta-Knowles et al. 1991). On an extremely fine geographical scale, the blood-group frequencies within Finland have differed more between single than when averaged over communes or provinces (Nevanlinna 1972). Few subsequent datasets have allowed analyses with similar resolution. Recently, however, neighboring communes in northern Finland were shown to differ from each other conspicuously (Jakkula et al. 2008).
Consequently, genetic differences in the area allow prediction of parental birthplaces at a median accuracy of 23 km (Hoggart et al. 2012). Additionally, a similar fine-scale population structure may explain part of the striking Y-chromosomal difference between the small commune of on the western coast and many Finnish provinces (Palo et al. 2008). This phenomenon is not unique to Finland, however, but exists for example in , Scotland, and Croatia (O’Dushlaine et al. 2010b).

5.3.5. Phenotypic variation in Finland

The genetic structure of the Finnish population is also reflected in its phenotypes. Perhaps the best demonstration is the Finnish disease heritage (FDH), a group of at least 36 monogenic disorders far more common in Finland than elsewhere in the world (Norio et al. 1973, Peltonen et al. 1999, Norio 2003a, 2003b, 2003c). Revealingly, many of them are concentrated in eastern and

northern Finland, and some show local enrichment (Norio 2003a, 2003c), demonstrating founder effects and genetic drift. On the other hand, stemming from the same effects, several diseases are markedly rarer in the Finnish population than in other European or Caucasian populations. Examples include cystic fibrosis (Kere et al. 1989), phenylketonuria (Guldberg et al. 1995), medium-chain acyl-CoA dehydrogenase (MCAD) deficiency (Pastinen et al. 2001), and Friedreich ataxia (Juvonen et al. 2002). In addition to the FDH diseases, a number of other diseases and mutations show geographic variation within Finland. An east-west difference for the cardiovascular diseases (section 5.3.6) is perhaps the most conspicuous. Such patterns can shed light on gene flow or on other aspects of population history, as with schizophrenia (Hovatta et al. 1997), multiple sclerosis (Tienari et al. 2004), cystic fibrosis (Kere et al. 1994), and the mutations of CYP21 (Levo et al. 1999) and of BRCA1 and BRCA2 (Sarantaus et al. 2000). All in all, the history and genetic structure of the Finnish population have enabled the mapping of a multitude of disease genes, especially for monogenic disorders (Peltonen 2000, Kere 2001). Reduced diversity and increased LD, combined with the good performance of HapMap-based tagging SNPs (Willer et al. 2006, Service et al. 2007, Lundmark et al. 2008) make Finns an attractive target population also for complex disease gene studies (Peltonen et al. 2000, Varilo & Peltonen 2004, Kristiansson et al. 2008).

5.3.6. East-west difference in cardiovascular diseases in Finland

Cardiovascular diseases (CVDs) are more common in Finland than in most other Western , particularly in males (Kuulasmaa et al. 2000, Sans et al. 1997). Furthermore, they show substantial variation within Finland, being more common in eastern Finland, especially in (Tuomilehto et al. 1992, Jousilahti et al. 1998, Karvonen et al. 2002, Havulinna et al. 2008). For example, age-standardized attack rates (per 100,000 / year) for acute myocardial infarction (AMI) in 1991 for men aged 35 to 64 years were 769 in western Finland (Turku and ), 776 in eastern Finland (Northern Savo), and 860 in North Karelia (Salomaa et al. 1996), while corresponding rates were 499 in Estonia (Tallinn) in 1991 (Laks et al. 1999), 367 in Sweden () in 1990-1994 (Wilhelmsen et al. 1997) and 355 in Denmark (Glostrup) in 1988-1991 (Kirchhoff et al. 1999). In 1988-1992, the corresponding coronary-event rate was 483 in western and 603 and 697 in eastern Finland (Northern Savo and North Karelia, respectively) (Kuulasmaa et al. 2000). Such internal variation is not unique to Finland but exists for example in Sweden (Nerbrand et al. 1991, Hammar et al. 1992, Johansson et al. 1992) and in Britain (Morris et al. 2001, 2003). However, in the other populations the majority of these gradients may be explained by regional differences

in traditional risk factors (Hammar et al. 1992, 2001; Morris et al. 2001) like serum cholesterol levels (Rosengren et al. 1999) and blood pressure (Shaper et al. 1991), or by environmental variation such as drinking-water constituents (Nerbrand et al. 1992). In Finland, in contrast, only 30 to 40% of the east-west difference seems to be explained by traditional risk factors (Pekkanen et al. 1992, Jousilahti et al. 1998). CVDs have shown a declining trend in western countries (Sans et al. 1997, Tunstall-Pedoe et al. 1999, Kuulasmaa et al. 2000). In Finland, this decline has been markedly rapid (Salomaa et al. 2003, Pajunen et al. 2004): for instance, coronary mortality declined 80% between 1972 and 2007 (Vartiainen et al. 2010). This decline stems mostly from favorable development in the levels of classical risk factors (Vartiainen et al. 1994, 2010), partly initiated by the North Karelia project (Salonen et al. 1979, Puska et al. 1998), but the population has benefited also from new medical treatment (Salomaa et al. 1999, Tunstall-Pedoe et al. 2000). Interestingly, the main reasons for the decline may differ between eastern and western Finland (Salomaa et al. 1996). Nevertheless, the east-west difference has mostly persisted for AMI incidence (Karvonen et al. 2002, Havulinna et al. 2008), unlike that for stroke (Sarti et al. 1994). Persistence of the east-west difference has evoked the question of its origin, especially because current differences in risk factors are relatively small (Similä et al. 2005). Some of the variation may correlate with environmental factors such as ground-water geochemistry (Kaipio et al. 2004, Kousa et al. 2004, 2006) or cold climate (Gyllerup et al. 1993, Gyllerup 2000, Näyhä 2005, Analitis et al. 2008). Studies of Finnish immigrants in Sweden and their non-immigrant twins demonstrate effects of life style or environment or both on CVD risk and subclinical atherosclerosis (Jartti et al.
2002a, 2009; Hedlund et al. 2007). On the other hand, some such factors seem to be more dependent on place of birth than on place of residence (Jartti et al. 2002b). For men in the Helsinki area, eastern birthplace contributes to CVD risk independently of classical risk factors (Tyynelä et al. 2009, 2010). Studies of children and young adults suggest a role for genetic factors or for environmental factors early in life or prenatally (Pesonen et al. 1986, 1990; Juonala et al. 2005). Furthermore, the susceptibility to the classical risk factors may vary geographically (Jartti et al. 2002b), possibly for genetic reasons. Unsurprisingly, some genetic risk factors differ in frequency between eastern and western Finland (Aalto-Setälä et al. 1991, Juonala et al. 2006).

5.3.7. Age estimates of Finnish genes

Relatively few genetic timings exist for the Finnish population. While the diversity of the mtDNA as a whole is not reduced, the slowly evolving positions in the control region show a decrease in diversity, and served to date

a mitochondrial bottleneck to approximately 3,900 years ago (confidence interval (CI) not reported; Sajantila et al. 1996). For the Y chromosome, early analysis determined an age of about 4,900 or 2,300 years for one haplogroup and about 1,600 or 800 years for another (the latter estimates for each haplogroup with CIs of 1,000 to 8,100 and 300 to 2,700 years, respectively; Kittles et al. 1998). In the context of the whole Baltic area, two mtDNA haplogroups have coalescence ages of 37,000 and 68,000 years, and the most common Y haplogroups 7,700 to 10,700 years (CIs are in the range of 10,000 years or more for the mitochondrial and about 3,000 years for the Y-chromosomal timings; Lappalainen et al. 2008). With a different estimate of the relevant mutation rate, however, the latter ages would be lower, in the range of 3,000 to 4,000 years (Lappalainen 2009). Additionally, age estimates exist for several disease mutations. These are often based on the geographic distribution of the disease and the width of LD detected around the gene, and the compatibility of these with historically known settlement events (Varilo 1999); more direct estimates also exist. Mutation ages are typically 40 generations or less (Varilo et al. 1996a, 1996b; Moisio et al. 1996, Sarantaus et al. 2000, Mykkänen et al. 2004), meaning that they will shed light mainly on population events during centuries for which historical information already exists. Furthermore, the patterns of disease alleles – both rare and deleterious – may be unrepresentative of events that have affected the population’s gene pool as a whole. Studies of ancient DNA that could directly illuminate the genetic history of the Finnish population and vitally assist in the timing of relevant events have not been published to date. They will probably remain uncommon, because suitable materials are lacking, due to the poor preservation of organic samples in the acidic Finnish soil.
Meanwhile, elaborate population simulations have estimated the likely effects of various population scenarios – especially the presence and strength of population bottlenecks – on the current genetic composition of the Finnish population. One of the insights gained is that relatively minor, continuous migration – i.e., the type hardly visible in the archaeological record – can easily generate much stronger effects than single bottlenecks can (Sundell et al. 2010).
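
The statement that mutation ages of 40 generations or less fall within historically documented centuries can be illustrated with a small calculation. The sketch below converts an age in generations into years. The generation lengths of 25 and 30 years are illustrative assumptions, not values taken from the cited studies, and the function name is hypothetical.

```python
def generations_to_years(generations, years_per_generation):
    """Convert an age expressed in generations into years."""
    if generations < 0 or years_per_generation <= 0:
        raise ValueError("generations must be >= 0 and generation length > 0")
    return generations * years_per_generation


# Assumed generation lengths in years (illustrative, not from the source).
ASSUMED_GENERATION_LENGTHS = (25, 30)

# Age in years of a 40-generation-old mutation under each assumption.
ages = {
    length: generations_to_years(40, length)
    for length in ASSUMED_GENERATION_LENGTHS
}

for length, age in ages.items():
    print(f"{length} years/generation: 40 generations = {age} years")
```

Under these assumptions, 40 generations correspond to roughly 1,000 to 1,200 years, which is short compared with the haplogroup ages of several thousand years cited above.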

6. Sweden

6.1. Population prehistory and history

After the , the first inhabitants of the area of present-day Sweden were hunter-gatherers who followed their prey species, especially , towards the north as the glacial ice retreated. The settlers came from two directions: southern Sweden, then still a part of Europe, received its first occupants from the south by 14,000 years ago, while the north was populated by immigrants arriving north via the Norwegian coast. Interestingly, a similar bidirectional pattern of postglacial colonization exists in some animal species such as the (Hewitt 2000) and the common frog (Knopp & Merilä 2009), with hybridization zones across northern Sweden. By 4000 BC, most of Sweden was inhabited; population size has been estimated as in the vicinity of 25,000 (Siiriäinen 2003a, 2006). Hunting and gathering remained the only means of subsistence for millennia, until agriculture spread into the southern and central parts of Sweden from the south along with the Funnel Beaker culture around 4000 BC (Figure 9A). Even in the southern areas, the adoption of agriculture was slow, and it became the main source of livelihood only after several centuries. Close to the Funnel Beaker culture, in central and southern Sweden, the relied exclusively on hunting – especially seal hunting in the coastal areas – and gathering. It may have had contacts with the Comb Ceramic . Meanwhile, in northern Sweden, influences from Finland and blended with the earlier culture. Some researchers see this as the beginning of development towards the present-day Saami (Siiriäinen 2003a, Lindqvist 2006). In 2300 BC, the Corded Ware culture extended over southern Sweden (Figure 9B), and despite the worsening climate at the end of the Stone Age, agriculture spread northwards. The transition from the Stone Age to the Bronze Age in 1800 BC was characterized by an overall continuity: the influx of copper and bronze from central and western Europe had started earlier and continued during the Bronze Age.
Imports were the only source of metal; it was traded for amber, fur, and seal skins. These were acquired through northward and eastward contact that reached for instance coastal Finland by 1500 BC (Figure 10A). In the same era, pottery from Finland extended to the aceramic northern Sweden. Notably, this region of influence roughly corresponds to the early historical Saami area (Siiriäinen 2003a).


Figure 10. Geographic distribution of the Scandinavian Bronze culture in ca. 1500-500 BC according to Huurre (2005) (A), and area of the Swedish kingdom in 1659 according to Melin et al. (2003) and Stadin (1999) (B).

In the beginning of the Iron Age, the climate grew colder and damper. Earlier, the small number of findings from the early pre-Roman Iron Age (500 BC - 0 AD) in Sweden was interpreted as settlement discontinuity, but this view is no longer valid. More likely, the climate-related difficulties in agriculture led to some degree of population decline and emigration, but the scarcity of findings may in part stem from Celtic population movements that interrupted the former trade routes from Sweden to southern Europe. When these connections were restored in the Roman Iron Age (0 - 400 AD), many coins and luxury goods appear in the archaeological record. During the (400-550 AD), the main population movements did not reach Sweden, but the era seems to have entailed some kind of unrest, judging from the number of fortifications built and treasures buried (Baudou 1999, Melin et al. 2003, Lindqvist 2006). The increasing fur trade, expanding agriculture, and accumulating wealth led to profound changes in society in southern , as chieftains and petty kingdoms emerged. Towards the end of the Iron Age, in the Vendel Era (550-800 AD), one such center formed in (the area north of current Stockholm). It featured flourishing trade, powerful towns (including ), and influence and contacts that extended north- and southward along the Swedish coast and to as well as across the Baltic Sea to the coasts of Finland and Estonia (Melin et al. 2003, Lindqvist 2006). Starting in the late 8th century, exerted Scandinavian influence over large parts of Europe for 300 years. The Vikings in western Europe originated mainly from Norway and Denmark, whereas the Vikings active in eastern Europe came from eastern Sweden – the same area that had already

grown in power in the Vendel Era. Viking activities in western Europe consisted of varying periods of trade and raid, but contacts in the east seem to have been predominantly peaceful, involving cultural blending with Slavs, , and Finns. In the end of the 9th century, Viking influence extended all the way to the of Kiev (in present-day Ukraine), but by the end of the 10th century, the city had been slavified and the Vikings’ descendants assimilated. Notably, although the Viking culture influenced large areas of Europe, it dominated neither in Finland nor in the Saami areas of northern and central Scandinavia. In addition to bringing Scandinavian influence to Europe, Vikings transferred European cultural elements to Scandinavia (Sawyer 2003, Roesdahl & Sørensen 2003, Melin et al. 2003). Christianity had slowly diffused through trade and other contacts for over a century by the time a king of Sweden first converted to Christianity around the year 1000 AD. The transition continued slowly, with Christian and pagan traditions living side by side throughout the 11th century. Nevertheless, Christianity played an important role in the unification of the country. Thus, during the late and early , the often unstable petty kingdoms with their bitter rivalries were gradually replaced by the Scandinavian kingdoms. Sweden was the last one to emerge: its core regions and Götaland were separated from each other by an impenetrable forest and only became united under a common rule in the 12th century (Sawyer & Sawyer 2003, Lindkvist 2003a, 2003b). In subsequent centuries, the processes of expansion and unification continued. Finland became a part of Sweden in the 12th century, and the first border between Sweden and Russia was drawn in 1323 across Finland from northwest to southeast. In 1319-1343, Sweden belonged to a union with Norway, and in 1397-1521 to the Union with Norway and Denmark.
After a series of wars, the Swedish kingdom was at its largest in 1659 (Figure 10B). The areas outside the current borders of Sweden were lost by 1721, apart from Finland, which was transferred to Russian rule in 1809. In 1814, Sweden and Norway joined a union which dissolved in 1905 (Stadin 1999, Melin et al. 2003, Lindkvist 2003a, Lindqvist 2006). The population size within the current borders of Sweden was in the range of 500,000 - 650,000 in the early 14th century, after having roughly tripled since 750 AD. The 14th century saw slowing population growth and even population declines in southern Sweden, as in large parts of Scandinavia. Meanwhile, spread of settlement and population growth continued in northern Sweden. The Swedish colonization of began in the 14th century, and as the Finnish settlement of the Tornio valley was also in progress, an ethnic border formed close to the present-day border between Finland and Sweden. Additionally, the population in Swedish towns was markedly influenced during the second half of the 13th and first half of the 14th century

by immigrant German merchants who later assimilated into the population (Benedictow 2003, Vahtola 2003b, Dahlbäck 2003, Lindqvist 2006). In the 16th and 17th centuries, population growth was strong. In 1570, the population size was about 600,000, and almost 1.5 million in 1721. In the first half of the 18th century, growth was limited by wars, epidemics, and low crop yields, leading to a population of 1.8 million in 1750, 2.0 million in 1767, and 2.3 million in 1800. Between 1815 and 1860, the population grew by 60%, from 2.5 million to 4 million. The population surplus was discharged by mass migration: in 1821-1930, 1.2 million Swedes emigrated, 1.0 million of them permanently. For example in 1910, 20% of all Swedes lived in the . Since the 1930s, immigration has exceeded emigration, together with internal population growth leading to a population of 9.5 million (May 2012) (Samuelsson 1999, Häger 1999, Melin et al. 2003, Vahtola 2003b, Norman 2003, Lindqvist 2006, Statistics Sweden 2012).
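
The growth figure for 1815-1860 can be checked, and expressed as an average yearly rate, with a short calculation. The sketch below uses only the population sizes quoted above. The function names are hypothetical, and the assumption of constant compound growth over the period is a simplification made here for illustration, not a claim from the cited sources.

```python
def relative_growth(start, end):
    """Total relative growth between two population sizes."""
    return (end - start) / start


def annual_growth_rate(start, end, years):
    """Average yearly rate, assuming constant compound growth (a simplification)."""
    return (end / start) ** (1.0 / years) - 1.0


# Population sizes in millions, as quoted in the text for 1815 and 1860.
total = relative_growth(2.5, 4.0)
yearly = annual_growth_rate(2.5, 4.0, 1860 - 1815)

print(f"total growth: {total:.0%}")
print(f"average yearly growth: {yearly:.2%}")
```

The total growth reproduces the 60% given in the text, and under the compound-growth assumption it corresponds to an average of about 1% per year.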

6.2. Language

Swedish belongs to the world’s largest language family, Indo-European, more specifically to its North Germanic branch. Other North Germanic languages are Danish, Norwegian, Icelandic, and Faroese; Germanic languages include for instance German and English ( 2005). Swedish is spoken by more than 95% of the Swedish population, i.e., about 8.5 million people. The largest linguistic minority in Sweden consists of some 250,000 Finnish-speakers. Apart from the population next to the Finnish border in the north, they are mainly recent immigrants. Additionally, Saami speakers in the north number 10,000 to 15,000 (Vikor 2002). An Indo-European language form may have first arrived in the area of present-day Sweden by 2000 BC or possibly earlier, but such estimations remain tentative (Barnes 2003). The subsequent gradual divergence of the can be studied from runic inscriptions, the oldest of which date to the 4th century AD ( 2003). They document a language form consisting of mutually intelligible dialects slowly diverging from the 7th century AD onwards (Bergman 2003), with the final split between Swedish and Danish taking place during the (Pettersson 2005). Meanwhile, the language expanded geographically; from its 7th century area covering Denmark and the southern and central parts of Sweden and Norway, it first spread towards the Saami areas in the north, then to the British Isles, further to Iceland, Faroes, and Greenland in the 9th and 10th centuries, and to the coastal areas of Finland and Estonia during the 11th and 12th centuries (it later died out in Greenland and the British Isles) (Vikor 2002). About one-third of the Swedish vocabulary lacks cognates (words of common etymological origin) in non-Germanic Indo-European languages,

perhaps as a result of contact with an unknown language previously spoken in the areas where the proto-Germanic language then extended (Barnes 2003). During historical times, the has acquired loanwords from several directions: from German in the 14th to 16th centuries through intensive trade interactions, from French in the 17th and 18th centuries mainly through royal and elite contacts, and recently from English. Other loan sources include Greek and especially Latin, acquired both directly and mediated through other languages (Pettersson 2005). Expectedly, place names of Finnish and Saami origin exist in northern Sweden, and Finnish names in the mid-Swedish areas that received Finnish immigrants during the 16th and 17th centuries (Pettersson 2005). In southern Scandinavia, almost all place names are Germanic, with the majority of the exceptions seemingly of Saami origin (Barnes 2003). The Swedish language can be divided into five major groups. The northernmost is spoken in Norrland, except for areas close to the Finnish border where Finnish dominates. This dialect carries signs of contacts with Norwegian in the southwest areas. The region of another dialect group roughly corresponds to Svealand. Götaland features two dialect groups, spoken in its northern and southern parts; the latter demonstrates an influence from Danish. The eastern dialect group includes the dialects spoken in Finland and earlier (until World War Two) also in Estonia. These resemble dialects from the western coast of , and, along with the dialect spoken on Gotland, have retained many archaic features (Bergman 2003).

6.3. Genetic structure

6.3.1. Swedes compared to other European populations

Serologic studies of the Swedish population are numerous (see Einarsdottir et al. 2007 and the references therein), but molecular population studies of the Swedes have been less extensive than for the Finns. A classical study based on a small number of autosomal markers found the Swedes to be genetically close to Norwegians and relatively similar to other nearby populations such as the Danes, Germans, and Dutch, with the exception of the geographically neighboring Finns and especially Saami, and the linguistically related Icelanders (Cavalli-Sforza et al. 1994). The Swedes resemble general European populations in their overall level of mtDNA diversity (Lappalainen et al. 2008), their mitochondrial haplogroup frequencies (Tillmar et al. 2010), and their X-chromosomal STR haplotype frequencies (Tillmar 2012). The most common mitochondrial haplogroup in Sweden is H1, with a frequency of 13% (Lappalainen et al. 2008), which originates from the glacial refugium in Iberia (Torroni et al. 2006 and references therein).

The Swedes resemble most other northern Europeans also in relation to their Y chromosomes. Their diversity of Y-chromosomal haplogroups (Lappalainen et al. 2008) and STR haplotypes (Holmlund et al. 2003) corresponds to European levels. Their haplogroup distribution is similar to Europeans ( et al. 2006) and even to western – but not eastern – Finns (Lappalainen et al. 2008), whereas their STR haplotypes are closely related to Danes’, Norwegians’, and Germans’ but differ significantly from Finns’ (Holmlund et al. 2006). The most common Y-chromosomal haplogroup in Sweden, at a 36 to 38% frequency, is I1a (Karlsson et al. 2006, Lappalainen et al. 2008, 2009). It has its origin in the Iberian refugium (Rootsi et al. 2004), and may have spread to Sweden by Neolithic migrations from (Lappalainen et al. 2008). Several other Y-chromosomal haplogroups also seem to have arrived in Sweden from the south (Lappalainen et al. 2008). Swedes have also featured in many of the recent studies of European population structure that have used genome-wide SNP data. There, they match the general pattern of an overall correspondence of genetic and geographic distances, with for example Norwegians, Danes, Germans, and Poles appearing as their closest genetic neighbors, whereas the Finns seem more distant from the Swedes genetically than geographically (Lao et al. 2008, Novembre et al. 2008, Heath et al. 2008, McEvoy et al. 2009, Nelis et al. 2009, Tian et al. 2009, O’Dushlaine et al. 2010a, Leu et al. 2010). Another study on a genome-wide dataset of Europeans underlines the Swedes’ close genetic relationship with neighboring populations by showing that the closest genetic match for any Swede was actually more likely a Norwegian, Dane, or even a German than another Swede (Lu et al. 2009).

6.3.2. Genetic structure within Sweden

Within Sweden, variation in mitochondrial and Y-chromosomal haplogroup frequencies is mostly clinal, lacks strong borders, and shows signs of genetic drift in various regions throughout the country, especially in the Y chromosome (Lappalainen et al. 2009). Haplogroup frequencies demonstrate influence from the surrounding populations: from Denmark or in the southern part of the country, and from Finland in northern Sweden, eastern Svealand, and (Lappalainen et al. 2009). Influence from Norway is visible in the western provinces of , Dalarna, and Värmland (Lappalainen et al. 2009). However, the main genetic features of Värmland may stem from migrational patterns unrelated to Norwegians (Karlsson et al. 2006). In the contemporary population, both mitochondrial and Y-chromosomal haplogroups reveal signs of recent immigration in the cities of Malmö and Gothenburg (Lappalainen et al. 2009). Additionally, the frequency of one Y-chromosomal haplogroup differs between the eastern and

western parts of southern Sweden, possibly since prehistoric times (Karlsson et al. 2006). The estimated ages for the four most common Y-chromosomal haplogroups range from 3,300 to 9,100 years, with wide confidence intervals (Karlsson et al. 2006). The distribution of mtDNA variation is compatible with marked population expansion (Kaessmann et al. 2002, Tillmar et al. 2010). As expected, allele frequencies within Sweden vary also in some autosomal markers (Hannelius et al. 2005). Analogously, many monogenic diseases show enrichment in various parts of Sweden, especially in the northern areas (cf. below; Gustavson & Holmgren 1991). Furthermore, autosomal markers show that the of Gotland differs from the Swedish mainland: one marker demonstrates an influence from the Finno-Ugric populations on the eastern side of the Baltic Sea (Beckman et al. 1998), and another one an increased affinity to Baltic populations, possibly mediated by the Finno-Ugric contacts (Sistonen et al. 1999). In contrast, however, Gotland does not markedly differ from mainland Sweden with respect to Y-chromosomal or mitochondrial variation (Zerjal et al. 2001, Karlsson et al. 2006, Tillmar et al. 2010). The population structure in northern Sweden is of special interest, owing to its historically low population density and the long-standing presence of three ethnic groups, Swedes, Finns, and Saami (see Einarsdottir et al. 2007 for a review of the region’s population history). Finnish genetic influence manifests in the elevated frequency of Y-chromosomal haplogroup N3 (Karlsson et al. 2006, Lappalainen et al. 2008, 2009). Autosomal markers suggest that in the areas along the Finnish border, a majority of the gene pool may be of Finnish origin (Nylander & Beckman 1991). Whereas the Saami differ markedly from the Swedes (Zerjal et al. 2001), admixture with the Saami is visible in both mitochondrial (Lappalainen et al. 2009) and Y-chromosomal (Karlsson et al.
2006) haplogroup frequencies of northern Swedes, and autosomal markers have yielded estimates of local admixture proportions up to one-third (Nylander & Beckman 1991). Reciprocally, signs of Swedish admixture are detectable in the Saami population (Karlsson et al. 2006), especially the southern Saami (Ingman & Gyllensten 2007, Johansson et al. 2008). Both marital patterns and autosomal allele frequencies in the area in part follow river valley geography (Einarsdottir et al. 2007). Furthermore, Y-chromosomal haplogroup frequencies in Västerbotten may reflect the high degree of central European immigration in the 17th century (Karlsson et al. 2006). In addition to genetic drift, founder effects and admixture, the genetic structure in the north may have been shaped by the relatively high frequency of consanguineous marriages (Bittles & Egerbladh 2005). These factors have also had an impact on phenotypic variation in the area, visible in the increased prevalence of several monogenic diseases (Gustavson & Holmgren 1991) as well as in more complex phenotypes. Consequently, the northern Swedish population has usefully served in the mapping of disease genes (Burstedt et al. 1999, Venken et al. 2005).

A study of ancient DNA has shown a discontinuity in mitochondrial lineages between the contemporary Swedish population and the Pitted Ware culture hunter-gatherers living in Gotland from 2800 to 2000 BC, while the latter are more closely related to contemporary Baltic populations (Malmström 2009). The hunter-gatherers have also demonstrated a substantially lower frequency of the allele than does the extant Swedish population (Malmström 2010). These patterns may reflect population replacement related to the introduction of agriculture. Contrastingly, however, the high frequency of paleolithic relative to neolithic Y-chromosomal haplogroups in the contemporary Swedish population may signal continuity of the (male) population through the adoption of agriculture (Karlsson et al. 2006). A recent aDNA sequencing study showed the Funnel Beaker farmers and Pitted Ware hunter-gatherers to differ markedly both from each other and from present-day Swedes. Interestingly, the proportion of ancient farmer ancestry varies within the contemporary Swedish population, being lowest in the north (Skoglund et al. 2012).

In genome-wide data, the Swedish population substructure has appeared subtle. A comparison of control datasets with differing geographic distributions within Sweden detected no differences in terms of FST values (Leu et al. 2010). Recently, a direct study of the population substructure (Humphreys et al. 2011) detected genetic distinctness of the northernmost areas similar to that observed earlier in other types of markers (see above) and genome-wide (Study IV). Additionally, it revealed unique genetic features in Dalarna, possibly resulting from Finnish or Norwegian ancestry (cf. results from other marker types above).
Furthermore, genome-wide SNP data have served in scanning for signals of recent positive selection (Lappalainen et al. 2010) and for copy number variants and regions of homozygosity (Teo et al. 2011) in the Swedish population.

Aims of the study

The overall aim of this study was to analyze the genetic structure of the Finnish and Swedish populations. Previous studies into their structure have concentrated mainly on the Y chromosome and mitochondrial DNA, which may, however, fail to yield a comprehensive picture of the genetic variation, especially the autosomal variation most relevant to diseases and other phenotypes. This study therefore utilized recent methodological advances in SNP genotyping to investigate the structure of these populations by use of large numbers of autosomal markers.

The specific objectives of this study were the following:

1. to study the genetic structure and variation in northern Europe predominantly from a population historical perspective using genome-wide autosomal marker sets (III, IV)

2. to study the autosomal substructure of the Finnish population, with an emphasis on possible differences between eastern and western Finland (II, III)

3. to study the autosomal substructure of the contemporary Swedish population (II, IV)

4. to propose and theoretically test a novel gene-mapping approach, subpopulation difference scanning (SDS), which utilizes the population substructure for localizing genes for diseases whose incidence differs between closely related subpopulations (I)

Materials and methods

1. Samples

For a summary of the Finnish and Swedish datasets genotyped for this study, see Table 1, and for the locations of the provinces sampled see Figure 11.

Table 1. Swedish and Finnish genotype datasets in this study: number of genotyped markers and numbers of individuals per country, region, and province.

Area                                Abbr.   Dataset 1  Dataset 2  Dataset 3  Dataset 4  Dataset 5
Used in publication(s)                      I, II      II         III, IV    IV         IV
Number of SNPs (a)                          30 (b)     34         201 011    29 339     353 565

Number of individuals by area and abbreviation (c)
Sweden                              SWE     0          1599       113        755        1525
  Norrland                          NORR    0          256        0          115        237
    Norrbotten                      NBO     0          98         0          na         73
    Västerbotten                    VBO     0          64         0          na         37
    Västerbotten + Norrbotten       VNBO    0          na         0          53         na
    Västernorrland                  VNO     0          30         0          31         75
    Jämtland                        JAM     0          29         0          0          18
    Gävleborg                       GAV     0          35         0          31         34
  Svealand                          SVEA    0          692        0          245        545
    Stockholm                       STO     0          407        0          104        214
    Uppsala                         UPP     0          56         0          26         39
    Södermanland                    SMA     0          50         0          32         43
    Värmland                        VAR     0          46         0          na         95
    Dalarna                         DAL     0          40         0          na         24
    Värmland + Dalarna              VARD    0          na         0          21         na
    Örebro                          ORE     0          55         0          42         77
    Västmanland                     VMA     0          38         0          20         26
  Götaland                          GOTA    0          651        0          395        743
    Östergötland                    OGO     0          64         0          42         82
    Jönköping                       JON     0          57         0          35         50
    Kalmar                          KAL     0          31         0          na         na
    Gotland                         GOT     0          6          0          na         na
    Kalmar + Gotland                KALG    0          na         0          17         55
    Kronoberg                       KRO     0          19         0          30         48
                                    BLE     0          21         0          34         50
    Skåne                           SKA     0          193        0          68         114
    Malmö                           MAL     0          na         0          29         29
    Halland                         HAL     0          60         0          38         61
    Västra Götaland                 VGO     0          200        0          74         172
    Göteborg (Gothenburg)           GOB     0          na         0          28         82
  Swedes w/o further information    SWEmix  0          0          113        0          0
Finland                             FIN     440        572        280        0          0
  Western Finland                   FIW     191        322        141        0          0
    Southwest Finland               SWF     33         85         71         0          0
    Satakunta                       SAT     55         67         16         0          0
    Swedish-speaking Ostrobothnia   SSOB    18         21         20         0          0
    Southern Ostrobothnia           SOB     43         54         13         0          0
    Häme                            HAM     41         88         17         0          0
    Mixed western Finnish ancestry  FIWmix  1          7          4          0          0
  Eastern Finland                   FIE     249        250        139        0          0
    Northern Ostrobothnia           NOB     107        78         52         0          0
    Kainuu                          KAI     0          0          18         0          0
    Northern Savo                   SAV     84         88         40         0          0
    North Karelia (d)               NKAR    17         30         14         0          0
    (d)                             SKAR    36         48         0          0          0
    Mixed eastern Finnish ancestry  FIEmix  5          6          15         0          0
Individuals in total                        440        2171       393        755        1525

na = not applicable.
(a) Numbers of markers may differ between analyses.
(b) STRs, not SNPs.
(c) Individuals from Sweden categorized into provinces according to place of birth (Dataset 2) or place of residence (Datasets 4 and 5), and individuals from Finland according to birthplace of their grandparents.
(d) Includes areas currently belonging to Russia.
Numbers reported in original publications may differ by a few individuals due to different QC thresholds.

1.1. Finns (I, II, III, IV)

The Finnish individuals in Datasets 1, 2, and 3 (440, 572, and 280, respectively) were male blood donors recruited by the Finnish Red Cross in 1998-1999. They were 40 to 55 years old at the time, and selection was based on their grandparental birthplaces, which were located in the same area of Finland, mostly within a single province. They gave their informed consent to participate in a population genetic study, and blood collection was approved by the ethics committee of the Finnish Red Cross. Only the birthplaces of the subjects’ grandparents were recorded, while the subjects themselves remained anonymous. Originally, the collection comprised 971 individuals. Individuals in Datasets 1 and 2 represented the provinces targeted earlier in a Y-chromosomal study (Lappalainen et al. 2006). Dataset 3 comprised individuals with grandparental ancestry in areas of high (n = 139) and low (n = 141) acute myocardial infarction (AMI) incidence (Havulinna et al. 2008).

Geographic coordinates for the Finnish subjects were determined as an average of their grandparental birthplace coordinates obtained from Google Maps. The individuals were classified by provinces based on their grandparents’ birthplaces mainly following traditional Finnish province divisions, with small modifications, for instance, to maintain sufficient sample numbers per subpopulation. These provinces were further grouped into western or eastern ones based on previous knowledge (Lappalainen et al. 2006). As province coordinates, the average coordinates of the individuals representing the province served for the Mantel test, and the geographical center of the province (from Google Earth version 4.2) for other analyses.

1.2. Swedes (II, III, IV)

The Swedish individuals of Dataset 2 (n = 1599) comprised all children born in Sweden during one week in December 2003, complemented with 89 additional newborns from northern Sweden to ensure sufficient sample size in the more sparsely populated areas (Hannelius et al. 2007). The dataset thus includes children of immigrants as well as of native Swedes. Geographic information available was their birth hospital, and this was used to divide the population by provinces and regions. Geographic coordinates for the birth hospitals came from Google Earth version 4.2, and province coordinates were calculated as averages of all births in the province. The individuals were kept anonymous, and the study was approved by the ethics committee of the Karolinska Institute.

The Swedish individuals in Dataset 3 were 113 ethnically Swedish males. They originated mainly from the eastern part of the country, but with no detailed ancestry information. They were recruited with informed consent, and the recruitment was approved by the ethics committee of Umeå University.

Dataset 4 (n = 755) and Dataset 5 (n = 1525) comprise Swedish-born female subjects aged 50 to 74 recruited in 1993-1995 for a breast cancer study (Einarsdóttir et al. 2006). They were recruited with informed consent, and the procedure was approved by the ethics committee of the Karolinska Institute. All 755 individuals in Dataset 4 were controls, while an additional 770 cases were included in Dataset 5 (after removal of the most widely differing SNPs; see section 2) to provide enhanced geographic resolution for some analyses. The population was divided by provinces according to place of residence at the time of recruitment. Some of the smallest provinces were combined, and the cities Malmö and Gothenburg were separated out from their surrounding provinces. The provinces were further grouped into three regions – Norrland, Svealand and Götaland – according to a traditional division.
Coordinates for the Swedish individuals, based on their place of residence, were obtained from GeoNames (www.geonames.org).

1.3. Reference populations (III, IV)

The German reference group originated from the Kiel area in Schleswig-Holstein in northern Germany. They had been recruited for population genetic studies as described elsewhere (Krawczak et al. 2006). All had given their informed consent, and the procedure had been approved by the ethics committee of the Kiel Medical Faculty. A total of 256 male individuals took part in Study III, and 492, male and female, in IV. All data for analysis remained anonymous. The British reference group were male controls of the 1958 birth cohort. Their genotype data were kindly provided by the Wellcome Trust Case Control Consortium; a full list of the investigators contributing to generation of the data is available from www.wtccc.org.uk. Details of the sample collection have been described elsewhere (Wellcome Trust Case Control Consortium 2007). Geographic information was available on the subjects’ region of birth. A total of 296 individuals with birthplaces distributed throughout the country were used in Study III, and 740 individuals in IV.

Figure 11. Geographic location of Finnish and Swedish provinces studied. Colors denote grouping of provinces into regions (labeled NORR, SVEA, GOTA, FIW, and FIE in the figure). Abbreviations as in Table 1.

Additionally, we used two publicly available control genotype datasets: HapMap (International HapMap 3 Consortium et al. 2010) and the Human Genome Diversity Project (HGDP) (Li et al. 2008). The HapMap contains relatively high numbers of individuals from a handful of populations around the world, whereas the HGDP features a wider geographical coverage but a more limited number of individuals for most of its populations. From these datasets, especially those individuals of northern (central) European origin or descent were included in several analyses: CEU individuals (Utah residents with ancestry from northern and western Europe) from HapMap (58 in Study III, 101 in IV) and Vologda Russians from HGDP (25 in Study IV). In this thesis, the German, British, CEU, and Russian populations will be jointly referred to as central Europeans. Additionally, individuals from Yoruba in Ibadan, Nigeria (YRI), Japanese in Tokyo, Japan (JPT), and Han Chinese in Beijing, China (CHB) from the HapMap collection were included in selected analyses for global comparison.

1.4. Simulated population data (I)

We simulated a population history in which an initial population splits into three populations. The daughter populations grow exponentially from their initial size of 100, 400, or 800 for either 20 or 40 generations to reach a final size of 100,000. The range of initial sizes encompasses estimates of breeding unit size in a rural Finnish population in the 19th century (Nevanlinna 1972), and the growth times reflect those of the eastern and western Finnish subpopulations. Meanwhile, the end size was chosen based mainly on technical convenience, as most of the genetic drift will occur in the smallest generations at the start of the simulation. After the initial split, no subsequent migration occurred between daughter populations. The simulation had non-overlapping generations and random mating within the subpopulations, apart from sibling matings, which were excluded. At the end of each simulation, 50, 100, 200, 500, or 1000 non-sibling individuals were randomly sampled for the analyses from two of the daughter populations. Further details of the simulation can be found in Appendix A of Study I.
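The remark that most drift occurs in the small early generations can be illustrated with a short sketch. It is not the simulation code of Study I: it assumes geometric growth between the stated initial and final sizes and uses 1/(2N) per generation (the standard Wright-Fisher expectation for loss of heterozygosity) as a stand-in measure of drift. The function names are hypothetical.

```python
def size_trajectory(n0, n_final, generations):
    """Population sizes for generations 0..generations under exponential growth."""
    rate = (n_final / n0) ** (1.0 / generations)
    return [n0 * rate ** t for t in range(generations + 1)]


def drift_share_early(n0, n_final, generations, early):
    """Share of the summed per-generation drift, measured as 1/(2N),
    that falls in the first `early` generations (assumed drift measure)."""
    sizes = size_trajectory(n0, n_final, generations)
    per_generation = [1.0 / (2.0 * n) for n in sizes]
    return sum(per_generation[:early]) / sum(per_generation)


if __name__ == "__main__":
    # Initial sizes and growth times as described in the text above.
    for n0 in (100, 400, 800):
        for gens in (20, 40):
            share = drift_share_early(n0, 100_000, gens, gens // 2)
            print(f"N0={n0}, {gens} generations: "
                  f"{share:.1%} of drift in the first half")
```

Under these assumptions the first half of the growth period accounts for the bulk of the summed drift in every scenario, which is why the exact end size has little influence on the outcome.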

2. Markers, genotyping, and quality control

The markers of Dataset 1 were dinucleotide repeat microsatellites (i.e., STRs) from Linkage Mapping Set MD-10 (Applied Biosystems, Foster City, CA, USA). These microsatellites were known to have rare alleles in the Finnish population. They were genotyped with the MegaBACE 1000 capillary electrophoresis instrument (Molecular Dynamics/Amersham Biosciences, Sunnyvale, CA, USA), and the genotypes were called with Genetic Profiler software version 1.1. The genotypes were checked manually twice by a person blind to their province. Quality control (QC) exclusion criteria were success rates below 80% for both individuals and markers.

The SNPs of Dataset 2 were originally selected for zygosity testing as described in Hannelius et al. 2007. The source material for the samples in Datasets 1 and 3 to 5 and the Finnish samples of Dataset 2 was genomic DNA extracted by standard procedures, but the DNA of the Swedish samples of Dataset 2 had been extracted from Guthrie cards (Hannelius et al. 2005) and whole-genome amplified (see Study II for details) before genotyping. Dataset 2 was genotyped on the Sequenom MALDI-TOF platform by use of the iPLEX Gold chemistry (Sequenom, San Diego, CA, USA). Allele calling was done in the Spectro Typer 3.4 software and checked independently by two persons. SNPs and individuals with more than 20% missing data were excluded.

Dataset 3 was genotyped with a commercial genotyping array, Affymetrix 250K StyI (Affymetrix, Santa Clara, CA, USA), which contains 238,304 SNPs. The genotype calling was done in the Affymetrix GeneChip Genotyping Analysis Software (GTYPE) version 4.1 with the BRLMM algorithm. QC exclusion thresholds were a success rate of 97% for samples and 95% for SNPs, a HWE p of 0.001, and MAF of 0.005. Additionally, smaller marker sets for computationally intensive analyses were constructed by LD-based SNP pruning. The QC procedures are described in more detail in Study III.
Datasets 4 and 5 were genotyped on the Illumina HumanHap550 array (Illumina, San Diego, CA, USA). In QC, individuals with a success rate below 99% and SNPs with a success rate below 95%, HWE p < 1 × 10⁻⁷ or MAF < 0.05 were excluded. Dataset 4 was used in comparisons that involved Finns from Dataset 3 and individuals from reference datasets. These datasets had been genotyped on different platforms (the Finns on Affymetrix 250K, the HGDP samples on Illumina HumanHap650K, and the HapMap samples on Affymetrix 6.0 and Illumina 1M). The analyses employed only those SNPs available from all platforms, i.e., no genotype imputation was done in order to avoid possible artefacts from the imputation process. The details of the merging and QC filtering of the datasets can be found in Study IV. Dataset 5 was used for detailed analyses of the structure within Sweden. Because it contained both cases and controls, we removed the SNPs whose allele frequencies differed the most among them (p < 0.1 in a chi-square test); the remaining differences were small (Figure S10 in Study IV). In addition to analyses of all the genome-wide SNPs passing QC, some analyses used smaller SNP sets (mostly because of computational limitations) filtered on the basis of LD.

3. Analyses

3.1. Model-free clustering of individuals: multidimensional scaling

Identity-by-state (IBS) similarities between pairs of individuals were calculated in R 2.7.2 (R Development Core Team 2008, http://www.R-project.org) with package GenABEL 1.4–2 (Aulchenko et al. 2007), and visualized in R by classical multidimensional scaling (MDS) of the IBS distances (1-IBS). The variance explained by each of the resulting MDS dimensions was calculated by dividing its eigenvalue by the sum of all eigenvalues (Cox & Cox 2001).
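As a rough illustration of the quantities involved, the sketch below computes an IBS similarity as the mean proportion of alleles shared between two individuals, converts it to the 1-IBS distance, and derives the proportion of variance per dimension from a list of eigenvalues. The genotype coding (0, 1, 2 copies of one allele, `None` for missing) and all function names are assumptions for illustration; GenABEL's own computation may weight markers differently, and the eigen-decomposition step of classical MDS is not reproduced here.

```python
def ibs_similarity(g1, g2):
    """Mean proportion of alleles shared identical by state.

    g1, g2: genotype lists coded as allele counts (0, 1, 2);
    None marks a missing call, and such markers are skipped.
    """
    shared, n_markers = 0.0, 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 1.0 - abs(a - b) / 2.0
        n_markers += 1
    if n_markers == 0:
        raise ValueError("no markers genotyped in both individuals")
    return shared / n_markers


def ibs_distance_matrix(genotypes):
    """Symmetric matrix of IBS distances (1 - IBS) for all pairs."""
    k = len(genotypes)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            d = 1.0 - ibs_similarity(genotypes[i], genotypes[j])
            dist[i][j] = dist[j][i] = d
    return dist


def variance_explained(eigenvalues):
    """Each eigenvalue divided by the sum of all eigenvalues."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]
```

The distance matrix would then be passed to a classical MDS routine (in R, for example, `cmdscale`), whose eigenvalues feed `variance_explained`.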

3.2. Model-based clustering of individuals: Structure, Admixture, and Geneland

Population structure in the genome-wide SNP data was also assessed by clustering the individuals with Structure software version 2.2 (Falush et al. 2003) using the admixture model with 10,000 burn-ins and iterations and doing three separate runs for each number of clusters (K). Estimation of the most likely K was based on the run probabilities, on visual inspection of the resulting clusters, and on a specific statistic (Evanno et al. 2005). For Datasets 1 and 2, both admixture and non-admixture models and both correlated and non-correlated allele frequencies were used, with five runs for each K ranging from 1 to 5. For comparison, the data were also clustered with the Admixture software version 1.12 (Alexander et al. 2009; http://www.genetics.ucla.edu/software/admixture/). Assessment of the most probable K was based on the default five-fold cross-validation procedure and a minimum of ten runs for each K. Additionally, individuals in the smaller SNP and STR datasets (Datasets 1 and 2) were clustered with the software Geneland (Guillot et al. 2005, 2008; http://www2.imm.dtu.dk/~gigu/Geneland/), in which the analysis can also use geographical information of the samples. An independent Gamma prior with parameters (1,20) was placed on the drift coefficients, and Geneland was run 50 times for each dataset using 100,000 iterations and a burn-in of 60,000 iterations. Only the ten runs with the best mean posterior density were considered in the analysis.
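The statistic of Evanno et al. (2005), often called ΔK, relates the second-order change of the log probability of the data between successive K values to its variability across replicate runs. The following is a minimal sketch of how it is commonly computed from Structure output, assuming the second difference is taken on the run means; the input layout and the function name are hypothetical, and this is not the code used in the study.

```python
from statistics import mean, stdev


def delta_k(log_likelihoods):
    """Evanno-style delta K from replicate Structure runs.

    log_likelihoods: dict mapping K to a list of ln P(D) values,
    one per replicate run (at least two runs per K).
    Returns a dict K -> delta K for every K that has both
    neighbours K-1 and K+1 available.
    """
    means = {k: mean(values) for k, values in log_likelihoods.items()}
    result = {}
    for k in sorted(means):
        if k - 1 not in means or k + 1 not in means:
            continue
        spread = stdev(log_likelihoods[k])
        if spread == 0:
            continue  # undefined when replicate runs agree exactly
        second_difference = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_difference / spread
    return result
```

Because the statistic needs both neighbours, it cannot be evaluated for the smallest and largest K tested, which is one reason to combine it with the run probabilities and visual inspection as described above.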

3.3. IBS distributions within and between populations

The amount of genetic variation within populations was inspected from their IBS distributions. The IBS distribution within a given population consists of the IBS similarities between those individual pairs in which both individuals belong to the population in question. Analogously, the IBS distribution between two populations consists of the IBS similarities for those pairs where one individual belongs to one of the populations and one to the other. The difference between the distributions was statistically tested by calculating the Mann-Whitney U test statistic for each pair of distributions, and comparing this observed value to an empirical distribution of expected values of the test statistic obtained by permuting the population labels 10,000 times. The significance levels were Bonferroni-corrected for the number of tests.
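A permutation scheme of this kind is needed because pairwise IBS values share individuals and are therefore not independent. The sketch below shows one way it could look for comparing the within-population IBS distributions of two populations. It assumes a precomputed similarity matrix, a two-sided comparison based on the deviation of U from its null expectation, and hypothetical function names; the details of the original implementation are not given in the text.

```python
import random


def mann_whitney_u(x, y):
    """U statistic of sample x versus sample y (ties count one half)."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def within_values(sim, labels, pop):
    """IBS similarities for all pairs in which both individuals are in `pop`."""
    idx = [i for i, lab in enumerate(labels) if lab == pop]
    return [sim[i][j] for n, i in enumerate(idx) for j in idx[n + 1:]]


def permutation_test(sim, labels, pop_a, pop_b, n_perm=10000, seed=0):
    """Empirical p-value for a difference between two within-population
    IBS distributions, obtained by permuting the population labels."""
    rng = random.Random(seed)

    def statistic(labs):
        x = within_values(sim, labs, pop_a)
        y = within_values(sim, labs, pop_b)
        expected = len(x) * len(y) / 2.0  # U under no difference
        return abs(mann_whitney_u(x, y) - expected)

    observed = statistic(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)
```

A Bonferroni correction would then multiply each p-value by the number of population pairs tested (capped at 1).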

3.4. SNP-based inspection of eastern admixture

We examined the possible eastern contribution to the Finnish autosomal gene pool from a SNP-wise perspective. If our study populations were all to stem from the same proto-European population (approximated by the CEU frequencies) and would have diverged from it in the absence of (eastern) admixture, their allele frequencies might have been subject to varying amounts of genetic drift, but the number of allele frequencies drifting toward a given direction (for example towards the Asian frequencies) should be similar across populations. We therefore counted, for each study population (eastern and western Finland, Sweden, Germany, and the UK), the number of markers in which allele frequency deviated from the CEU frequency in the same direction as did the CHB+JPT frequency, and the number of markers in which the allele frequency deviated in the opposite direction. These numbers were then compared to the null hypothesis that the proportion of markers deviating toward the Asian direction is the same in all five populations by use of a 2 x 5 chi-square test.
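The counting and testing steps can be sketched as follows. This is an illustrative reconstruction rather than the study's code: markers with no deviation in either the study population or the Asian reference are simply skipped (an assumption), the p-value uses the closed-form survival function of the chi-square distribution with 4 degrees of freedom (appropriate for a 2 x 5 table), and all names are hypothetical.

```python
import math


def count_directions(study_freqs, ceu_freqs, asian_freqs):
    """Count markers deviating from CEU toward or away from the Asian frequency."""
    toward = away = 0
    for f, c, a in zip(study_freqs, ceu_freqs, asian_freqs):
        d_study, d_asia = f - c, a - c
        if d_study == 0 or d_asia == 0:
            continue  # no direction defined; skipped in this sketch
        if d_study * d_asia > 0:
            toward += 1
        else:
            away += 1
    return toward, away


def chi_square_2xk(table):
    """Pearson chi-square for a 2 x k table.

    table: list of (toward, away) count pairs, one per population.
    Returns (statistic, degrees_of_freedom).
    """
    col_totals = [t + a for t, a in table]
    row_totals = [sum(t for t, _ in table), sum(a for _, a in table)]
    n = sum(col_totals)
    stat = 0.0
    for (t, a), col in zip(table, col_totals):
        for observed, row in ((t, row_totals[0]), (a, row_totals[1])):
            expected = row * col / n
            stat += (observed - expected) ** 2 / expected
    return stat, len(table) - 1


def chi2_sf_df4(x):
    """P(X > x) for a chi-square variable with 4 degrees of freedom."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)
```

With five populations, `chi_square_2xk` returns 4 degrees of freedom, so `chi2_sf_df4` gives the corresponding p-value directly.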

3.5. Distances between populations: FST and allele frequency differences

The FST calculations for the genome-wide data were done in Arlequin version 3.11 (Excoffier et al. 2007), with significance levels based on 5000 permutations. Differences in allele frequency between pairs of populations (sampled to be of equal size) were calculated in Plink (http://pngu.mgh.harvard.edu/purcell/plink, Purcell et al. 2007) using a chi-square test with one degree of freedom. The level of differences observed was then compared to the level expected by chance by calculating the means of the smallest 90% of the chi-square test statistics both in the observed and expected distributions, and taking the ratio of these means as in Clayton et al. (2005).
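The ratio described above can be sketched in a few lines. The sketch assumes that the expected distribution is the theoretical chi-square distribution with one degree of freedom, approximated by its quantiles at (i - 0.5)/n; the text does not specify how the expected values were obtained, so this choice and the function names are assumptions.

```python
from statistics import NormalDist, mean


def expected_chi2_1df(n):
    """Ascending expected values of n chi-square (1 df) statistics,
    approximated by quantiles at (i - 0.5) / n."""
    normal = NormalDist()
    # If Z is standard normal, Z**2 is chi-square with 1 df, so the
    # p-quantile of chi-square(1) equals inv_cdf((1 + p) / 2) squared.
    return [normal.inv_cdf((1.0 + (i - 0.5) / n) / 2.0) ** 2
            for i in range(1, n + 1)]


def inflation_ratio(observed, fraction=0.9):
    """Ratio of the means of the smallest `fraction` of observed
    and expected chi-square statistics."""
    obs = sorted(observed)
    exp = expected_chi2_1df(len(obs))
    keep = int(len(obs) * fraction)
    if keep == 0:
        raise ValueError("too few statistics for the chosen fraction")
    return mean(obs[:keep]) / mean(exp[:keep])
```

Restricting the means to the smallest 90% of the statistics keeps a handful of strongly differentiated markers from dominating the ratio, so that it reflects the overall level of difference between the two populations.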

3.6. Hierarchical population subdivision, AMOVA, and inbreeding

In Datasets 1 and 2, F-statistics from the R package Hierfstat (Goudet 2005) and Arlequin version 3.1 were used to apportion the genetic variation hierarchically into individual, , region, and country levels. An analysis of molecular variance (AMOVA) was performed using Arlequin. Calculation of p values was based on 10,000 replications for the F-statistics and on 20,000 permutations for the AMOVA results. In Dataset 5, the inbreeding coefficients of Swedish individuals relative to the total Swedish sample were calculated in Plink. The difference between the coefficients of individuals from Norrland, Svealand, and Götaland was tested with a Kruskal-Wallis rank-sum test in R. The significance levels were corrected for the number of pairwise tests by use of a Bonferroni correction.

3.7. Linkage disequilibrium

The extent of linkage disequilibrium (LD) in each population was calculated for 67,620 SNP pairs (all SNPs at a less than 500-kb distance from 5083 randomly chosen SNPs) using the expectation-maximization algorithm in Plink. As LD is dependent on sample size, all populations were first sampled to equal size. The LD distributions between populations were compared by use of a Wilcoxon signed rank test (i.e., a paired test) in R. The significance levels were Bonferroni-corrected for the number of pairwise tests.
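For unphased genotype data, the haplotype phase of double heterozygotes is unknown, which is why an expectation-maximization (EM) step is needed before LD can be computed. The sketch below shows the textbook two-SNP EM and the resulting r² value. It is an illustration under simple assumptions (biallelic SNPs, no missing data, a fixed number of iterations), not Plink's implementation, and the function names are hypothetical.

```python
def em_haplotype_freqs(counts, n_iter=100):
    """Estimate two-SNP haplotype frequencies by EM.

    counts[i][j]: number of individuals carrying i copies of allele A
    at SNP 1 and j copies of allele B at SNP 2 (a 3 x 3 table).
    Returns (pAB, pAb, paB, pab).
    """
    n = sum(sum(row) for row in counts)
    if n == 0:
        raise ValueError("empty genotype table")
    # Haplotypes that can be read off directly (all but double heterozygotes).
    base_ab_upper = 2 * counts[2][2] + counts[2][1] + counts[1][2]  # AB
    base_a_upper = 2 * counts[2][0] + counts[2][1] + counts[1][0]   # Ab
    base_b_upper = 2 * counts[0][2] + counts[0][1] + counts[1][2]   # aB
    base_lower = 2 * counts[0][0] + counts[0][1] + counts[1][0]     # ab
    double_het = counts[1][1]
    p = [0.25, 0.25, 0.25, 0.25]
    for _ in range(n_iter):
        cis = p[0] * p[3]    # double heterozygote is AB/ab
        trans = p[1] * p[2]  # double heterozygote is Ab/aB
        frac = cis / (cis + trans) if cis + trans > 0 else 0.5
        haps = [base_ab_upper + double_het * frac,
                base_a_upper + double_het * (1.0 - frac),
                base_b_upper + double_het * (1.0 - frac),
                base_lower + double_het * frac]
        p = [h / (2.0 * n) for h in haps]
    return tuple(p)


def r_squared(hap_freqs):
    """Squared correlation between alleles at the two SNPs."""
    p_ab, p_a_only, p_b_only, _ = hap_freqs
    p_a = p_ab + p_a_only
    p_b = p_ab + p_b_only
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    d = p_ab - p_a * p_b
    return d * d / denom
```

Estimates of this kind are upwardly biased in small samples, which is consistent with the equal-size sampling of populations described above.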

3.8. Correlation of genetic and geographic distances

The correlation between genetic and geographic distances was tested with the Mantel test from the R package ade4 (Dray & Dufour 2007). The geographic distances between pairs of individuals and pairs of were calculated as great-circle distances by using R package fields (Nychka 2007). As genetic distances, IBS distances between pairs of individuals (see above) were used for the genome-wide data, and pairwise FST statistics between counties calculated in Arlequin version 3.1 for Datasets 1 and 2.
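The two ingredients of this analysis, great-circle distances and a Mantel permutation test, can be sketched as below. The sketch is illustrative only: it uses the haversine formula with a mean Earth radius of 6371 km (the radius used by the fields package may differ), a one-sided test for positive correlation, and hypothetical function names rather than the ade4 or fields interfaces.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0  # assumed mean radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def _pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def _upper_triangle(matrix, order):
    """Upper-triangle entries after reordering rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(i + 1, n)]


def mantel(genetic, geographic, n_perm=999, seed=0):
    """Mantel correlation and one-sided permutation p-value."""
    n = len(genetic)
    identity = list(range(n))
    geo_vec = _upper_triangle(geographic, identity)
    observed = _pearson(_upper_triangle(genetic, identity), geo_vec)
    rng = random.Random(seed)
    order = identity[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # permute individuals, not single distances
        r = _pearson(_upper_triangle(genetic, order), geo_vec)
        if r >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations; shuffling individual distances would overstate significance.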

3.9. Tests of the subpopulation difference scanning (SDS) approach

Allele frequency differences between areas of high and low AMI incidence were calculated in Dataset 1, by comparison of subpopulations in Finland from central east (111 individuals) and central west (46 individuals) as well as central east and southwest (38 individuals). The result for the most differing allele per marker was recorded in each comparison. Similarly, the differences in the simulated data were recorded as the largest frequency difference in five initially equifrequent haplotypes between the two sampled daughter populations. Furthermore, the simulated frequency differences were used to determine the power of detecting a disease gene with a given frequency difference when using varying genomic exclusion thresholds (see Appendix A in Study I for details). The degree of disease incidence difference that various gene frequency differences would produce was then studied in a simple model of two subpopulations, each in HWE and differing only in their frequency of the disease-causing allele, which was assumed to be dominant with incomplete penetrance.
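The incidence model in the last sentence can be written out in a few lines. The sketch assumes that only carriers of the disease allele can be affected (no phenocopies), which the text does not state; the model in Study I may include further terms. Function names are hypothetical.

```python
def incidence(allele_freq, penetrance):
    """Incidence in a population in HWE for a dominant disease allele.

    Under HWE the carrier frequency is 1 - (1 - q)**2, and each carrier
    is affected with probability `penetrance` (no phenocopies assumed).
    """
    carrier_freq = 1.0 - (1.0 - allele_freq) ** 2
    return penetrance * carrier_freq


def incidence_difference(freq_high, freq_low, penetrance):
    """Incidence difference between two subpopulations that differ only
    in the frequency of the disease-causing allele."""
    return incidence(freq_high, penetrance) - incidence(freq_low, penetrance)
```

For example, allele frequencies of 0.3 and 0.1 with a penetrance of 0.2 correspond to carrier frequencies of 0.51 and 0.19 and thus to an incidence difference of 0.064 under these assumptions.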

3.10. Enrichment analyses

The potential enrichment into Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways among the genome areas that showed a large regional frequency difference between eastern and western Finland was studied using WebGestalt (http://bioinfo.vanderbilt.edu/webgestalt/) (Zhang et al. 2005). The 11,615 SNPs in which either homozygote had a frequency difference above 0.135 in Dataset 3 served as input. Of these SNPs, 4,116 were unambiguously mapped to 1,810 unique Entrez IDs. All pathways with at least two of these genes were tested. The p-value cutoff used (after multiple testing correction with the Benjamini and Hochberg method) was 0.05.

A list of SNPs associated with cardiovascular phenotypes (with p < 1 × 10⁻⁵) was retrieved from the NHGRI Catalog of Published Genome-Wide Association Studies (http://www.genome.gov/gwastudies, accessed 01/01/2012). We then estimated how large an incidence difference each of these 248 SNPs could explain. For simplicity, it was first assumed that the frequency difference in the GWAS SNP would equal that observed in a nearby SNP in Dataset 3; the largest of the differences in both homozygote frequencies and the allele frequency was used. The increased disease risk caused by the GWAS SNPs was calculated based on the reported OR. As the risk depends on allele frequencies, the maximal risk increase across all frequencies was used to obtain an initial estimate. The estimate was then refined for those SNPs which could explain an incidence difference of at least 1% (or 0.2% when the SNP itself was present in Dataset 3) in the initial calculation and which originated from GWASes reporting the risk allele and conducted mainly on European or European-ancestry populations. When the same or a nearby SNP was reported by multiple studies, the best reported OR was used.
The risk increase of the allele was calculated based on the allele frequencies reported in the paper, and if the SNP had not been genotyped in Dataset 3, its allele frequencies in the high and low incidence regions were estimated based on HapMap (phase 2, release 24) CEU haplotype frequencies obtained from Haploview version 4.1 (Barrett et al. 2005).

Results

1. Population structure in Finland and Sweden

1.1. Model-free clustering of individuals: multidimensional scaling

When a multidimensional scaling (MDS) analysis was performed on the genome-wide data of north Europeans (Datasets 3 and 4 complemented with Germans and Russians), the individuals clustered mainly according to their area of origin (Figure 12A). The first dimension showed an east-west gradient from eastern Finns through western Finns and Swedes to Germans, and the second dimension revealed a difference between northern Swedes and others. Svealand and Götaland individuals behaved very much like Germans until the third dimension, and the few Russian references were located between this cluster and the western Finns. Overall, the different populations did not form clearly separate clusters. Genetic and geographic distances generally corresponded to each other, except for eastern Finns and northern Swedes who exhibited longer genetic than geographic distances.

In an MDS analysis of Finns and Swedes only (Figures 12B and 12C), the two first dimensions formed a pattern very similar to the one that had the German and Russian reference populations included. A province-based coloring of the samples showed that the Finns with closest affinity to Swedes were Swedish-speaking , whereas of the Swedes, Götaland individuals tended to be furthest away from the Finns. Notably, the Swedish individuals from the northernmost province, Norrbotten, were not particularly close to the geographically closest Finnish province, Northern Ostrobothnia. Samples from different provinces, especially within southern Sweden, showed high degrees of overlap.

An MDS analysis of the Swedes (Figure 12D and Figure 2C in Study IV) showed a north-south gradient in the first dimension, while the second dimension further differentiated, within Norrland, between Norrbotten and Västerbotten. The samples without geographical information behaved mainly like the northern samples. The plot of Norrland individuals alone (Figure 2D in IV), colored according to their river valley of origin, showed a loose division into the south, middle, and north of Norrland. An analysis of Götaland and Svealand (Figure S5 in IV) revealed only a very subtle structuring, with the first dimension loosely differentiating between Svealand and Götaland, and the second dimension showing an east-west gradient within Götaland.

[Figure 12: MDS scatter plots, panels A to E, with the province color legend in panel C; axes are the 1st and 2nd MDS dimensions, each explaining less than 1% of the variance.]

Figure 12. Genetic distances between individuals visualized by multidimensional scaling (MDS). Identity by state (IBS) distances in northern Europe (A), Sweden and Finland (B), Sweden (D), and Finland (E). Individuals are colored according to their country (A) or province (B-E) of origin. Axis labels indicate proportions of variance explained by each axis. Abbreviations: Germans (GER), Russians (RUS); others as in Table 1. Based on data from Study IV.

In an MDS analysis of the Finns (Figure 12E), the first dimension showed a division into eastern and western samples, with Häme as an intermediate province, and the second dimension revealed a spread within the eastern and western samples. This spread loosely corresponded to province origin, with a span from north (Northern Ostrobothnia and Kainuu) to south (Savo and Northern Karelia) in eastern Finland, and a difference between the southwestern provinces (Southwest Finland and Satakunta) and Ostrobothnia (Southern Ostrobothnia and Swedish-speaking Ostrobothnia) in western Finland.

Overall, the percentages of variation explained by the first MDS dimensions were very small, as is typical in an individual-based analysis, because most of the variation in the human species resides between individuals and is therefore not well captured by MDS or PCA methods. The loose divisions (as opposed to strictly differing clusters) may reflect the large individual differences, a clinal nature of the genetic differences between areas, and the limited resolution of the MDS, especially with these numbers of individuals.

1.2. Model-based clustering of individuals: Structure, Admixture, and Geneland

When north European individuals (cf. section 1.1) were analyzed with the Structure software (Figure 3 in IV), the analysis with K = 2 revealed a division between eastern Finns and all the other samples, while the analysis with K = 3 added a cluster dominated by the Swedes, K = 4 one dominated by northern Swedes, and K = 5 one dominated by Germans. In all these analyses, the western Finns and Russians tended to show notable proportions of ancestry in more than one cluster. Analyses with K = 6 and K = 7 did not bring out further differentiation between populations. While the likelihoods of clusterings appeared approximately equal across K, a specific statistic (Evanno et al. 2005) suggested the most likely number of clusters to be 2 or 6.

An Admixture analysis of the same dataset (with ca. 25,000 SNPs) produced a similar result (Figure 13), with the exception that K = 3 was the most likely solution (Table 2). Of the three clusters, one appeared central European and showed elevated frequencies also in southern Sweden, one was eastern Finnish, with intermediate frequencies in western Finland, and one was enriched in northern Sweden and low in eastern Finland, including the North Ostrobothnians geographically closest to northern Sweden (Figure 14).

A Structure analysis of the Finns alone (not shown) supported the existence of two clusters which largely corresponded to the eastern and western subpopulations: only 1.8% of the samples showed higher ancestry in the opposite cluster. In an Admixture analysis, even with an unpruned dataset of 201,011 SNPs, the most likely result was only one cluster (Table 2). A Structure analysis of the Swedes showed two clusters (an analysis with

[Figure 13: Admixture bar plots for K = 2, K = 3, K = 4, and K = 5; populations shown: FIE, FIW, NORR, SVEA, GOTA, SWEmix, GER, BRI, CEU, and RUS.]

Figure 13. Clustering of north European individuals by Admixture software. Each thin vertical line represents an individual, and colors denote proportions of ancestry in each of the K inferred clusters (2 to 5) dominated by eastern Finns (yellow), Swedes or southern Swedes (red), northern Swedes (green), Germans (light blue), and others (purple). Abbreviations: British (BRI), Utah residents with ancestry from northern and western Europe (CEU), Germans (GER), Russians (RUS); other abbreviations as in Table 1. Unpublished.

Table 2. Comparison of clustering results from Admixture and Structure.

Study population(s)      Northern Europe   Finland     Sweden
Admixture analysis
  SNPs (a)               25,000            201,000     60,000
  Number of runs         11                10          14 (c)
  CV error (b)
    K = 1                0.5955445         0.540306*   0.62188
    K = 2                0.5949245         0.540472    0.62185*
    K = 3                0.5948736*        nd          0.62199
    K = 4                0.5949291         nd          nd
Structure analysis
  Most likely K          2 or 6            2           2
  SNPs (a)               7,200             6,400       5,700

nd = not determined. (a) Approximate number of SNPs used in the analysis. (b) Cross-validation error for K clusters, averaged over runs; the smallest error in each column is marked with an asterisk. (c) CV error for K = 3 based on four runs.
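The choice of the most likely K from the Admixture cross-validation errors amounts to picking the smallest error per study population. The following sketch simply restates that rule on the values transcribed from Table 2; the variable and function names are illustrative only.

```python
# Cross-validation errors transcribed from Table 2 (undetermined K values omitted).
cv_error = {
    "Northern Europe": {1: 0.5955445, 2: 0.5949245, 3: 0.5948736, 4: 0.5949291},
    "Finland": {1: 0.540306, 2: 0.540472},
    "Sweden": {1: 0.62188, 2: 0.62185, 3: 0.62199},
}

def best_k(errors_by_k):
    """Return the K with the smallest cross-validation error."""
    return min(errors_by_k, key=errors_by_k.get)

best = {population: best_k(errors) for population, errors in cv_error.items()}
```

Note how small the differences between adjacent K values are, particularly for Sweden, where K = 2 is only marginally better than K = 1.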

[Figure 14: maps for clusters A, B, and C with longitude and latitude as axes; panels above the maps show CEU, BRI, GER, RUS, and SWEmix; legend: ancestry proportion from 0.0 to 1.0 and circle sizes for n = 1, 5, and 10.]

Figure 14. Local ancestries in three clusters (A-C) by Admixture software. Circles colored according to average proportion of ancestry of individuals originating from the location in question; circle sizes correspond to number of individuals per location. For comparison, panels above the maps depict the average ancestry proportions in five populations (circle size not to scale). Abbreviations as in Table 1 and Figure 13. Unpublished.

K = 3 yielded a clearly lower likelihood), with a clinally varying frequency from north to south and an enrichment in the province of Västerbotten (Figures S6 and 5A in IV). An Admixture analysis (based on ca. 60,000 SNPs) found two clusters to be a slightly more probable solution than one cluster (Table 2). The geographic trend in K = 2 was similar to that seen in the Structure analysis (E.S., unpublished results).

In the smaller dataset with 34 SNPs (Dataset 2), the Geneland software detected two clusters within Finland (Figure 3A in II). These mostly corresponded to the division between eastern and western Finland, although the border between the clusters lay somewhat further north and east than did the main geographical division of the birthplaces in the data. When Swedish individuals were added to the analysis (Figure 3B in II), Geneland found three clusters: eastern Finland, western Finland, and Sweden, with the exception that the Swedish-speakers from the Finnish Ostrobothnia clustered with the Swedes. In contrast, Structure found only one cluster in each of these datasets. In the STR dataset of Finns, Geneland and Structure found no substructure.

1.3. IBS distributions between populations

As MDS results may not be representative of the full genetic variation of populations, the distributions of IBS between populations were examined using the predefined population divisions. The populations of interest were compared to the HapMap populations, i.e., the IBS distributions to be compared consisted of the IBS similarities between pairs of individuals with one individual from the population of interest and the other from the HapMap population in question (see section 3.3 in Materials and Methods). As expected, the general level of similarity was lowest with the African (YRI), highest with the European (CEU), and intermediate with the East Asian (CHB and JPT) HapMap populations (Figure 15A). Notably, the similarities of the study populations with the East Asians varied relatively widely, with Germany and Götaland showing the lowest similarity and eastern Finland and Russia the highest, whereas the comparisons with CEU exhibited the opposite order (Bonferroni-corrected p < 0.015 for all distribution pairs, except Svealand vs. Germany being nonsignificant in the comparison with CEU, Svealand vs. Norrland with CHB and JPT, and Russia vs. eastern Finland and Götaland vs. Germany in both comparisons).

Interestingly, a similar comparison with the Russian reference population revealed a different pattern (Figure 15B), with Germany and especially western Finland showing the highest similarity, and Norrland and eastern Finland the lowest similarity with the Russians (Bonferroni-corrected p < 0.031 for western Finland vs. all populations except Germany, and for Germany vs. eastern Finland and Norrland).
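The construction of a between-population IBS distribution can be sketched as follows. The genotype coding (0, 1, or 2 copies of one allele, with None for a missing call) and the function names are assumptions made for this illustration; the exact computation used in the study is described in Materials and Methods.

```python
from itertools import product

def ibs_similarity(geno_a, geno_b):
    """Mean IBS similarity over SNPs genotyped in both individuals.

    Genotypes are allele counts (0, 1, 2); None marks a missing call.
    Per SNP, the similarity is 1 (identical), 0.5 (one allele shared), or 0.
    """
    shared, n = 0.0, 0
    for a, b in zip(geno_a, geno_b):
        if a is None or b is None:
            continue
        shared += 1.0 - abs(a - b) / 2.0
        n += 1
    if n == 0:
        raise ValueError("no SNPs genotyped in both individuals")
    return shared / n

def between_population_ibs(pop_a, pop_b):
    """IBS similarities for all pairs with one individual from each population."""
    return [ibs_similarity(x, y) for x, y in product(pop_a, pop_b)]

# Toy data: two study individuals and two reference individuals, four SNPs each.
study = [[0, 1, 2, 2], [0, 0, 2, 1]]
reference = [[0, 1, 2, 2], [2, 1, 0, None]]
similarities = between_population_ibs(study, reference)
```

The resulting list of pairwise values is what a curve in Figure 15A or 15B summarizes for one study population and one reference population.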

[Figure 15: distribution curves in panels A, B, and C, with IBS similarity on the x-axis and frequency on the y-axis; panel A groups the comparisons with YRI, CHB + JPT, and CEU; legend: NORR, SVEA, GOTA, FIW, FIE, GER, and RUS.]

Figure 15. Distributions of pairwise identity-by-state similarities between study populations and four HapMap populations (A) and Russians (B) and within study populations (C). Triangles denote the median of the distribution of the corresponding color. Each curve in A and B consists of similarities of individual pairs with one individual from the population denoted by curve color and the other from the HapMap or Russian population; analogously, in C, all pairs are depicted in which both individuals originate from the population in question. Abbreviations: Yoruba from Ibadan, Nigeria (YRI, n = 105); Han Chinese from Beijing, China (CHB, n = 78); Japanese from Tokyo, Japan (JPT, n = 84); Utah residents with ancestry from northern and western Europe (CEU, n = 101); Germans (GER, n = 492); Russians (RUS, n = 25); other abbreviations as in Table 1. Adapted from Study IV with permission.

1.4. SNP-based inspection of eastern admixture

The varying affinity to eastern Asians was also detectable in a SNP-wise analysis of Dataset 3: when the allele frequencies in the study populations were compared to those of CEU and CHB+JPT, the proportion of SNPs whose allele frequency deviated from the CEU frequency towards the eastern Asian frequency (instead of away from it) was significantly higher (p < 1 × 10⁻⁵) in Finland, especially in eastern Finland, than in the central European populations (Table S2 in III).
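The SNP-wise comparison can be illustrated with a small sketch that counts, for one study population, the SNPs whose frequency deviates from CEU in the direction of the East Asian frequency. The function names and the toy frequencies are hypothetical, and the one-sided sign test against a proportion of 0.5 is an assumption added for illustration; the study itself compared the proportions between populations.

```python
from math import comb

def toward_reference_counts(study, ceu, east_asian):
    """Count SNPs deviating from CEU toward the East Asian frequency.

    Returns (toward, informative); SNPs with no defined direction are skipped.
    """
    toward = informative = 0
    for p, c, e in zip(study, ceu, east_asian):
        if p == c or e == c:
            continue  # no deviation, or no direction to deviate in
        informative += 1
        if (p - c) * (e - c) > 0:
            toward += 1
    return toward, informative

def sign_test_p(successes, n):
    """One-sided exact binomial P(X >= successes) under a proportion of 0.5."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n

# Hypothetical allele frequencies at four SNPs.
study = [0.30, 0.55, 0.10, 0.42]
ceu = [0.20, 0.50, 0.20, 0.40]
east_asian = [0.60, 0.90, 0.50, 0.40]
toward, informative = toward_reference_counts(study, ceu, east_asian)
```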

1.5. Distances between populations: FST and allele frequency differences

In European populations, FST distances and geographical distances mostly corresponded, with the exception of eastern Finns, Sardinians, and Basques, who exhibited longer genetic distances than their geographic locations would imply (Table S2 and Figure S3A in IV). A similar pattern was evident in the general levels of allele frequency differences between populations (Table 3): eastern Finland differed the most, while Svealand and Götaland were relatively close to Germany and Great Britain, and western Finland was in fact closer to Norrland and Svealand than to eastern Finland. Within Sweden, the allele frequency differences between Svealand and Götaland were small, but from them to Norrland they were of the same magnitude or larger than those between Germany and Great Britain.

Table 3. Degree of allele frequency differences between pairs of populations.

λ       FIE    FIW    NORR   SVEA   GOTA   GER    BRI
FIE     1.00   1.71   2.59   2.62   2.91   3.08   3.30
FIW     1.71   1.00   1.56   1.52   1.70   1.82   2.05
NORR    2.59   1.56   1.00   1.12   1.20   1.36   1.46
SVEA    2.62   1.52   1.12   1.00   1.03   1.16   1.28
GOTA    2.91   1.70   1.20   1.03   1.00   1.13   1.21
GER     3.08   1.82   1.36   1.16   1.13   1.00   1.11
BRI     3.30   2.05   1.46   1.28   1.21   1.11   1.00

Measured as λ, the ratio of observed vs. expected chi-square test statistics; λ = 1 indicates no difference. Abbreviations: Germany (GER), Great Britain (BRI); other abbreviations as in Table 1. Reproduced from IV with permission.
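A ratio of observed to expected chi-square statistics of the kind reported in Table 3 can be sketched as below. The table note does not specify whether the ratio is based on medians or means; this sketch assumes the median-based convention familiar from genomic control, and the function names and input layout are hypothetical.

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-square distribution with 1 df

def allele_chi2(a1, n1, a2, n2):
    """Pearson chi-square (1 df) comparing allele counts a1 of n1 vs. a2 of n2."""
    total, alt = n1 + n2, a1 + a2
    if alt == 0 or alt == total:
        return 0.0  # monomorphic in both samples
    num = total * (a1 * (n2 - a2) - a2 * (n1 - a1)) ** 2
    den = n1 * n2 * alt * (total - alt)
    return num / den

def inflation_lambda(counts):
    """Median observed chi-square divided by its expectation under no difference.

    `counts` is a list of (a1, n1, a2, n2) tuples, one per SNP.
    """
    stats = [allele_chi2(*c) for c in counts]
    return median(stats) / CHI2_1DF_MEDIAN
```

Under this reading, a value such as 3.30 between eastern Finland and Great Britain means that the typical test statistic is more than three times larger than expected for two samples from the same population.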

On a province level, the FST values between most province pairs within eastern and western Finland were significant (Table S3 and Figure S3B in IV). In contrast, despite the larger sample sizes for many Swedish provinces, the FST values between province pairs within Svealand and Götaland were non-significant. In Norrland, the provinces of Norrbotten and Västerbotten differed significantly from each other and from the other provinces (Tables S3 and S4 and Figures S3B and S3C in IV).

1.6. Hierarchical population subdivision, AMOVA, and inbreeding

In Dataset 2, a hierarchical analysis of individual, county (i.e., province), region, and country levels in Finland and Sweden resulted in significant F-statistics except for F_country/total (Table 2 in II). Within Sweden, the genetic variation between counties and regions was non-significant (Table 1 in II), while F_individual/country and F_individual/county were significantly inflated (at 0.027 and 0.025, p < 0.0001; Table 2 in II), which may relate to an increased genotyping error rate due to whole-genome amplification of the template. Eastern and western Finland explained a small but significant portion (0.42%, p < 0.0001; Table 1 in II) of the genetic variation within Finland, and in the four-level analysis (Table 2 in II) both F_county/region and F_region/country were significant (p < 0.01). The latter was significant also in the STR data (Dataset 1; p < 0.05).

The level of inbreeding differed between the three Swedish regions (p < 0.0002), and was stronger in the north, especially in coastal Västerbotten (Figure 5B in IV).

1.7. IBS distributions within populations

The variation within populations was measured as the distribution of IBS similarities between all pairs of individuals originating from the same population (Figure 15C). The similarities were highest in eastern Finland and successively lower in western Finland, Götaland, Germany, Svealand, and Norrland (Bonferroni-corrected p < 0.034 for eastern Finland vs. all other populations, Norrland vs. Götaland and western Finland, and western Finland vs. Svealand). A similar pattern emerged in the smaller SNP dataset (Dataset 2), where the mean IBS in Finland was higher than in Sweden (0.650 and 0.641, respectively), and higher in eastern than in western Finland (0.656 and 0.649, respectively), indicating reduced heterozygosity in Finland, especially in eastern Finland. Within Sweden, the IBS similarities ranged from 0.640 in Götaland and 0.639 in Svealand to 0.637 in Norrland.

1.8. Linkage disequilibrium

The degree of linkage disequilibrium (LD) differed significantly between most of the populations (Bonferroni-corrected p < 0.002 for all pairwise comparisons, except Germany vs. Great Britain and Svealand vs. Götaland, which were nonsignificant). It was weakest in Germany and Great Britain, followed by Svealand and Götaland, then Norrland and western Finland, and was clearly strongest in eastern Finland (Figure S8 in IV).

1.9. Correlation of genetic and geographic distances

In Dataset 3, genetic distances between Finnish individuals correlated significantly with their geographic distances (r = 0.31, p < 10⁻⁶). In Sweden (Dataset 5), the correlation was significant for the whole country (r = 0.066, p < 0.0001) as well as regionally in Norrland (r = 0.164, p < 0.0001), Svealand (r = 0.011, p < 0.0001), and Götaland (r = 0.036, p < 0.0001). In the smaller SNP set (Dataset 2), the genetic and geographic distances between counties correlated significantly in Finland (r = 0.32, p = 0.046), and in Finland and Sweden combined (r = 0.39, p < 0.0003), whereas in Sweden and in the STR data (Dataset 1), the correlation was not significant (r = -0.014, p = 0.54 and r = 0.01, p = 0.49, respectively).
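Because the entries of a distance matrix are not independent, the significance of a correlation between two distance matrices is commonly assessed by permutation. The sketch below shows a Mantel-type test; that this is the test behind the p-values above is an assumption, as are the function names, the number of permutations, and the fixed random seed.

```python
import random
from math import sqrt

def upper_triangle(mat, order=None):
    """Upper-triangle entries of a square matrix, optionally after relabeling rows/columns."""
    n = len(mat)
    idx = list(range(n)) if order is None else order
    return [mat[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mantel(genetic, geographic, permutations=999, seed=0):
    """Correlation of two distance matrices with a one-sided permutation p-value."""
    rng = random.Random(seed)
    geo = upper_triangle(geographic)
    observed = pearson(upper_triangle(genetic), geo)
    order = list(range(len(genetic)))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        rng.shuffle(order)  # relabel individuals in the genetic matrix only
        if pearson(upper_triangle(genetic, order), geo) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)
```

With large numbers of individuals, even a weak correlation such as r = 0.011 can be highly significant, which is why the r values rather than the p-values carry the information about the strength of isolation by distance.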

2. Subpopulation difference scanning (SDS) approach

2.1. Population simulations

In the STR data (Dataset 1), 90% of the markers had allele frequency differences below 0.13 between eastern and southwestern, and below 0.15 between eastern and western Finland. Similar distributions of allele frequency differences occurred in the population simulations in which the population expanded from an initial size of 400 or 800 individuals to 10,000 individuals in 40 generations (Table 4). Judging from the simulations with expansion time of 20 generations and initial sizes of 100 and 800 individuals (Figure 3 in I), an initial size of ca. 300 could produce a relatively similar result in 20 generations – as could, obviously, a number of other parameter combinations that remain unsimulated.

Table 4. Frequency differences from various population simulations with initial haplotype frequency 0.2.

Initial size   Initial D’   Time (gen)   Sample size   HFD 90%   GFD 80%
100            0.5          20           100           0.19      0.27
100            0.7          20           100           0.19      0.23
100            0.5          40           100           0.25      0.33
100            0.7          40           100           0.25      0.29
400            0.5          40           100           0.16      0.21
400            0.7          40           100           0.16      0.19
800            0.5          20           200           0.10      0.13
800            0.7          20           200           0.09      0.12
800            0.5          40           200           0.12      0.15
800            0.7          40           200           0.12      0.14

Initial D’: average LD at 1-cM distances in the initial population. Time (gen): duration of exponential population growth (generations). HFD 90%: haplotype frequency difference threshold producing a 90% exclusion of the genome. GFD 80%: gene frequency difference yielding an 80% detection power of the gene. Results based on 100 simulations, except for the initial size of 400, which is based on 1000 simulations. Adapted from Salmela et al. J Med Genet 43: 590-597 (2006) with permission.
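The qualitative behavior in Table 4, where smaller founder populations and longer drift produce larger frequency differences, can be reproduced with a strongly simplified sketch. It is not the simulation design of Study I: it follows single unlinked alleles rather than haplotypes with LD, lets two daughter populations start from the same frequency and expand independently, ignores the final sampling step, and uses hypothetical function names and parameter values.

```python
import random

def binomial(rng, n, p):
    """Draw from Binomial(n, p) by summing Bernoulli trials (standard library only)."""
    return sum(rng.random() < p for _ in range(n))

def drift_trajectory(rng, p0, n_start, n_end, generations):
    """Final allele frequency after Wright-Fisher drift under exponential growth."""
    growth = (n_end / n_start) ** (1.0 / generations)
    p = p0
    for gen in range(1, generations + 1):
        size = round(n_start * growth ** gen)  # diploid individuals this generation
        p = binomial(rng, 2 * size, p) / (2 * size)
    return p

def frequency_difference_quantile(p0, n_start, n_end, generations,
                                  n_loci=200, quantile=0.9, seed=1):
    """Quantile of |frequency difference| between two independently drifting populations."""
    rng = random.Random(seed)
    diffs = sorted(
        abs(drift_trajectory(rng, p0, n_start, n_end, generations)
            - drift_trajectory(rng, p0, n_start, n_end, generations))
        for _ in range(n_loci)
    )
    return diffs[min(n_loci - 1, int(quantile * n_loci))]
```

For example, `frequency_difference_quantile(0.2, 100, 1000, 20)` estimates the difference below which 90% of loci fall in a small toy scenario; the full-size settings of Table 4 would need a faster binomial sampler than the one used here.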

The simulations were further used to study how different genotyping densities and sample sizes would affect the power to detect a true gene frequency difference, when detection is based on the haplotype frequency difference of nearby markers genotyped in samples from the daughter populations (Table 4). In the simulations most closely resembling the STR data, a gene with a frequency difference of at least 0.14 to 0.20 would have a nearby haplotype frequency difference exceeding the 90th percentile of the haplotype difference distribution, when assuming a 1-cM genotyping density and a sample size of 100 to 200. Thus, an exclusion of 90% of the genome based on such a genotyping would result in an 80% power of detecting (i.e., 20% risk of excluding) the correct gene.

How likely is it that a disease-causing gene would have a frequency difference of 0.14 to 0.20? The regional incidence difference caused by a gene is a product of the gene frequency difference between the regions and the risk increase caused by the gene. Acute myocardial infarction (AMI) shows yearly incidences in the range of 860 per 100,000 men aged 35 to 84 in the high-incidence areas and 710 in the low-incidence areas of Finland (Havulinna et al. 2008). Part of this difference may obviously be caused by regional differences in non-genetic risk factors. Let us assume that the lifetime genetic incidence difference for AMI caused by a single dominant, incompletely penetrant gene would be for example 0.04. Then the gene’s frequency difference would exceed the 0.20 calculated above (meaning that the gene would be detectable with a 90% exclusion of the genome at least 80% of the time) if the gene produced at most a 0.12 increase in disease risk (Figure 4 in I).

2.2. East-west differences in the genome-wide SNP dataset

In Dataset 3, the eastern Finnish individuals represent the high-incidence area and the western Finnish individuals the low-incidence area for AMI. When the genotype frequencies in these areas were compared, 90% of the SNPs had a frequency difference below 0.12 (E.S., unpublished results). As expected for differences caused by genetic drift and unrelated to a particular phenotype, the large population differences were much more scattered across the genome than were the typical differences in case-control association studies (Figure 16).

[Figure 16: absolute genotype frequency difference (y-axis, 0.0 to 0.4) plotted against chromosome (x-axis, chromosomes 1 to 22 and X).]

Figure 16. Genomic locations of differences between eastern and western Finland, with absolute genotype frequency differences plotted according to SNP location; chromosomes are shown in an alternating color. Unpublished.

We then analyzed whether genes in the genome areas showing the highest east-west difference (above 13.5%) were enriched in any functionally relevant categories. Indeed, the SNPs (n = 11,615) that had the highest east-west frequency differences showed significant enrichment in four KEGG pathways. Interestingly, three of these, arrhythmogenic right ventricular cardiomyopathy (p = 0.0004), hypertrophic cardiomyopathy (p = 0.0053), and dilated cardiomyopathy (p = 0.0198), were among the four pathways that KEGG categorizes as being related to cardiovascular diseases; only the viral myocarditis pathway was absent. The fourth enriched pathway was neuroactive ligand-receptor interaction (p = 0.0017) (E.S., unpublished results).

Based on the east-west differences in Dataset 3, we also calculated how large an incidence difference a subset of cardiovascular GWAS SNPs could explain. We used the OR reported by the GWAS and estimated allele frequencies from those in Dataset 3 and from the haplotype frequencies in HapMap CEU. The combined effect of variants from eight independent genome regions could explain a difference of ca. 1.3% in disease incidence between the high- and low-incidence regions (Table 5). In four of these regions, the SNPs (rs6455689, rs11556924, rs12618342, and rs17676758) harbored genotype frequency differences that were among the highest 1.2% in the dataset, and would thus be among the top candidates to be picked up by the SDS approach (E.S., unpublished results).

Table 5. Estimated incidence differences between high and low AMI incidence regions explained by a subset of SNPs associated with cardiovascular diseases in GWA studies.

[Table 5 body: columns from the GWA study (chr, SNP, allele, OR) and from the population dataset (SNP, allele, distance in kb), followed by risk frequency (east, west), explained risk increase, and calculated incidence difference (%). GWA study entries: rs599839, rs4665058, rs9818870, rs16893526, rs11556924, rs501120, rs8055236, and a haplotype on chromosome 6. Population dataset SNPs: rs599839, rs12618342, rs9818870, rs16893526, rs6455689, rs11556924, rs501120, and rs17676758. Combined incidence difference: 1.278%.]

Associated phenotype in all studies is coronary heart disease, except sudden cardiac arrest for SNP rs4665058. Haplotype formed by SNPs rs2048327, rs3127599, rs7767084 and rs10755578. GWA results from Samani et al. 2007, Wellcome Trust Case Control Consortium 2007, Erdmann et al. 2009, Trégouët et al. 2009, Schunkert et al. 2011, Arking et al. 2011, and Wild et al. 2011. Unpublished.

Discussion

1. Technical aspects

1.1. Quality control of genome-wide SNP data for population genetics

Even though genome-wide association study (GWAS) datasets have proven useful in population genetics, some of the standard quality control (QC) procedures developed for GWAS settings may be nonoptimal for population genetics use. For example, deviations from Hardy-Weinberg equilibrium (HWE) that can signal errors in genotype calling may also stem from population substructure. Likewise, applying fixed QC thresholds for sample heterozygosity or identity-by-state (IBS) relatedness will ignore the fact that these measures can vary because of population history rather than because of technical errors. In the presence of marked population structure, unnecessarily strict QC procedures can exclude markers and individuals with the largest genetic differences. This could lead to underestimating the strength of population structure. In a GWAS, in turn, unnecessarily strict QC would have a more indirect effect: a loss of power to detect an association.

On the other hand, QC with too lenient thresholds can accept samples or markers of inferior genotyping quality. In GWASes, the genotyping quality can be re-checked for the SNPs that show the strongest associations. Such subsequent checks will rarely be possible in the genome-wide analyses of population structure, which usually do not indicate particular loci.

In this study, the QC procedures tried to avoid some of the above pitfalls. When filtering for sample heterozygosities and IBS relatedness, we examined the distribution of the measures in each population and used population-specific thresholds. Similarly, the HWE checks were done for each population separately, to minimize the number of deviations that could originate from population structure. Overall, the QC procedures and the filtering thresholds we chose were stricter for markers than for individuals. In a population genetic study – contrary to a GWAS – the individuals are of primary interest, whereas single markers will have little effect on the results that are genome-wide averages. Moreover, the exact number of markers surviving QC is irrelevant for many of the analysis methods, because they are unable to handle more than a few thousand markers.

The GWAS data for different populations often originate from separate datasets genotyped on varying platforms or in several laboratories. Consequently, technical differences between the datasets may inflate the observed population differences. As long as the technical differences affect relatively few markers, their effects can be minimized by choice of analysis methods. For example, we clustered individuals with multidimensional scaling (MDS) instead of principal component analysis (PCA), because the former is somewhat less sensitive to single deviating markers. Additionally, in QC we compared genetically close populations from different datasets and removed clearly outlying markers.
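Why HWE checks are better done separately for each population can be shown with a small worked example. The genotype counts below are hypothetical and chosen so that each population is exactly in Hardy-Weinberg proportions; the function name is likewise illustrative, and the chi-square statistic is only one of several tests used for this purpose.

```python
def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 df) for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Two hypothetical populations, each exactly in HWE but with different allele frequencies.
east = (81, 18, 1)    # allele A frequency 0.9
west = (25, 50, 25)   # allele A frequency 0.5
pooled = tuple(a + b for a, b in zip(east, west))

chi2_east = hwe_chi2(*east)
chi2_west = hwe_chi2(*west)
chi2_pooled = hwe_chi2(*pooled)
```

The pooled sample shows a deficit of heterozygotes (68 observed against 84 expected) and a statistic above the conventional 3.84 threshold, although neither population deviates on its own. A marker like this, which differs strongly between subpopulations, would be removed by a pooled HWE filter even though it is correctly genotyped.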

1.2. Comparison of model-based methods of individual clustering

This study clustered individuals with three model-based software programs: Structure, Admixture, and Geneland. Analyses of the same dataset with multiple programs enabled comparisons of Structure vs. Admixture and Structure vs. Geneland.

The main trends detected by the Structure and Admixture softwares did not differ. It was obvious, however, that Admixture tended to infer a smaller number of clusters than did Structure, i.e., it had lower resolution. This was most visible in the analysis of Finns; even with a more than 30-fold number of markers, Admixture was able to infer only one cluster, whereas Structure found two. In the other populations, the reduction in resolution was smaller, possibly due to the larger number of individuals. Additionally, in the analysis of north Europeans, the clusters inferred by Structure appeared somewhat more separated than were those produced by Admixture (cf. Figure 3 in Study IV and Figure 13). Therefore, although the faster algorithm of Admixture allows the use of larger numbers of markers, this does not immediately translate into better performance.

The comparison of Structure and Geneland used a small dataset with 34 SNPs. Structure found only one cluster, which is unsurprising because previous studies have shown, for example, that Structure needs 60 or more random markers to correctly assign individuals to populations even on a continental level (Bamshad et al. 2003, Turakulov & Easteal 2003). Geneland, in turn, inferred two clusters in Finland and a third one in Sweden. Thus, the spatial prior of Geneland clearly improved the clustering power in this small dataset.

2. Population structure in Finland and Sweden

In our genome-wide analysis of population structure and substructure in northern Europe, Finns clearly differed from other populations. Eastern Finns also differed from western Finns – as much as western Finns differed from Swedes, and markedly more than Germans from British. The Finns also harbored a small but apparent eastern influence, which was pronounced in eastern Finland. Furthermore, the genetic diversity of eastern Finland was substantially reduced. In Sweden, the population of the northern part of the country differed from that of the rest of Sweden as well as from that of central Europe and showed an internal substructure, whereas the population of the southern part was relatively homogeneous and genetically close to Germans and other central Europeans.

These patterns, discussed in detail below, are generally consistent with earlier genetic research into Finns and Swedes. They are also compatible with the history of these populations, which is characterized by low population densities and a relatively limited gene flow. Overall, our data showed a general correlation between genetic and geographic distances which agrees with that of earlier studies of Europe (Lao et al. 2008, Novembre et al. 2008). Interestingly, however, the nature of this correlation varied regionally: in the north, a given geographic distance corresponded to a much longer genetic distance than in the south.

2.1. Comparison of Finland and Sweden

In our analyses, genetic substructure appeared stronger in Finland than in Sweden. The Finnish and Swedish datasets are not directly comparable, however, because of their different ascertainment. In the Finnish samples, geographic information consists of the grandparental birthplaces of the subjects, and the data therefore represent the historical rather than the contemporary population. For the Swedes, geographic information is the place of birth in a sample representing the contemporary population as a whole (i.e., including descendants of recent immigrants) (Study II), or is the place of residence in a sample of Swedish-born individuals (Study IV). Recent (post-WW2) trends of immigration, urbanization, and internal migration have likely reduced and blurred subpopulation differences in both countries. It is therefore impossible to say to what degree the stronger Finnish substructure in our data represents a temporal difference in population structure and to what degree current (or past) differences between these countries. In theory, the discrepancy between the datasets could be alleviated either by tracking the ancestry of the Swedish samples or by correcting the Finnish results for the degree of internal migration during recent generations.

Despite their geographic proximity, the northernmost Finnish and Swedish provinces studied (Northern Ostrobothnia and Norrbotten, respectively) were not close to each other genetically. Instead, the northern Swedes showed a closer genetic affinity to western Finns. This clearly contrasts with the correlation of genetic with geographic distances in Europe (Lao et al. 2008, Novembre et al. 2008). The likely explanation for this seemingly counterintuitive trend lies in the population history of Northern Ostrobothnia: apart from the very coastal areas, this province became inhabited during the 16th century population movement originating from eastern Finland (Keränen 2003, Pitkänen 2007). Our results suggest that the amount of recent Finnish genetic influence in northernmost Sweden has been rather restricted geographically or in magnitude, or both, despite contact across the border. This observation disagrees somewhat with earlier findings of increased links between the northern areas (Nylander & Beckman 1991, Kittles et al. 1998, Karlsson et al. 2006, Lappalainen et al. 2008, 2009). Admittedly, the disagreement may in part stem from the fact that in our study the Finnish areas directly adjacent to the border remained unsampled. Meanwhile, the genetic affinity of northern Swedish to western Finnish areas may reflect the apparently more powerful contacts around and across the Gulf of Bothnia ever since ancient times (Siiriäinen 2003a; see also sections 5.1 and 6.1 of Review of the Literature).

The Finnish province genetically closest to the Swedes was Swedish-speaking Ostrobothnia. Although the small sample size (20 individuals) limits conclusions, this result is well in line with historical records of Swedish immigration to the area in the early Middle Ages. It is also compatible with results depicting this area as genetically intermediate between Sweden and the rest of Finland (Virtaranta-Knowles et al. 1991). Obviously, exact estimates of admixture proportions may vary between studies, depending on the origin of both the Swedish and especially the Finnish reference samples (cf. section 2.2.2). In our data, a strong admixture of the Swedish-speakers with the Finnish-speaking population seemed evident. For example, in the multidimensional scaling (MDS) analyses, Swedish-speaking Ostrobothnians differed from their Finnish-speaking neighbors (Southern Ostrobothnians) only when Swedes were included in the analysis. This underlines both the close relations of these populations and the importance of including relevant reference populations in MDS. Conversely, genetic distances to Sweden were relatively long: Swedish-speaking Ostrobothnia appeared approximately as far from the Swedish provinces as did the two northernmost – and most differing – provinces of mainland Sweden (Norrbotten and Västerbotten).

2.2. Finland

2.2.1. Eastern influence Most of the features of the Finnish genetic structure could be explained by a scenario in which a gene pool of European origin is subject to genetic drift, both in founder effects during the gradual inhabitation of Finland and in subsequent small and isolated breeding units. However, the Finnish autosomal gene pool also carried signs of a small but evident contribution from the east, visible as an increased genetic affinity of the Finnish samples to the East Asian HapMap reference populations (from China and Japan), and it also emerged in a SNP-based analysis. This influence was stronger in eastern than in western Finland. Eastern elements have earlier been detected in the Finns, most con- spicuously in the very high frequency of the Y-chromosomal haplogroup N3 (Guglielmino et al. 1990, Lahermo et al. 1999, Meinilä et al. 2001, Lappalainen et al. 2006; see also section 5.3.2 of the Literature). In our data, this effect appeared to be much smaller. The difference between the Y chromosome and the autosomes may result from random factors (like stronger genetic drift in the Y chromosome), or it may signal a male in related population-history events. Moreover, it can reflect the Far Eastern reference population used; comparisons with eastern Finno-Ugric or other eastern European or northern Asian reference populations may have shown larger effects. Unfortunately, very few eastern reference populations were publicly available at the time of the original analysis. Interestingly, signs of contacts between -Ural populations and East Asians (Japanese) have been observed in X-chromosomal markers (Laan et al. 2005). While another eastern reference population, the Russians, also showed an increased affinity to East Asians, no such increase was apparent between the Russians and the (eastern) Finns. 
Provided that the Russian samples (from Vologda) are a representative reference, this suggests that Finland and Russia have separate histories of the admixtures occurring after the eastern influence. The eastern influence in Finland would thus predate the arrival of ancestors of Russians in their current areas, which occurred at the end of the first millennium AD. In view of archaeological knowledge of ample contacts from Finland towards the east, possible timings for the influence still abound, especially as the varying degree of eastern influences within Finland may stem from later phenomena. Furthermore, some part of the influence in Finns could result from contacts with the Saami, who are known to harbor a relatively high frequency of eastern elements (Guglielmino et al. 1990, Johansson et al. 2008, Huyghe et al. 2011).

Earlier studies have provided estimates of the degree of eastern influence in the Finnish gene pool (Nevanlinna 1984, Guglielmino et al. 1990), but we did not attempt to quantify it. It seemed superfluous given the limited relevance of the available reference population: a comparison with East Asians would likely produce only a minimum estimate of the percentage of eastern influence (see above), and it would be very difficult to know how seriously this would underestimate the “true” proportion. Additionally, such estimates can be highly dependent on the method and software used for the inference (for an example of Admixture vs. Structure, see Figure 1 in Alexander et al. 2009). Moreover, regardless of how carefully the researchers themselves state the factors limiting the validity of their estimates, the numbers often tend to get cited without any such disclaimers.

2.2.2. East-west difference

In terms of the difference between eastern and western Finland, the results of this study are compatible with the duality of the Y chromosome (Kittles et al. 1998, Raitio et al. 2001, Hedman et al. 2004, Lappalainen et al. 2006, Palo et al. 2007, 2008) but discordant with the homogeneous pattern of mtDNA (Vilkki et al. 1988, Hedman et al. 2007). They also agree with the results of later genome-wide analyses (Jakkula et al. 2008, McEvoy et al. 2009), and with the distributions of many of the Finnish disease heritage (FDH) diseases (Norio 2003a).

The differences between the population histories of the two regions are many. They start from the different direction of predominant contact already in ancient times, and extend to the much shorter agricultural history in the east which involves events of extreme genetic drift (see section 5.1 in the Literature). The drift-related phenomena that have taken place during Finnish population history are the standard explanation for the existence and distribution of FDH diseases. However, these diseases are caused by extremely rare alleles which are prone to drift, whereas drift may not be the only or main mechanism playing a role in the distribution of more common variants. Admittedly, the common variants have not been immune to the drift either, as demonstrated by the reduced genetic diversity detected in eastern Finland in this and other studies (Raitio et al. 2001, Hedman et al. 2004, Lappalainen et al. 2006, Service et al. 2006, Jakkula et al. 2008). Nevertheless, the earlier events of population history, especially the higher degree of European (mainly Baltic and Scandinavian) influence in western Finland, may also serve to explain a part of the east-west difference observed.
Although the genetic difference between eastern and western Finland was large, it is impossible to say how sharp a genetic border exists between the areas, because of the sampling gap separating the eastern and western samples in our data. Later studies with differing sampling distributions have not suggested the existence of a strict border (Jakkula et al. 2008, McEvoy et al. 2009). In fact, it is likely that the difference is clinal: even if the original phenomena producing the difference had been strictly limited geographically, later developments such as migration and marriage across the (hypothetical) border would probably have blurred it. Nevertheless, the cline would be exceptionally steep, because the difference between the sampled areas is much larger than usually observed across similar distances, for example, in central Europe (Heath et al. 2008, Nelis et al. 2009).

Interestingly, while eastern and western Finns differed clearly in the FST and allele frequency analyses, their IBS similarity was much higher than the other measures would have led one to expect. This illustrates the fact that the sensitivity of these measures for various population genetic forces can differ. For example, genetic drift, which has been powerful in eastern Finland, may be more strongly reflected in the SNP-based results, whereas the shared history of eastern and western Finland may have led to allele frequency combinations that do not produce equally marked differences in the individual-based IBS measures (cf. Figure S4 in Study III).

The genetic substructure observed within Finland can obviously pose a problem for association studies, unless the substructure is accounted for. Naturally, methods for stratification correction may in part reduce spurious associations in datasets where the geographic distributions of cases and controls differ. Minimizing such differences already during study design (prior to genotyping) is nevertheless advisable. Whenever possible, efforts to collect Finnish samples should record information on parental or preferably grandparental origin, and in association studies, cases and controls should match at least on the level of their east/west ancestry. Moreover, because the genetic differences within Finland were larger than between many neighboring countries in central Europe (Heath et al. 2008, Nelis et al. 2009), in a wider context our results also emphasize how the geographical units relevant to association analysis sampling are deducible neither from national borders nor from perceived cultural homogeneity.
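The contrast between frequency-based and individual-based measures noted above can be explored with a minimal sketch. It is not the analysis used in the studies: it assumes a single biallelic SNP, two equally weighted subpopulations, Hardy-Weinberg proportions, and Wright's definition FST = (HT - HS)/HT. The function names are hypothetical and chosen for illustration only.

```python
def fst_two_pops(p1, p2):
    """Wright's F_ST at one biallelic locus for two equally weighted subpopulations."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # locus is monomorphic in the pooled population
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def genotype_probs(p):
    """Hardy-Weinberg probabilities of carrying 0, 1, or 2 copies of the allele."""
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]


def expected_ibs(p1, p2):
    """Mean number of alleles (0-2) shared identical by state between
    one random individual from each subpopulation."""
    total = 0.0
    for g1, prob1 in enumerate(genotype_probs(p1)):
        for g2, prob2 in enumerate(genotype_probs(p2)):
            total += prob1 * prob2 * (2 - abs(g1 - g2))
    return total
```

Evaluating both functions over a grid of frequency pairs gives a rough impression of how differently the two measures respond to the same allele frequency difference; the real analyses averaged over many SNPs and individuals, which this sketch does not attempt.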

2.2.3. Genetic vs. archaeological view of Finnish population history

The current archaeology-based view of Finnish population history supports general continuity since the early settlers and more or less frequent small-scale contact with neighboring populations rather than a few unequivocally detectable migration waves. This seems to conflict with the views frequently cited by geneticists that the Finnish population derives from major settlement events 2000 and 4000 years ago. Direct time estimates based on genetic data appear scarce, however, and the existing ones, because of their wide confidence intervals, may be associated with several archaeological events. Thus, they can hardly be used to argue for specific timepoints unsupported by non-genetic evidence.

Meanwhile, some indirect evidence of population age can perhaps be derived from calculations made in the course of FDH studies. In these calculations, the Finnish population has typically been assumed to be 2000 years old. Considering that this assumption more resembles the long-abandoned theory of an Iron Age population replacement than it does the current archaeological view of no major discontinuity, what may appear surprising is that the calculations have been very successful in predicting the location of disease genes. However, it is worth noting that such calculations correspond to forward-time population simulations in the sense that a fit to the observations does not automatically support the correctness of the parameters employed. Indeed, to my knowledge, no systematic studies exist on other combinations of population history parameters that could have yielded equally accurate predictions. Besides, the FDH alleles are rare, and the diseases are only a handful, and therefore may not form a representative picture of the forces that have affected the Finnish gene pool as a whole.

Moreover, even when precise and representative genetic timings exist, they do not necessarily equal the presence of migration waves. Rather, they can detect a reduction in population size or the start of a population expansion. These could of course reflect a migration or founder event, but could equally well result from a temporary reduction in the local population (a bottleneck), compensated for by local population growth rather than gene flow. In an extreme scenario, a population may go through a strong bottleneck in the absence of gene flow. As a result, the genes of the post-bottleneck population may have resided in the area for millennia, and still appear much younger. Conversely, if a migration involves relatively large numbers of individuals, the resulting genetic timings may greatly antedate the actual migration event.
Furthermore, genetic timing in itself does not directly reveal the geographic location of a genetic event. Such limitations of the genetic timing methods may appear obvious to a geneticist. They are often worth stating explicitly, however, when dealing with questions on Finnish population history that stimulate multidisciplinary interest. Otherwise, the comparisons of results from different disciplines may exhibit discrepancies that are, in fact, largely illusory.

2.3. Sweden

Although the overall genetic structure detected in Sweden was clinal, a difference was visible between southern and northern Sweden, and a substructure existed within the north. This agrees with earlier findings (Karlsson et al. 2006, Einarsdottir et al. 2007, Lappalainen et al. 2009) as well as with a later genome-wide study (Humphreys et al. 2011). Interestingly, a pronounced genetic substructure exists in northern Finland (Jakkula et al. 2008), suggesting the involvement of similar population history processes in both areas.

In southern and central Sweden (Svealand and Götaland), the population structure was clinal but subtle. The individuals were genetically relatively close to central Europeans, which is in congruence with previous results (see section 6.3.1 of the Literature for references). Unfortunately, the Svealand provinces of Värmland and Dalarna which have shown interesting features in several other studies (Karlsson et al. 2006, Lappalainen et al. 2009, Humphreys et al. 2011) could not be analyzed in detail because of small sample size (n = 21).

The difference between northern and southern Sweden is important from an association analysis perspective. While it is of the magnitude that should be possible to correct for, the difference might become problematic if genome-wide information is unavailable, for example in candidate gene or replication studies. Meanwhile, any sampling done within southern Sweden requires less rigorous matching. Obviously, obtaining ancestry information would still be important in order to exclude recent northern migrants. A more thorough analysis of the effects of the Swedish substructure on association studies has been reported elsewhere (Humphreys et al. 2011).

Norrland differed from the rest of Sweden in much the same manner as did eastern Finland from western Finland. Interestingly, Norrland did not exhibit the reduced genetic diversity which was very apparent in eastern Finland and which would be characteristic of a small and isolated population affected by genetic drift. Of course, drift-related phenomena may have had less extreme an effect in Norrland where the population is older, and has possibly been larger and more constant in size. Additionally, the presence of several subpopulations – as detected in this and other studies (Gustavson & Holmgren 1991, Einarsdottir et al. 2007 and references therein) – may increase the overall diversity if the subpopulations have independent histories of genetic drift.
Furthermore, the combination of normal diversity with pronounced LD suggests the possible involvement of admixture in simultaneously increasing both genetic variation and genetic distance from southern Sweden. Admixture obviously is not a far-fetched idea in the light of the known population history of Norrland, with its contact and assimilation with Finns and Saami (see section 6.3.2 in the Literature).

A trace of the increased eastern contribution observed in (eastern) Finland was also visible in northern and central Sweden (i.e., relative to southern Sweden). The origin and age of these elements remain open. In line with the relatively high frequencies of the Y-chromosomal haplogroup N3 in these areas (Karlsson et al. 2006, Lappalainen et al. 2009), they may reflect relatively recent contacts with the Finns (although in our dataset, these appeared to be rather limited in the north, cf. section 2.1) or Saami, or they may stem from ancient contacts, possibly with the same people who originally brought the influence to Finland. Later analyses of the fine-scale geographical distribution of these elements (for instance of their enrichment in the northern or coastal areas) could possibly shed more light on their origin.

In Study III of this thesis, Swedish reference individuals demonstrated an interesting spread in the second dimension of the MDS plot. We suspected that this was an indication of population substructure within Sweden, but lacking detailed geographic information, we could not study this directly. In an indirect analysis, we examined the original IBS distances between the Swedes, i.e., the input data of the MDS. Since we found no notable differences between the IBS distribution within the Swedish population and distributions within most other populations, we originally concluded that the spread pattern was probably some internal artefact of the MDS calculation. Later, however, in the analyses of Study IV, the Swedish individuals lacking geographical information behaved very similarly to the individuals representing northern Sweden. It thus appears that the former individuals may originate predominantly from the northern parts of Sweden and that the MDS spread in Study III already reflected the population substructure which was later detectable in Study IV. This underlines the ability of MDS to discern the small portion of total genetic variation between individuals that actually represents a geographic (or other systematic) pattern.

2.4. Limitations of the results

The representativeness of subjects is crucial in population genetics. This study used geographical (ancestry) information rather than ethnicity, with the exception that native language served as an additional classification criterion for the Swedish-speaking Ostrobothnians. The datasets had differing ascertainment criteria, which limits the comparisons between them (see section 2.1). Especially the Swedish datasets, with no ancestry information, may contain some recent migrants. Indeed, the analyses revealed several individuals potentially with an immigrant background. Furthermore, (sub)populations in the data were defined by geography rather than by interbreeding. However, any major discrepancy between the classifications based on administrative provinces and the actual genetic structure of the population should have manifested itself in the individual-based analyses.

Gaps in the geographic distribution of samples restricted the conclusions in two locations where large genetic differences were seen, namely between eastern and western Finland and between Sweden and Finland in the north. However, neither of these differences should be an artefact caused by the sampling gap alone. In the former case, more-continuous sampling would yield information on the exact location and clinality of the difference (see section 2.2.2), but genetic distances between the currently sampled areas would obviously remain the same. In the latter case, the country border also coincides with a boundary between two datasets genotyped with different arrays in separate laboratories. The observed difference is unlikely to result merely from a dataset difference, because the Swedish samples actually showed smaller genetic distances from other samples of the Finnish dataset (see section 2.1).

In many of the analyses, small sample size can limit resolution. This was the case at least to some degree in eastern and western Finland and in northern Sweden: larger sample sizes might have revealed fine-scale population substructures with higher precision. Within southern Sweden, on the other hand, the fact that we observed no strong population structure should not be due to immediate limitations by sample size, except in a few provinces.

Of the reference populations, the Russian sample was very small, which may have affected its behavior in the Structure and MDS analyses. However, its small size should not have affected the IBS distributions that underlie the relative timing of the eastern influence in Finland. Another question is of course how representative the Russian sample is of the Russian population as a whole, because it originated from only one area. The same applies to the German reference data that originated from one small area in northern Germany and are thus unlikely to reflect the whole variation in the population.

In the presence of a fine-scale population substructure, like the one seen in Finland (Nevanlinna 1972), the sampling distribution may substantially affect the results. For example, samples from very limited areas could differ more than do wider samples from the same regions (Figure 17). In our study, however, the size of the east-west difference should be largely unaffected by such phenomena, because our Finnish samples represent relatively wide geographic areas.
In fact, in comparisons of the east-west difference with the genetic distance between Germans and British, it was perhaps the latter that was more likely inflated, due to the limited geographic coverage of the German sample (see above).

Figure 17. Effect of geographic sampling scale on observed population structure. When frequencies of alleles (colored circles) vary locally, samples from small areas (orange rectangles) may differ more from each other than do samples representing larger areas (violet squares). [Figure not reproduced. Panel label: regional allele frequency. Values shown: 0.10, 0.70, 0.20; 0.70, 0.10, 0.20; 0.25, 0.25, 0.25, 0.25; 0.25, 0.25, 0.25, 0.25.]

Apart from the LD calculations, the analyses of this study use single SNPs, and in the case of genome-wide datasets, the results were typically averaged across the genome. This is a valid approach, but it may mask some interesting patterns of variation between genome areas. Furthermore, haplotype-based analyses could detect, for instance, migrational patterns more clearly than the single-SNP analyses can. Multimarker methods could also allow the timing of population events, whereas here the observations have been connected with population history phenomena only at the level of discussion. Unfortunately, few if any such methods were available at the time of our original analyses, and even now, many of the tools available are designed for use in more diverged populations, for example on a continental scale.

3. Subpopulation difference scanning (SDS) approach

3.1. The idea

In gene mapping, population stratification (i.e., population structure) can be a nuisance because it can produce spurious results in an association analysis. In contrast, other gene-mapping methods may be indifferent to population structure or may even benefit from it. This thesis introduces and theoretically tests a new mapping approach, subpopulation difference scanning (SDS), which directly utilizes population substructure to localize genes for diseases that differ in incidence between otherwise closely related subpopulations.

The underlying idea of the SDS approach is that when subpopulations diverge from each other by genetic drift (including founder effects), their allele frequencies will fluctuate, and varying amounts of frequency difference between the subpopulations will accumulate at the loci. A frequency difference gained by a disease-predisposing locus will lead, other things being equal, to differing incidences of the disease in the subpopulations (Figure 18). Conversely, a locus with no frequency difference cannot cause an incidence difference. Thus, loci underlying incidence differences could possibly be located by studying subpopulation frequency differences and by excluding from further analyses any genome areas that do not show a large enough difference between subpopulations.

Figure 18. Disease incidence in two subpopulations. The subpopulations (large squares) have a different frequency of carriers of a risk variant and therefore different proportions of healthy (white) and diseased (gray) individuals, even though penetrance of the risk variant and background risk for the disease in the subpopulations are equal. Note that the difference in disease incidence between the subpopulations is the carrier frequency difference times the additional risk caused by the variant. Adapted from Salmela et al. J Med Genet 43: 590-597 (2006) with permission. [Figure not reproduced. Panel labels: penetrance; other risks; carriers and non-carriers in each subpopulation.]
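The relation stated in the caption of Figure 18 can be checked with a small numerical sketch. It assumes that non-carriers fall ill with a background risk and carriers with that risk plus an additional risk, both equal in the two subpopulations. The function name and all numerical values are hypothetical illustrations, not estimates from this study.

```python
def incidence(carrier_freq, background_risk, additional_risk):
    """Expected disease incidence in one subpopulation.

    Non-carriers fall ill with probability background_risk, carriers with
    probability background_risk + additional_risk.
    """
    carriers = carrier_freq * (background_risk + additional_risk)
    non_carriers = (1.0 - carrier_freq) * background_risk
    return carriers + non_carriers


# Hypothetical values chosen only for illustration.
background_risk = 0.05
additional_risk = 0.10
freq_a, freq_b = 0.40, 0.20  # carrier frequencies in subpopulations A and B

diff = (incidence(freq_a, background_risk, additional_risk)
        - incidence(freq_b, background_risk, additional_risk))

# The background risk cancels out, leaving the carrier frequency
# difference times the additional risk.
product = (freq_a - freq_b) * additional_risk
```

Because the background risk cancels, the incidence difference in this simple model depends only on the carrier frequency difference and the additional risk, which is the property that SDS relies on.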

3.2. Theoretical testing, advantages, and limitations

We assessed the theoretical feasibility of this approach in the Finnish population, where differences between the eastern and western parts of the country exist both in genetic markers and in cardiovascular disease incidences (see sections 5.3.4 and 5.3.6 of the Literature). The extent of genetic differences was examined from a small set of autosomal STR markers, and population simulations resembling the Finnish population history allowed estimation of the sample sizes and genotyping densities necessary for successful gene localization. The simulations that produced subpopulation frequency differences similar to those observed suggested that a 90% exclusion of the genome could be achieved at a 20% risk of excluding the true gene, using a 1-cM genotyping density and samples of 100 to 200 individuals from the subpopulations, provided that the gene would have a frequency difference of at least 0.20.

In general, the SDS approach will work best for diseases with large incidence differences – or, to be more precise, for diseases in which a large absolute incidence difference results from a single gene. Furthermore, the smaller the effect that this gene has on disease risk, the easier it will be to locate. This may seem counterintuitive, bearing in mind that for both linkage and association analysis the trend is the opposite. However, it makes sense when remembering that the incidence difference is the product of the risk difference and the gene frequency difference (Figure 18). Thus, for a given incidence difference, a small risk difference will lead to a frequency difference that is large and therefore easier to detect. (How likely it is for an incidence difference to result from an allele with a small rather than a large effect is of course a different question, the answer to which depends on the extent of genetic drift.)
Conversely, even large incidence differences caused by single genes can be undetectable by SDS if the incidence difference is due to a large risk instead of a large frequency difference. It is therefore advisable to test the subpopulation frequencies of known genetic risk factors before proceeding to a full-scale SDS effort.

The target populations of SDS are limited to those in which the differences between the subpopulations are moderate. In populations with large differences, the gene’s frequency difference would be unlikely to be among the largest, whereas in subtly differing subpopulations, the accurate detection of differences would require large sample sizes. It seems that the east-west differences within Finland fall into the intermediate range. However, the Finnish population does not fully conform to the assumptions of the theoretical analysis in another respect: the simulations assume that the subpopulation differences stem purely from genetic drift, whereas the Finnish population also harbors a small, geographically variable eastern influence. We have not tested how this might affect SDS dynamics.
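The exclusion logic described above can be sketched in a few lines. This is an illustrative toy and not the simulation procedure of the original study: the marker names and frequencies are hypothetical, sampling error is ignored, and only the threshold of 0.20 is taken from the text.

```python
def required_frequency_difference(incidence_diff, additional_risk):
    """Frequency difference a single variant needs in order to account for
    the whole incidence difference, given the additional risk it confers."""
    return incidence_diff / additional_risk


def sds_candidates(freqs_a, freqs_b, min_diff):
    """Return the loci whose absolute allele frequency difference between two
    subpopulations is at least min_diff; all other loci are excluded."""
    return [locus for locus in freqs_a
            if abs(freqs_a[locus] - freqs_b[locus]) >= min_diff]


# Hypothetical allele frequencies at four markers in two subpopulations.
freqs_east = {"m1": 0.30, "m2": 0.55, "m3": 0.10, "m4": 0.60}
freqs_west = {"m1": 0.28, "m2": 0.25, "m3": 0.15, "m4": 0.85}

kept = sds_candidates(freqs_east, freqs_west, min_diff=0.20)

# For a fixed incidence difference, a smaller additional risk implies a
# larger, and hence more easily detected, frequency difference.
small_effect = required_frequency_difference(0.02, 0.05)
large_effect = required_frequency_difference(0.02, 0.10)
```

In this toy example only the markers with a difference of at least 0.20 remain for follow-up, and the variant with the smaller additional risk requires the larger frequency difference, in line with the reasoning above.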

Above, we have formulated and tested the SDS approach in the context of two subpopulations, but it could be extended to more than two. That would be a simple way to enhance the exclusion of genome areas, because the further analyses could be limited to loci whose frequency difference pattern is compatible with the incidence pattern in all of the subpopulations. Even then, many loci that are unrelated to the phenotype of interest will differ because of genetic drift alone. The number and percentage of false positives will therefore be considerably larger than in a typical association scan. It is thus essential to confirm the findings with, for example, case-control association analyses of selected markers. Alternatively, the confirmation could rely on existing GWAS data from the same or a closely related population. In fact, combining frequency difference information with GWAS data would provide a means of detecting a wider range of causal genes (in terms of their combination of frequency difference and risk) than would the SDS approach alone.

SDS looks only at subpopulation allele frequencies and not at the distribution of those alleles among healthy and diseased individuals. It is therefore independent of the decay of LD between the disease gene and its nearby markers after the frequency differences have formed. Most of the differences will arise from the heavy genetic drift during the early generations when LD is still strong, and a relatively sparse marker set will suffice to detect them. Nevertheless, the estimate of necessary genotyping density in the original study (1 cM) may be overly optimistic: the simulations did not account for the fact that not all markers will be equally informative for detecting the genetic drift. However, this factor should be amply compensated for by the current marker densities of genome-wide arrays. The immediate flip side of the strong initial LD is that the SDS resolution will remain limited. Notably, the essential LD in the present population may be much weaker, and therefore the marker density needed for the follow-up association analyses higher.

The original analyses of SDS performance were conducted before the GWAS era, and many of the aspects used to advertise SDS in the original publication have not turned out to be major obstacles in the GWASes. Several of the advantages of SDS nevertheless remain. Population sampling may be easier and cheaper than the laborious phenotyping needed to collect association study cases. Naturally, if GWAS control data with appropriate geographical information exist, they will be of immediate use. Additionally, once such a genotyping has been made, the results will be valid for any phenotype that shows the same geographic pattern, in contrast to the collections of cases for association studies that are strictly phenotype-specific.

3.3. Analyses of genome-wide population data

Since the original SDS analyses, we have genotyped a genome-wide set of SNPs in Finns who represent the areas of high and low acute myocardial infarction (AMI) incidence. These data can be compared to some of the original analysis assumptions. Firstly, in the SNP dataset, the high- and low-incidence areas had a slightly smaller allele frequency difference than in the STR data. With a shift in this direction, the performance of SDS should improve. Secondly, the SNP data can shed light on the number of loci that may contribute to the incidence difference. Our functional enrichment analysis suggested that relatively many of the SNPs with a large frequency difference are related to cardiovascular functions. Together, such loci might explain a substantial portion of the incidence difference. This could violate the basic requirement of SDS that a large incidence difference should be caused by a single gene.

On the other hand, several SNPs that have been associated with cardiovascular phenotypes in GWASes resided in genome areas that harbored very large frequency differences. These areas and GWAS SNPs would therefore be among those that an SDS analysis would select for further inspection. Naturally, this evidence of the performance of SDS is indirect, and the feasibility of the approach still needs to be demonstrated by other means, for example in a systematic case-control association testing of the most widely differing SNPs.

Concluding remarks and future prospects

In the past few years, genome-wide SNP datasets have firmly established themselves as a suitable and useful material for population genetic analyses. In this thesis, such data served for the first time in a study of population structure within Finland and Sweden. Additionally, the thesis introduced a novel gene mapping approach, subpopulation difference scanning (SDS).

The results paint a picture of two northern European populations that have long been characterized by small population sizes and relatively limited contact with neighboring populations. Through genetic drift, these factors have led to distinct genetic features and reduced diversity. Specifically, northern Sweden differed genetically from the rest of Sweden, in congruence with its history of small breeding units on the one hand and an admixture between Swedes, Finns, and Saami on the other. In Finland, a predominantly European gene pool was dotted with signs of an eastern influence, possibly of ancient origin. The genetic difference between the eastern and western parts of Finland was striking, albeit in line with long-standing regional differences in the extent of contact with neighboring populations, as well as with more recent features of the history of the eastern Finnish settlement. These east-west differences could serve in mapping genes for diseases whose incidences also vary within the country, but the feasibility of this SDS approach remains to be proven in practice by testing the most differing loci in an association setting.

It can be argued that in populations studied earlier, analyses of genome-wide SNP datasets have rarely revealed population patterns that would have been previously unseen or unexpected. The analyses have, however, yielded information on the strength of such patterns on a genome-wide level, i.e., in the parts of the genome directly relevant to the majority of phenotypes.
This study, for example, showed that an east-west difference within Finland, previously known from the Y chromosome but not from mtDNA, exists also in the autosomal markers. This demonstrates that this difference is important to acknowledge in association studies. Indeed, genome-wide datasets can effectively complement the frequently employed mtDNA and Y-chromosomal markers, which may be prone to overstate the importance of some population history events, especially of sex-specific or drift-related phenomena.

In the future, our understanding of the genetic history of human populations will undoubtedly increase, not least because of methodological and technical advances. The emergence of new datasets will enable more comprehensive analyses. In the case of Finland and Sweden, reference data from various populations in northern would be of special interest. New computational methods specifically developed for genome-wide SNP data can enhance their potential for inference and timing of demographic events such as migrations; in our datasets, such methods could obviously refine many of the current interpretations. The increasing use of sequencing will provide an unprecedented source of data also for population history studies, and help to circumvent some of the shortcomings of the SNP arrays, for example, the biases of SNP selection. Meanwhile, with constantly increasing sampling densities, the need for methods that detect and quantify clinal structures within continuous populations instead of measuring differences between predefined population units will become ever more pressing.

Acknowledgements

This study was carried out in the Finnish Genome Center (currently the Technology Center of the Institute for Molecular Medicine Finland (FIMM)), at the Department of Medical Genetics, and in the Molecular Medicine Research Program of the Research Programs Unit in the University of Helsinki. I thank the current and previous directors of these institutes for providing excellent research facilities for this study. My work was funded by the University of Helsinki, Graduate School in Computational Biology, Bioinformatics, and Biometry (ComBi), and a grant from the Sigrid Juselius Foundation through Professor Juha Kere. Other main supporters of the project have been the Academy of Finland, the Emil Aaltonen foundation, the Finnish Cultural Foundation and the Swedish Research Council. Detailed information on further sources of funding appears in the respective publications.

I thank my supervisors, Juha Kere and Päivi Lahermo, for their trust and support throughout this project. You have formed a perfect combination of inexhaustible optimism and ideas with down-to-earth realism on how things don’t always – or ever – go as planned and how everything tends to take longer than expected (and then some more). If there is anything I would want to criticize, then it would be your incredible patience. With a bit less of it, I might have had to finish somewhat closer to a standard schedule…

I thank my former Professor, Marja-Liisa Savontaus, for introducing me to my supervisors – and to , for that matter. Unfortunately the spatiotemporal overlap of our contributions in the project has been smaller than I would have wished; I would have had a lot to learn from you.

I am extremely grateful to my coauthor and closest collaborator Tuuli Lappalainen. We started by doing science on Skype (a practice which approximately doubled my typing speed) when our offices were separated by three floors, and our collaboration grew tighter than our feet could bear.
Later we actually shared an office and a countless number of variably scientific but invariably long and fulfilling face-to-face discussions that I still miss. Even after that glass ceiling in our office proved unable to stop you, the former means of communication has been sustained, this time across 2000 kilometers, though undoubtedly you would often have had more urgent things to do. Meanwhile, you have taught me a lot about science and how to do it, and it is hard for me to even imagine what would have ever become of this thesis project (let alone when!) without your contribution, enthusiasm, efficiency, determination, ideas, and the incredible number of working hours you invested in our papers.

Of my other coauthors, I thank Ulf Hannelius for adopting “my” microsatellite dataset and giving it a role in a larger context – and letting me use the result in my thesis. I hope it was a win-win deal. I acknowledge the late Olli Taskinen for creating the fanciest figure of my first-ever paper; his untimely death was a great loss. I thank my coauthors who provided the samples and datasets without which this study would have been impossible, in particular Pertti Sistonen, whose collection of Finnish subjects played a role in all papers of this thesis, and Per Hall on whose Swedish dataset the last paper stands. Naturally, I also thank all the sample-donors in those datasets – population simulations may be fancy, but they will take one only so far. I am grateful to all my coauthors for their contributions, comments and advice, insight, and for providing me views on different ways of doing successful science.

I also warmly thank the reviewers of this thesis, Professors Jaakko Ignatius and Marjo-Riitta Järvelin, for their constructive and valuable comments.

My heartfelt thanks go to the IT experts at FIMM/GIU for fixing my computer countless numbers of times in countless numbers of ways, to the staff at the Finnish Genome Center, especially Timo Miettinen, Riitta Koskinen, Anne Leinonen, Jouko Siro, Kari Tuomainen, and Kyösti Sutinen, for their help on various matters great and small, to laboratory technicians Sirkku Ekström, Anu Yliperttula, Päivi Hantula, Anitta Hottinen, Ahmed Mohammed, Leinonen, Auli Saarinen, Riitta Känkänen, and Riitta Lehtinen for their skillful assistance in the laboratory, to Anu Puomila for her invaluable help with the Finnish samples, to Ingegerd Fransson and Marika Rönnholm for their superb contribution in the Affymetrix genotyping, and to David Brodin at the Bioinformatics and Expression Analysis facility at Karolinska Institutet for much advice on the wonders of Affymetrix genotype calling and export.
I am grateful to Carol Norris for teaching me more than half of what I know about scientific writing (and an unknown portion of what I’ve already forgotten), and to Sami Kilpinen and Henrik Edgren for teaching me programming in R and Perl which have proved more than useful in this work. I thank Päivi Onkamo, Outi Vesakoski, and Jaakko Häkkinen for pointing out several interesting references to me. I am grateful to the Finnish daycare system and its main incarnations Marja and Leea for providing me the luxury of thinking more than one scientific thought consecutively.

To those archaeologists and linguists whose results I may have misunderstood or ignorantly overlooked, I offer my sincere apologies. Although as a kid I wanted to become an archaeologist, and as a teenager a linguist, studying human genetics has not miraculously turned me into either. (The same apologies of course apply to any omission of results from my fellow geneticists, but there I have many fewer excuses.) Special thanks to those of you who have, throughout the years, taken the trouble of correcting some of my misconceptions.

I thank the former and current members of Juha’s research group, including but not limited to Outi, Sini, Dario, Emma, Lena, Kata, Tiina, Nina, Riitta K., Riitta L., , Eira, Satu M., Satu W., Mari M., Mari T., Ville, Anna, Auli, Päivi A., Päivi S., Annika, Lilli, Inkeri, Johanna, and Yan. In the variable combinations of office mates and lunch company, you have formed my primary peer group and scientific social circle, supporting me through the oddities of (academic) life. Our projects rarely overlapped in mission or methodology, but the ups and downs of doing and publishing science are largely in common, and you have always been there to listen to me (which, as we all know, takes a lot of time). The same applies to the folks at the Finnish Genome Center, especially Anna and Riitta with whom I shared an office for many years, as well as Anna, Päivi T. and Timo – I think we had quite an effective and comforting “valituspiiri” (a circle of complaint), until too many of us got our theses ready and no longer had anything to complain of… Additionally, I thank Päivi Saavalainen and her celiac-disease group for letting me sit in their office last winter; even more vital to my mood and motivation than the seat by a window to the outside world was the company of Beta, Amarjit, and Dawit. Special thanks to Beta for sharing your insight into the inner workings of Haploview etc., and into various other issues related and unrelated to science – and for occasionally making my day by speaking a word or two of Icelandic. So, thanks, everyone – science wouldn’t be half as much fun without you guys!

I thank my friends who have lent me their ears (and occasionally also shoulders) in the smaller and larger storms that life includes, academically and in general.
Mysteriously, some of you are still here despite the non-returned calls, unreplied-to emails, and my general negligence in social obligations during the years and especially this last spring. So thank you Riikka, Johku, Hannu, Silvie, Yelena, Madhu, Jeremy, HJ, Outi, Inari, Biijá. And a separate thanks commonly to all my dancing friends for the dizzying polskas, breathtaking sottiisis and a few very divine hambos of these years – and for the combination of good company and bad jokes, as the definition has it.

I thank my parents for their love and support, and for the curiosity they gave me – I don’t know whether its transmission is genetic, epigenetic, or environmental, but I do know it comes in pretty handy in science.

I thank my son Otso for generously letting something as mundane as thesis writing rob him of so many effective hours that could instead have been devoted to building Legos and playing Kimble, and for providing such a wonderful counterbalance to science by exactly such activities. Thesis or no thesis, you are by far my most important achievement in life, and the amount of joy you have brought me is immense.

Last but not least, I thank my husband Jouni. It is not a small number of computer problems you have fixed, short programs you have written, and technical tips you have given for my work throughout these years. More importantly, when I stayed at work until 6, 8, or 10 PM (in February, March, and April, respectively), delving into the secrets of Uralic linguistics or R plotting, it was you who loaded the dishwasher, filled the fridge, and watched the kid – for the main reward of a few minutes in the company of a thoroughly stressed, sleep-deprived, and grumpy me. I know it takes a very special person to tolerate that – and that’s who you are. Eihän tästä kaikesta olisi tullut mitään ilman sinua. (None of this would have come to anything without you.)

In summary: I thank you all! :-)

Helsinki, August 2012

References

1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010 Oct 28;467(7319):1061-73.

Aalto-Setälä K, Viikari J, Åkerblom HK, Kuusela V, Kontula K. DNA polymorphisms of the apolipoprotein B and A-I/C-III genes are associated with variations of serum low density lipoprotein cholesterol level in childhood. J Lipid Res. 1991 Sep;32(9):1477-87.

Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009 Sep;19(9):1655-64.

Amaral PP, Dinger ME, Mercer TR, Mattick JS. The eukaryotic genome as an RNA machine. Science. 2008 Mar 28;319(5871):1787-9.

Analitis A, Katsouyanni K, Biggeri A, Baccini M, Forsberg B, Bisanti L, Kirchmayer U, Ballester F, Cadum E, Goodman PG, Hojs A, Sunyer J, Tiittanen P, Michelozzi P. Effects of cold weather on mortality: results from 15 European cities within the PHEWE project. Am J Epidemiol. 2008 Dec 15;168(12):1397-408.

Anderson S, Bankier AT, Barrell BG, de Bruijn MH, Coulson AR, Drouin J, Eperon IC, Nierlich DP, Roe BA, Sanger F, Schreier PH, Smith AJ, Staden R, Young IG. Sequence and organization of the human mitochondrial genome. Nature. 1981 Apr 9;290(5806):457-65.

Arking DE, Junttila MJ, Goyette P, Huertas-Vazquez A, Eijgelsheim M, Blom MT, Newton-Cheh C, Reinier K, Teodorescu C, Uy-Evanado A, Carter-Monroe N, Kaikkonen KS, Kortelainen ML, Boucher G, Lagacé C, Moes A, Zhao X, Kolodgie F, Rivadeneira F, Hofman A, Witteman JC, Uitterlinden AG, Marsman RF, Pazoki R, Bardai A, Koster RW, Dehghan A, Hwang SJ, Bhatnagar P, Post W, Hilton G, Prineas RJ, Li M, Köttgen A, Ehret G, Boerwinkle E, Coresh J, Kao WH, Psaty BM, Tomaselli GF, Sotoodehnia N, Siscovick DS, Burke GL, Marbán E, Spooner PM, Cupples LA, Jui J, Gunson K, Kesäniemi YA, Wilde AA, Tardif JC, O’Donnell CJ, Bezzina CR, Virmani R, Stricker BH, Tan HL, Albert CM, Chakravarti A, Rioux JD, Huikuri HV, Chugh SS. Identification of a sudden cardiac death susceptibility locus at 2q24.2 through genome-wide association in European ancestry individuals. PLoS Genet. 2011 Jun;7(6):e1002158.

Aulchenko YS, Ripke S, Isaacs A, van Duijn CM. GenABEL: an R library for genome-wide association analysis. Bioinformatics. 2007 May 15;23(10):1294-6.

Aulchenko YS, Struchalin MV, Belonogova NM, Axenovich TI, Weedon MN, Hofman A, Uitterlinden AG, Kayser M, Oostra BA, van Duijn CM, Janssens AC, Borodin PM. Predicting human height by Victorian and genomic methods. Eur J Hum Genet. 2009 Aug;17(8):1070-5.

Bamshad MJ, Wooding S, Watkins WS, Ostler CT, Batzer MA, Jorde LB. Human population genetic structure and inference of group membership. Am J Hum Genet. 2003 Mar;72(3):578-89.

Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011 Sep 27;12(11):745-55.

Barbujani G, Sokal RR. Zones of sharp genetic change in Europe are also linguistic boundaries. Proc Natl Acad Sci U S A. 1990 Mar;87(5):1816-9.

Barbujani G, Magagni A, Minch E, Cavalli-Sforza LL. An apportionment of human DNA diversity. Proc Natl Acad Sci U S A. 1997 Apr 29;94(9):4516-9.

Barnes M. Languages and ethnic groups. In: Helle K (ed). The Cambridge history of Scandinavia. Volume I: Prehistory to 1520. Cambridge University Press, Cambridge. 2003; 94-102.

Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005 Jan 15;21(2):263-5.

Bauchet M, McEvoy B, Pearson LN, Quillen EE, Sarkisian T, Hovhannesyan K, Deka R, Bradley DG, Shriver MD. Measuring European population stratification with microarray genotype data. Am J Hum Genet. 2007 May;80(5):948-56.

Baudou E. Forntiden fram till 1000-talet. In: Larsson HA (ed). Boken om Sveriges historia. Forum, Stockholm. 1999; 9-46.

Beckman L, Sikström C, Mikelsaar AV, Krumina A, Ambrasiene D, Kucinskas V, Beckman G. Transferrin variants as markers of migrations and admixture between populations in the Baltic Sea region. Hum Hered. 1998 Jul-Aug;48(4):185-91.

Benedictow OJ. Demographic conditions. In: Helle K (ed). The Cambridge history of Scandinavia. Volume I: Prehistory to 1520. Cambridge University Press, Cambridge. 2003; 237-49.

Bergman G. Kortfattad svensk språkhistoria. Norstedt, Stockholm. 2003; 255 p.

Bittles AH, Egerbladh I. The influence of past endogamy and consanguinity on genetic disorders in northern Sweden. Ann Hum Genet. 2005 Sep;69(Pt 5):549-58.

Burstedt MS, Sandgren O, Holmgren G, Forsman-Semb K. Bothnia dystrophy caused by mutations in the cellular retinaldehyde-binding protein gene (RLBP1) on chromosome 15q26. Invest Ophthalmol Vis Sci. 1999 Apr;40(5):995-1000.

Butler JM. Genetics and genomics of core short tandem repeat loci used in human identity testing. J Forensic Sci. 2006 Mar;51(2):253-65.

C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science. 1998 Dec 11;282(5396):2012-8.

Carpelan C. Käännekohtia Suomen esihistoriassa aikavälillä 5100...1000 eKr. In: Fogelberg P (ed). Pohjan poluilla: suomalaisten juuret nykytutkimuksen mukaan. Bidrag till kännedom av natur och folk 153. Finska Vetenskaps-Societeten, Helsinki. 1999; 249-80.

Cavalli-Sforza LL, Menozzi P, Piazza A. The history and geography of human genes. Princeton University Press, Princeton. 1994; 518 p.

Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K, Lander ES. Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33.

Clark AG, Hubisz MJ, Bustamante CD, Williamson SH, Nielsen R. Ascertainment bias in studies of human genome-wide polymorphism. Genome Res. 2005 Nov;15(11):1496-502.

Clayton DG, Walker NM, Smyth DJ, Pask R, Cooper JD, Maier LM, Smink LJ, Lam AC, Ovington NR, Stevens HE, Nutland S, Howson JM, Faham M, Moorhead M, Jones HB, Falkowski M, Hardenbol P, Willis TD, Todd JA. Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat Genet. 2005 Nov;37(11):1243-6.

Colbourne JK, Pfrender ME, Gilbert D, Thomas WK, Tucker A, Oakley TH, Tokishita S, Aerts A, Arnold GJ, Basu MK, Bauer DJ, Cáceres CE, Carmel L, Casola C, Choi JH, Detter JC, Dong Q, Dusheyko S, Eads BD, Fröhlich T, Geiler-Samerotte KA, Gerlach D, Hatcher P, Jogdeo S, Krijgsveld J, Kriventseva EV, Kültz D, Laforsch C, Lindquist E, Lopez J, Manak JR, Muller J, Pangilinan J, Patwardhan RP, Pitluck S, Pritham EJ, Rechtsteiner A, Rho M, Rogozin IB, Sakarya O, Salamov A, Schaack S, Shapiro H, Shiga Y, Skalitzky C, Smith Z, Souvorov A, Sung W, Tang Z, Tsuchiya D, Tu H, Vos H, Wang M, YI, Yamagata H, Yamada T, Ye Y, Shaw JR, Andrews J, Crease TJ, Tang H, Lucas SM, Robertson HM, Bork P, Koonin EV, Zdobnov EM, Grigoriev IV, Lynch M, Boore JL. The ecoresponsive genome of Daphnia pulex. Science. 2011 Feb 4;331(6017):555-61.

Conrad DF, Jakobsson M, Coop G, Wen X, Wall JD, Rosenberg NA, Pritchard JK. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat Genet. 2006 Nov;38(11):1251-60.

Coop G, Wen X, Ober C, Pritchard JK, Przeworski M. High-resolution mapping of crossovers reveals extensive variation in fine-scale recombination patterns among humans. Science. 2008 Mar 7;319(5868):1395-8.

Cox TF, Cox MAA. Multidimensional scaling. Chapman and Hall/CRC, Boca Raton. 2001; 308 p.

Dahlbäck G. The towns. In: Helle K (ed). The Cambridge history of Scandinavia. Volume I: Prehistory to 1520. Cambridge University Press, Cambridge. 2003; 611-34.

Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES. High-resolution haplotype structure in the human genome. Nat Genet. 2001 Oct;29(2):229-32.

de Koning AP, Gu W, Castoe TA, Batzer MA, Pollock DD. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 2011 Dec;7(12):e1002384.

De La Vega FM, Isaac H, Collins A, Scafe CR, Halldórsson BV, Su X, Lippert RA, Wang Y, Laig-Webster M, Koehler RT, Ziegle JS, Wogan LT, Stevens JF, Leinen KM, Olson SJ, Guegler KJ, You X, Xu LH, Hemken HG, Kalush F, Itakura M, Zheng Y, de Thé G, O’Brien SJ, Clark AG, Istrail S, Hunkapiller MW, Spier EG, Gilbert DA. The linkage disequilibrium maps of three human chromosomes across four populations reflect their demographic history and a common underlying recombination pattern. Genome Res. 2005 Apr;15(4):454-62.

Dermitzakis ET, Reymond A, Antonarakis SE. Conserved non-genic sequences - an unexpected feature of mammalian genomes. Nat Rev Genet. 2005 Feb;6(2):151-7.

Dray S, Dufour AB. The ade4 package: implementing the duality diagram for ecologists. Journal of Statistical Software. 2007;22:1–20.

Eberle MA, Rieder MJ, Kruglyak L, Nickerson DA. Allele frequency matching between SNPs reveals an excess of linkage disequilibrium in genic regions of the human genome. PLoS Genet. 2006 Sep 8;2(9):e142.

Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, Nadeau JH. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010 Jun;11(6):446-50.

Einarsdóttir K, Humphreys K, Bonnard C, J, Iles MM, Sjölander A, Li Y, Chia KS, Liu ET, Hall P, Liu J, Wedrén S. Linkage disequilibrium mapping of CHEK2: common variation and breast cancer risk. PLoS Med. 2006 Jun;3(6):e168.

Einarsdottir E, Egerbladh I, Beckman L, Holmberg D, Escher SA. The genetic population structure of northern Sweden and its implications for mapping genetic diseases. Hereditas. 2007 Nov;144(5):171-80.

ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, Giresi PG, Goldy J, Hawrylycz M, Haydock A, Humbert R, James KD, Johnson BE, Johnson EM, Frum TT, Rosenzweig ER, Karnani N, Lee K, Lefebvre GC, Navas PA, Neri F, Parker SC, Sabo PJ, Sandstrom R, Shafer A, Vetrie D, Weaver M, Wilcox S, Yu M, Collins FS, Dekker J, Lieb JD, Tullius TD, Crawford GE, Sunyaev S, Noble WS, Dunham I, Denoeud F, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S, Sandelin A, Hofacker IL, Baertsch R, Keefe D, Dike S, Cheng J, Hirsch HA, Sekinger EA, Lagarde J, Abril JF, Shahab A, Flamm C, Fried C, Hackermüller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, Washietl S, Korbel J, Emanuelsson O, Pedersen JS, Holroyd N, Taylor R, Swarbreck D, Matthews N, Dickson MC, Thomas DJ, Weirauch MT, Gilbert J, Drenkow J, Bell I, Zhao X, Srinivasan KG, Sung WK, Ooi HS, Chiu KP, Foissac S, Alioto T, Brent M, Pachter L, Tress ML, Valencia A, Choo SW, Choo CY, Ucla C, Manzano C, Wyss C, Cheung E, Clark TG, Brown JB, Ganesh M, Patel S, Tammana H, Chrast J, Henrichsen CN, Kai C, Kawai J, Nagalakshmi U, Wu J, Lian Z, Lian J, Newburger P, Zhang X, Bickel P, Mattick JS, Carninci P, Hayashizaki Y, Weissman S, Hubbard T, Myers RM, Rogers J, Stadler PF, Lowe TM, Wei CL, Ruan Y, Struhl K, Gerstein M, Antonarakis SE, Fu Y, Green ED, Karaöz U, Siepel A, Taylor J, Liefer LA, Wetterstrand KA, Good PJ, Feingold EA, Guyer MS, Cooper GM, Asimenos G, Dewey CN, Hou M, Nikolaev S, Montoya-Burgos JI, Löytynoja A, Whelan S, Pardi F, Massingham T, Huang H, Zhang NR, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Seringhaus M, Church D, Rosenbloom K, Kent WJ, Stone EA; NISC Comparative 
Sequencing Program; Baylor College of Medicine Human Genome Sequencing Center; Washington University Genome Sequencing Center; Broad Institute; Children’s Hospital Oakland Research Institute, Batzoglou S, Goldman N, Hardison RC, Haussler D, Miller W, Sidow A, Trinklein ND, Zhang ZD, Barrera L, Stuart R, King DC, Ameur A, Enroth S, Bieda MC, Kim J, Bhinge AA, Jiang N, Liu J, Yao F, Vega VB, Lee CW, Ng P, Shahab A, Yang A, Moqtaderi Z, Zhu Z, Xu X, Squazzo S, Oberley MJ, Inman D, Singer MA, Richmond TA, Munn KJ, Rada-Iglesias A, Wallerman O, Komorowski J, Fowler JC, Couttet P, Bruce AW, Dovey OM, Ellis PD, Langford CF, Nix DA, Euskirchen G, Hartman S, Urban AE, Kraus P, Van Calcar S, Heintzman N, Kim TH, Wang K, Qu C, Hon G, Luna R, Glass CK, Rosenfeld MG, Aldred SF, Cooper SJ, Halees A, Lin JM, Shulha HP, Zhang X, Xu M, Haidar JN, Yu Y, Ruan Y, Iyer VR, Green RD, Wadelius C, Farnham PJ, Ren B, Harte RA, Hinrichs AS, Trumbower H, Clawson H, Hillman-Jackson J, Zweig AS, Smith K, Thakkapallayil A, Barber G, Kuhn RM, Karolchik D, Armengol L, Bird CP, de Bakker PI, Kern AD, Lopez-Bigas N, Martin JD, Stranger BE, Woodroffe A, Davydov E, Dimas A, Eyras E, Hallgrímsdóttir IB, Huppert J, Zody MC, Abecasis GR, Estivill X, Bouffard GG, Guan X, Hansen NF, Idol JR, Maduro VV, Maskeri B, McDowell JC, Park M, Thomas PJ, Young AC, Blakesley RW, Muzny DM, Sodergren E, Wheeler DA, Worley KC, Jiang H, Weinstock GM, Gibbs RA, Graves T, Fulton R, Mardis ER, Wilson RK, Clamp M, Cuff J, Gnerre S, Jaffe DB, Chang JL, Lindblad-Toh K, Lander ES, Koriabine M, Nefedov M, Osoegawa K, Yoshinaga Y, Zhu B, de Jong PJ. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007 Jun 14;447(7146):799-816.

Erdmann J, Grosshennig A, Braund PS, König IR, Hengstenberg C, Hall AS, Linsel-Nitschke P, Kathiresan S, Wright B, Trégouët DA, Cambien F, Bruse P, Aherrahrou Z, Wagner AK, Stark K, Schwartz SM, Salomaa V, Elosua R, Melander O, Voight BF, O’Donnell CJ, Peltonen L, Siscovick DS, Altshuler D, Merlini PA, Peyvandi F, Bernardinelli L, Ardissino D, Schillert A, Blankenberg S, Zeller T, Wild P, Schwarz DF, Tiret L, Perret C, Schreiber S, El Mokhtari NE, Schäfer A, März W, Renner W, Bugert P, Klüter H, Schrezenmeir J, Rubin D, Ball SG, Balmforth AJ, Wichmann HE, Meitinger T, Fischer M, Meisinger C, Baumert J, Peters A, Ouwehand WH; Italian Atherosclerosis, Thrombosis, and Vascular Biology Working Group; Myocardial Infarction Genetics Consortium; Wellcome Trust Case Control Consortium; Cardiogenics Consortium, Deloukas P, Thompson JR, Ziegler A, Samani NJ, Schunkert H. New susceptibility locus for coronary artery disease on chromosome 3q22.3. Nat Genet. 2009 Mar;41(3):280-2.

Evanno G, Regnaut S, Goudet J. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol. 2005 Jul;14(8):2611-20.

Excoffier L, Laval G, Schneider S. Arlequin (version 3.0): an integrated software package for population genetics data analysis. Evol Bioinform Online. 2007 Feb 23;1:47-50.

Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003 Aug;164(4):1567-87.

Feuk L, Carson AR, Scherer SW. Structural variation in the human genome. Nat Rev Genet. 2006 Feb;7(2):85-97.

Finnilä S, Lehtonen MS, Majamaa K. Phylogenetic network for European mtDNA. Am J Hum Genet. 2001 Jun;68(6):1475-84.

François O, Currat M, Ray N, Han E, Excoffier L, Novembre J. Principal component analysis under population genetic models of range expansion and admixture. Mol Biol Evol. 2010 Jun;27(6):1257-68.

Gattepaille LM, Jakobsson M. Combining markers into haplotypes can improve population structure inference. Genetics. 2012 Jan;190(1):159-74.

Gilissen C, Hoischen A, Brunner HG, Veltman JA. Disease gene identification strategies for exome sequencing. Eur J Hum Genet. 2012 May;20(5):490-7.

Goudet J. Hierfstat, a package for R to compute and test hierarchical F-statistics. Mol Ecol Notes. 2005;5:184-6.

Guglielmino CR, Piazza A, Menozzi P, Cavalli-Sforza LL. Uralic genes in Europe. Am J Phys Anthropol. 1990 Sep;83(1):57-68.

Guillot G, Estoup A, Mortier F, Cosson JF. A spatial statistical model for landscape genetics. Genetics. 2005 Jul;170(3):1261-80.

Guillot G, Santos F, Estoup A. Analysing georeferenced population genetics data with Geneland: a new algorithm to deal with null alleles and a friendly graphical user interface. Bioinformatics. 2008 Jun 1;24(11):1406-7.

Guillot G, Foll M. Correcting for ascertainment bias in the inference of population structure. Bioinformatics. 2009 Feb 15;25(4):552-4.

Guldberg P, Henriksen KF, Sipilä I, Güttler F, de la Chapelle A. Phenylketonuria in a low incidence population: molecular characterisation of mutations in Finland. J Med Genet. 1995 Dec;32(12):976-8.

Gustavson KH, Holmgren G. Ärftliga sjukdomar som är vanliga i Sverige: “det svenska sjukdomsarvet”. Finska Läkaresällskapets Handlingar. 1991;151:194-8.

Gyllerup S, Lanke J, LH, Schersten B. Cold climate is an important factor in explaining regional differences in coronary mortality even if serum cholesterol and other established risk factors are taken into account. Scott Med J. 1993 Dec;38(6):169-72.

Gyllerup S. Cold climate and coronary mortality in Sweden. Int J Circumpolar Health. 2000 Oct;59(3-4):160-3.

Haataja L. Jälleenrakentava Suomi. In: Zetterberg S (ed). Suomen historian pikkujättiläinen. WSOY, . 2003; 728-831.

Häger BÅ. Frihetstiden och den gustavianska tiden 1718-1809. In: Larsson HA (ed). Boken om Sveriges historia. Forum, Stockholm. 1999; 181-218.

Häkkinen K. Mistä sanat tulevat: suomalaista etymologiaa. Tietolipas 117. Suomalaisen Kirjallisuuden Seura, Helsinki. 1990; 326 p.

Häkkinen K. Suomalaisten esihistoria kielitieteen valossa. Tietolipas 147. Suomalaisen Kirjallisuuden Seura, Helsinki. 1996; 224 p.

Häkkinen J. Kantauralin ajoitus ja paikannus: perustelut puntarissa. Journal de la Société Finno-Ougrienne. 2009;92:9-56.

Hamilton MB. Population Genetics. Wiley-Blackwell, Chichester. 2009; 407 p.

Hammar N, Ahlbom A, Theorell T. Geographical differences in myocardial infarction incidence in eight Swedish counties, 1976-1981. Epidemiology. 1992 Jul;3(4):348-55.

Hammar N, Andersson T, Reuterwall C, Nilsson T, Knutsson A, Hallqvist J, Ahlbom A. Geographical differences in the incidence of acute myocardial infarction in Sweden. Analyses of possible causes using two parallel case-control studies. J Intern Med. 2001 Feb;249(2):137-44.

Hannelius U, Lindgren CM, Melén E, Malmberg A, von Dobeln U, Kere J. Phenylketonuria screening registry as a resource for population genetic studies. J Med Genet. 2005 Oct;42(10):e60.

Hannelius U, Gherman L, Mäkelä VV, Lindstedt A, Zucchelli M, Lagerberg C, Tybring G, Kere J, Lindgren CM. Large-scale zygosity testing using single nucleotide polymorphisms. Twin Res Hum Genet. 2007 Aug;10(4):604-25.

Hartl DL, Clark AG. Principles of Population Genetics. 4th ed. Sinauer Associates, Sunderland. 2007; 652 p.

Havulinna AS, Pääkkönen R, Karvonen M, Salomaa V. Geographic patterns of incidence of ischemic stroke and acute myocardial infarction in Finland during 1991-2003. Ann Epidemiol. 2008 Mar;18(3):206-13.

Heath SC, Gut IG, Brennan P, McKay JD, Bencko V, Fabianova E, Foretova L, Georges M, Janout V, Kabesch M, Krokan HE, Elvestad MB, Lissowska J, Mates D, Rudnai P, Skorpen F, Schreiber S, Soria JM, Syvänen AC, Meneton P, Herçberg S, Galan P, Szeszenia-Dabrowska N, Zaridze D, Génin E, Cardon LR, Lathrop M. Investigation of the fine structure of European populations with applications to disease association studies. Eur J Hum Genet. 2008 Dec;16(12):1413-29.

Hedlund E, Kaprio J, Lange A, Koskenvuo M, Jartti L, Rönnemaa T, Hammar N. Migration and coronary heart disease: A study of Finnish twins living in Sweden and their co-twins residing in Finland. Scand J Public Health. 2007;35(5):468-74.

Hedman M, Pimenoff V, Lukka M, Sistonen P, Sajantila A. Analysis of 16 Y STR loci in the Finnish population reveals a local reduction in the diversity of male lineages. Forensic Sci Int. 2004 May 28;142(1):37-43.

Hedman M, Brandstätter A, Pimenoff V, Sistonen P, Palo JU, Parson W, Sajantila A. Finnish mitochondrial DNA HVS-I and HVS-II population data. Forensic Sci Int. 2007 Oct 25;172(2-3):171-8.

Hedrick PW. Genetics of populations. 3rd ed. Jones and Bartlett Publishers, Sudbury. 2005; 737 p.

Heikinheimo H, Fortelius M, Eronen J, Mannila H. Biogeography of European land mammals shows environmentally distinct and spatially coherent clusters. J Biogeogr. 2007;34(6):1053-64.

Hewitt G. The genetic legacy of the ice ages. Nature. 2000 Jun 22;405(6789):907-13.

Hillier LW, Coulson A, Murray JI, Bao Z, Sulston JE, Waterston RH. Genomics in C. elegans: so many genes, such a little worm. Genome Res. 2005 Dec;15(12):1651-60.

Hofreiter M, Serre D, Poinar HN, Kuch M, Pääbo S. Ancient DNA. Nat Rev Genet. 2001 May;2(5):353-9.

Hoggart CJ, Shriver MD, Kittles RA, Clayton DG, McKeigue PM. Design and analysis of admixture mapping studies. Am J Hum Genet. 2004 May;74(5):965-78.

Hoggart CJ, O’Reilly PF, Kaakinen M, Zhang W, Chambers JC, Kooner JS, Coin LJ, Jarvelin MR. Fine-scale estimation of location of birth from genome-wide single-nucleotide polymorphism data. Genetics. 2012 Feb;190(2):669-77.

Holmlund G, Nilsson H, Lango-Warensjo A, Rosén B, Lindblom B. Y-chromosome STR haplotypes in a Swedish population. In: Brinkmann B, Carracedo A (eds). Progress in Forensic Genetics 9. International Congress Series 1239. Elsevier, Amsterdam. 2003; 343–8.

Holmlund G, Nilsson H, Karlsson A, Lindblom B. Y-chromosome STR haplotypes in Sweden. Forensic Sci Int. 2006 Jun 27;160(1):66-79.

Honkola T, Vesakoski O, Korhonen K, Lehtinen J, Syrjänen K, Wahlberg N. Cultural and climatic changes shape the evolutionary history of the . Unpublished.

Hovatta I, Terwilliger JD, Lichtermann D, Mäkikyrö T, Suvisaari J, Peltonen L, Lönnqvist J. Schizophrenia in the genetic isolate of Finland. Am J Med Genet. 1997 Jul 25;74(4):353-60.

Humphreys K, Grankvist A, Leu M, Hall P, Liu J, Ripatti S, Rehnström K, Groop L, Klareskog L, Ding B, Grönberg H, Xu J, Pedersen NL, Lichtenstein P, Mattingsdal M, Andreassen OA, O’Dushlaine C, Purcell SM, Sklar P, Sullivan PF, Hultman CM, Palmgren J, Magnusson PK. The genetic structure of the Swedish population. PLoS One. 2011;6(8):e22547.

Huurre M. 9000 vuotta Suomen esihistoriaa. 9th ed. Otava, Helsinki. 2005; 272 p.

Huyghe JR, Fransen E, Hannula S, Van Laer L, Van Eyken E, Mäki-Torkko E, Lysholm-Bernacchi A, Aikio P, Stephan DA, Sorri M, Huentelman MJ, Van Camp G. Genome-wide SNP analysis reveals no gain in power for association studies of common variants in the Finnish Saami. Eur J Hum Genet. 2010 May;18(5):569-74.

Huyghe JR, Fransen E, Hannula S, Van Laer L, Van Eyken E, Mäki-Torkko E, Aikio P, Sorri M, Huentelman MJ, Van Camp G. A genome-wide analysis of population structure in the Finnish Saami with implications for genetic association studies. Eur J Hum Genet. 2011 Mar;19(3):347-52.

Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C. Detection of large-scale variation in the human genome. Nat Genet. 2004 Sep;36(9):949-51.

Ikäheimo I, Silvennoinen-Kassinen S, Tiilikainen A. HLA five-locus haplotypes in Finns. Eur J Immunogenet. 1996 Aug;23(4):321-8.

Ingman M, Gyllensten U. A recent genetic link between Sami and the Volga-Ural region of Russia. Eur J Hum Genet. 2007 Jan;15(1):115-20.

International HapMap Consortium. The International HapMap Project. Nature. 2003 Dec 18;426(6968):789-96.

International HapMap Consortium. A haplotype map of the human genome. Nature. 2005 Oct 27;437(7063):1299-320.

International HapMap Consortium, Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, Zeng C, Gao Y, Hu H, Hu W, Li C, Lin W, Liu S, Pan H, Tang X, Wang J, Wang W, Yu J, Zhang B, Zhang Q, Zhao H, Zhao H, Zhou J, Gabriel SB, Barry R, Blumenstiel B, Camargo A, Defelice M, Faggart M, Goyette M, Gupta S, Moore J, Nguyen H, Onofrio RC, Parkin M, Roy J, Stahl E, Winchester E, Ziaugra L, Altshuler D, Shen Y, Yao Z, Huang W, Chu X, He Y, Jin L, Liu Y, Shen Y, Sun W, Wang H, Wang Y, Wang Y, Xiong X, Xu L, Waye MM, Tsui SK, Xue H, Wong JT, Galver LM, Fan JB, Gunderson K, Murray SS, Oliphant AR, Chee MS, Montpetit A, Chagnon F, Ferretti V, Leboeuf M, Olivier JF, Phillips MS, Roumy S, Sallée C, Verner A, Hudson TJ, Kwok PY, Cai D, Koboldt DC, Miller RD, Pawlikowska L, Taillon-Miller P, Xiao M, Tsui LC, Mak W, Song YQ, Tam PK, Nakamura Y, Kawaguchi T, Kitamoto T, Morizono T, Nagashima A, Ohnishi Y, Sekine A, Tanaka T, Tsunoda T, Deloukas P, Bird CP, Delgado M, Dermitzakis ET, Gwilliam R, Hunt S, Morrison J, Powell D, Stranger BE, Whittaker P, Bentley DR, Daly MJ, de Bakker PI, Barrett J, Chretien YR, Maller J, McCarroll S, Patterson N, Pe’er I, Price A, Purcell S, Richter DJ, Sabeti P, Saxena R, Schaffner SF, Sham PC, Varilly P, Altshuler D, Stein LD, Krishnan L, Smith AV, Tello-Ruiz MK, Thorisson GA, Chakravarti A, Chen PE, Cutler DJ, Kashuk CS, Lin S, Abecasis GR, Guan W, Li Y, Munro HM, Qin ZS, Thomas DJ, McVean G, Auton A, Bottolo L, Cardin N, Eyheramendy S, Freeman C, Marchini J, Myers S, Spencer C, Stephens M, Donnelly P, Cardon LR, Clarke G, Evans DM, Morris AP, Weir BS, Tsunoda T, Mullikin JC, Sherry ST, Feolo M, Skol A, Zhang H, Zeng C, Zhao H, Matsuda I, Fukushima Y, Macer DR, Suda E, Rotimi CN, Adebamowo CA, Ajayi I, Aniagwu T, Marshall PA, Nkwodimmah C, Royal CD, Leppert MF, Dixon M, Peiffer A, Qiu R, Kent A, Kato K, Niikawa N, Adewole IF, Knoppers BM, Foster 
MW, Clayton EW, Watkin J, Gibbs RA, Belmont JW, Muzny D, Nazareth L, Sodergren E, Weinstock GM, Wheeler DA, Yakub I, Gabriel SB, Onofrio RC, Richter DJ, Ziaugra L, Birren BW, Daly MJ, Altshuler D, Wilson RK, Fulton LL, Rogers J, Burton J, Carter NP, Clee CM, Griffiths M, Jones MC, McLay K, Plumb RW, Ross MT, Sims SK, Willey DL, Chen Z, Han H, Kang L, Godbout M, Wallenburg JC, L’Archevêque P, Bellemare G, Saeki K, Wang H, An D, Fu H, Li Q, Wang Z, Wang R, Holden AL, Brooks LD, McEwen JE, Guyer MS, Wang VO, Peterson JL, Shi M, Spiegel J, Sung LM, Zacharia LF, Collins FS, Kennedy K, Jamieson R, Stewart J. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007 Oct 18;449(7164):851-61.

International HapMap 3 Consortium, Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Peltonen L, Dermitzakis E, Bonnen PE, Altshuler DM, Gibbs RA, de Bakker PI, Deloukas P, Gabriel SB, Gwilliam R, Hunt S, Inouye M, Jia X, Palotie A, Parkin M, Whittaker P, Yu F, Chang K, Hawes A, Lewis LR, Ren Y, Wheeler D, Gibbs RA, Muzny DM, Barnes C, Darvishi K, Hurles M, Korn JM, Kristiansson K, Lee C, McCarrol SA, Nemesh J, Dermitzakis E, Keinan A, Montgomery SB, Pollack S, Price AL, Soranzo N, Bonnen PE, Gibbs RA, Gonzaga-Jauregui C, Keinan A, Price AL, Yu F, Anttila V, Brodeur W, Daly MJ, Leslie S, McVean G, Moutsianas L, Nguyen H, Schaffner SF, Zhang Q, Ghori MJ, McGinnis R, McLaren W, Pollack S, Price AL, Schaffner SF, Takeuchi F, Grossman SR, Shlyakhter I, Hostetter EB, Sabeti PC, Adebamowo CA, Foster MW, Gordon DR, Licinio J, Manca MC, Marshall PA, Matsuda I, Ngare D, Wang VO, Reddy D, Rotimi CN, Royal CD, Sharp RR, Zeng C, Brooks LD, McEwen JE. Integrating common and rare genetic variation in diverse human populations. Nature. 2010 Sep 2;467(7311):52-8.

International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004 Oct 21;431(7011):931-45.

Jakkula E, Rehnström K, Varilo T, Pietiläinen OP, Paunio T, Pedersen NL, deFaire U, Järvelin MR, Saharinen J, Freimer N, Ripatti S, Purcell S, Collins A, Daly MJ, Palotie A, Peltonen L. The genome-wide patterns of variation expose significant substructure in a founder population. Am J Hum Genet. 2008 Dec;83(6):787-94.

Jakobsson M, Scholz SW, Scheet P, Gibbs JR, VanLiere JM, Fung HC, Szpiech ZA, Degnan JH, Wang K, Guerreiro R, Bras JM, Schymick JC, Hernandez DG, Traynor BJ, Simon-Sanchez J, Matarin M, Britton A, van de Leemput J, Rafferty I, Bucan M, Cann HM, Hardy JA, Rosenberg NA, Singleton AB. Genotype, haplotype and copy-number variation in worldwide human populations. Nature. 2008 Feb 21;451(7181):998-1003.

Jartti L, Rönnemaa T, Kaprio J, Järvisalo MJ, Toikka JO, Marniemi J, Hammar N, Alfredsson L, Saraste M, Hartiala J, Koskenvuo M, Raitakari OT. Population-based twin study of the effects of migration from Finland to Sweden on endothelial function and intima-media thickness. Arterioscler Thromb Vasc Biol. 2002a May 1;22(5):832-7.

Jartti L, Raitakari OT, Kaprio J, Järvisalo MJ, Toikka JO, Marniemi J, Hammar N, Luotolahti M, Koskenvuo M, Rönnemaa T. Increased carotid intima-media thickness in men born in east Finland: a twin study of the effects of birthplace and migration to Sweden on subclinical atherosclerosis. Ann Med. 2002b;34(3):162-70.

Jartti L, Rönnemaa T, Raitakari OT, Hedlund E, Hammar N, Lassila R, Marniemi J, Koskenvuo M, Kaprio J. Migration at early age from a high to a lower coronary heart disease risk country lowers the risk of subclinical atherosclerosis in middle-aged men. J Intern Med. 2009 Mar;265(3):345-58.

Jobling MA, Hurles M, Tyler-Smith C. Human Evolutionary Genetics: Origins, peoples & disease. Garland Publishing, Abingdon. 2004; 523 p.

Johansson S, Nordin P, Wilhelmsen L, Tibblin G, Johansson BW, Hansen O, Ahlmark G, Jacobsson S, Michaeli E, Gillnäs T. Geographical variation and time trends in the attack rate of coronary heart disease in five Swedish cities. J Intern Med. 1992 May;231(5):511-20.

Johansson A, Vavruch-Nilsson V, Edin-Liljegren A, Sjölander P, Gyllensten U. Linkage disequilibrium between microsatellite markers in the Swedish Sami relative to a worldwide selection of populations. Hum Genet. 2005 Jan;116(1-2):105-13.

Johansson A, Ingman M, Mack SJ, Erlich H, Gyllensten U. Genetic origin of the Swedish Sami inferred from HLA class I and class II allele frequencies. Eur J Hum Genet. 2008 Nov;16(11):1341-9.

Johnson NA, Coram MA, Shriver MD, Romieu I, Barsh GS, London SJ, Tang H. Ancestral components of admixed genomes in a Mexican cohort. PLoS Genet. 2011 Dec;7(12):e1002410.

Jorde LB, Watkins WS, Bamshad MJ. Population genomics: a bridge from evolutionary history to genetic medicine. Hum Mol Genet. 2001 Oct 1;10(20):2199-207.

Jousilahti P, Vartiainen E, Tuomilehto J, Pekkanen J, Puska P. Role of known risk factors in explaining the difference in the risk of coronary heart disease between eastern and southwestern Finland. Ann Med. 1998 Oct;30(5):481-7.

Juonala M, Viikari JS, Kähönen M, Taittonen L, Rönnemaa T, Laitinen T, Mäki-Torkko N, Mikkilä V, Räsänen L, Akerblom HK, Pesonen E, Raitakari OT. Geographic origin as a determinant of carotid artery intima-media thickness and brachial artery flow-mediated dilation: the Cardiovascular Risk in Young Finns study. Arterioscler Thromb Vasc Biol. 2005 Feb;25(2):392-8.

Juonala M, Viikari JS, Raitakari OT. Miksi sepelvaltimotauti on edelleen enemmän itä- kuin länsisuomalaisten vaiva? Duodecim. 2006;122(1):55-61.

Juvonen V, Kulmala SM, Ignatius J, Penttinen M, Savontaus ML. Dissecting the epidemiology of a trinucleotide repeat disease - example of FRDA in Finland. Hum Genet. 2002 Jan;110(1):36-40.

Kaessmann H, Zöllner S, Gustafsson AC, Wiebe V, Laan M, Lundeberg J, Uhlén M, Pääbo S. Extensive linkage disequilibrium in small human populations in Eurasia. Am J Hum Genet. 2002 Mar;70(3):673-85.

Kaipio J, Näyhä S, Valtonen V. Fluoride in the drinking water and the geographical variation of coronary heart disease in Finland. Eur J Cardiovasc Prev Rehabil. 2004 Feb;11(1):56-62.

Kallio P. Suomen kantakielten absoluuttista kronologiaa. Virittäjä. 2006;110(1):2-25.

Karlsson AO, Wallerström T, Götherström A, Holmlund G. Y-chromosome diversity in Sweden - a long-time perspective. Eur J Hum Genet. 2006 Aug;14(8):963-70.

Karvonen M, Moltchanova E, Viik-Kajander M, Moltchanov V, Rytkönen M, Kousa A, Tuomilehto J. Regional inequality in the risk of acute myocardial infarction in Finland: a case study of 35- to 74-year-old men. Heart Drug. 2002;2:51-60.

Keränen J. Kustaa ja uskonpuhdistuksen aika. In: Zetterberg S (ed). Suomen historian pikkujättiläinen. WSOY, Porvoo. 2003; 138-91.

Kere J, Norio R, Savilahti E, Estivill X, de la Chapelle A. Cystic fibrosis in Finland: a molecular and genealogical study. Hum Genet. 1989 Aug;83(1):20-5.

Kere J, Estivill X, Chillón M, Morral N, Nunes V, Norio R, Savilahti E, de la Chapelle A. Cystic fibrosis in a low-incidence population: two major mutations in Finland. Hum Genet. 1994 Feb;93(2):162-6.

Kere J. Human population genetics: lessons from Finland. Annu Rev Genomics Hum Genet. 2001;2:103-28.

Kidd KK, Pakstis AJ, Speed WC, Kidd JR. Understanding human DNA sequence variation. J Hered. 2004 Sep-Oct;95(5):406-20.

Kirchhoff M, Davidsen M, Brønnum-Hansen H, Hansen B, Schnack H, Eriksen LS, Madsen M, Schroll M. Incidence of myocardial infarction in the Danish MONICA population 1982-1991. Int J Epidemiol. 1999 Apr;28(2):211-8.

Kittles RA, Perola M, Peltonen L, Bergen AW, Aragon RA, Virkkunen M, Linnoila M, Goldman D, Long JC. Dual origins of Finns revealed by Y chromosome haplotype variation. Am J Hum Genet. 1998 May;62(5):1171-9.

Kittles RA, Bergen AW, Urbanek M, Virkkunen M, Linnoila M, Goldman D, Long JC. Autosomal, mitochondrial, and Y chromosome DNA variation in Finland: evidence for a male-specific bottleneck. Am J Phys Anthropol. 1999 Apr;108(4):381-99.

Knopp T, Merilä J. The postglacial recolonization of Northern Europe by Rana arvalis as revealed by microsatellite and mitochondrial DNA analyses. Heredity (Edinb). 2009 Feb;102(2):174-81.

Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir S, Barnard J, Hallbeck B, Masson G, Shlien A, Palsson ST, Frigge ML, Thorgeirsson TE, Gulcher JR, Stefansson K. A high-resolution recombination map of the human genome. Nat Genet. 2002 Jul;31(3):241-7.

Kousa A, Moltchanova E, Viik-Kajander M, Rytkönen M, Tuomilehto J, Tarvainen T, Karvonen M. Geochemistry of ground water and the incidence of acute myocardial infarction in Finland. J Epidemiol Community Health. 2004 Feb;58(2):136-9.

Kousa A, Havulinna AS, Moltchanova E, Taskinen O, Nikkarinen M, Eriksson J, Karvonen M. Calcium:magnesium ratio in local groundwater and incidence of acute myocardial infarction among males in rural Finland. Environ Health Perspect. 2006 May;114(5):730-4.

Krawczak M, Nikolaus S, von Eberstein H, Croucher PJ, El Mokhtari NE, Schreiber S. PopGen: population-based recruitment of patients and controls for the analysis of complex genotype-phenotype relationships. Community Genet. 2006;9(1):55-61.

Kristiansson K, Naukkarinen J, Peltonen L. Isolated populations and complex disease gene identification. Genome Biol. 2008;9(8):109.

Kuulasmaa K, Tunstall-Pedoe H, Dobson A, Fortmann S, Sans S, Tolonen H, Evans A, Ferrario M, Tuomilehto J. Estimation of contribution of changes in classic risk factors to trends in coronary-event rates across the WHO MONICA Project populations. Lancet. 2000 Feb 26;355(9205):675-87.

Laakso J (ed). Uralilaiset kansat: tietoa suomen sukukielistä ja niiden puhujista. WSOY, Porvoo. 1991; 329 p.

Laan M, Pääbo S. Demographic history and linkage disequilibrium in human populations. Nat Genet. 1997 Dec;17(4):435-8.

Laan M, Wiebe V, Khusnutdinova E, Remm M, Pääbo S. X-chromosome as a marker for population history: linkage disequilibrium and haplotype study in Eurasian populations. Eur J Hum Genet. 2005 Apr;13(4):452-62.

Lahermo P, Sajantila A, Sistonen P, Lukka M, Aula P, Peltonen L, Savontaus ML. The genetic relationship between the Finns and the Finnish Saami (Lapps): analysis of nuclear DNA and mtDNA. Am J Hum Genet. 1996 Jun;58(6):1309-22.

Lahermo P, Savontaus ML, Sistonen P, Béres J, de Knijff P, Aula P, Sajantila A. Y chromosomal polymorphisms reveal founding lineages in the Finns and the Saami. Eur J Hum Genet. 1999 May-Jun;7(4):447-58.

Laine A. Suomi sodassa. In: Zetterberg S (ed). Suomen historian pikkujättiläinen. WSOY, Porvoo. 2003; 690-727.

Laitinen V, Lahermo P, Sistonen P, Savontaus ML. Y-chromosomal diversity suggests that Baltic males share common Finno-Ugric-speaking forefathers. Hum Hered. 2002;53(2):68-78.

Laks T, Tuomilehto J, Jõeste E, Mäeots E, Salomaa V, Palomäki P, Pullisaar O, Karu K, Torppa J. Alarmingly high occurrence and case fatality of acute coronary heart disease events in Estonia: results from the Tallinn AMI register 1991-94. J Intern Med. 1999 Jul;246(1):53-60.

Lander ES, Botstein D. Homozygosity mapping: a way to map human recessive traits with the DNA of inbred children. Science. 1987 Jun 19;236(4808):1567-70.

Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey 
JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ; International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921.

Lander ES. Initial impact of the sequencing of the human genome. Nature. 2011 Feb 10;470(7333):187-97.

Lao O, Lu TT, Nothnagel M, Junge O, Freitag-Wolf S, Caliebe A, Balascakova M, Bertranpetit J, Bindoff LA, Comas D, Holmlund G, Kouvatsi A, Macek M, Mollet I, Parson W, Palo J, Ploski R, Sajantila A, Tagliabracci A, Gether U, Werge T, Rivadeneira F, Hofman A, Uitterlinden AG, Gieger C, Wichmann HE, Rüther A, Schreiber S, Becker C, Nürnberg P, Nelson MR, Krawczak M, Kayser M. Correlation between genetic and geographic structure in Europe. Curr Biol. 2008 Aug 26;18(16):1241-8.

Lappalainen T, Koivumäki S, Salmela E, Huoponen K, Sistonen P, Savontaus ML, Lahermo P. Regional differences among the Finns: a Y-chromosomal perspective. Gene. 2006 Jul 19;376(2):207-15.

Lappalainen T, Laitinen V, Salmela E, Andersen P, Huoponen K, Savontaus ML, Lahermo P. Migration waves to the Baltic Sea region. Ann Hum Genet. 2008 May;72(Pt 3):337-48.

Lappalainen T. in the Baltic Sea region: features of population history and natural selection. PhD thesis. Helsinki University Print, Helsinki. 2009; 83 p.

Lappalainen T, Hannelius U, Salmela E, von Döbeln U, Lindgren CM, Huoponen K, Savontaus ML, Kere J, Lahermo P. Population structure in contemporary Sweden--a Y-chromosomal and mitochondrial DNA analysis. Ann Hum Genet. 2009 Jan;73(1):61-73.

Lappalainen T, Salmela E, Andersen PM, Dahlman-Wright K, Sistonen P, Savontaus ML, Schreiber S, Lahermo P, Kere J. Genomic landscape of positive natural selection in Northern European populations. Eur J Hum Genet. 2010 Apr;18(4):471-8.

Leskinen H. Suomen murteiden synty. In: Fogelberg P (ed). Pohjan poluilla: suomalaisten juuret nykytutkimuksen mukaan. Bidrag till kännedom av Finlands natur och folk 153. Finska Vetenskaps-Societeten, Helsinki. 1999; 358-71.

Leu M, Humphreys K, Surakka I, Rehnberg E, Muilu J, Rosenström P, Almgren P, Jääskeläinen J, Lifton RP, Kyvik KO, Kaprio J, Pedersen NL, Palotie A, Hall P, Grönberg H, Groop L, Peltonen L, Palmgren J, Ripatti S. NordicDB: a Nordic pool and portal for genome-wide control data. Eur J Hum Genet. 2010 Dec;18(12):1322-6.

Levo A, Jääskeläinen J, Sistonen P, Sirén MK, Voutilainen R, Partanen J. Tracing past population migrations: genealogy of steroid 21-hydroxylase (CYP21) gene mutations in Finland. Eur J Hum Genet. 1999 Feb-Mar;7(2):188-96.

Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, Venter JC. The diploid genome sequence of an individual human. PLoS Biol. 2007 Sep 4;5(10):e254.

Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, Myers RM. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008 Feb 22;319(5866):1100-4.

Lindkvist T. Early political organisation. Introductory survey. In: Helle K (ed). The Cambridge history of Scandinavia. Volume I: Prehistory to 1520. Cambridge University Press, Cambridge. 2003a; 160-7.

Lindkvist T. Kings and provinces in Sweden. In: Helle K (ed). The Cambridge history of Scandinavia. Volume I: Prehistory to 1520. Cambridge University Press, Cambridge. 2003b; 221-34.

Lindqvist H. A : from Ice Age to our age. Norstedts, Stockholm. 2006; 727 p.

Litt M, Luty JA. A hypervariable microsatellite revealed by in vitro amplification of a dinucleotide repeat within the cardiac muscle actin gene. Am J Hum Genet. 1989 Mar;44(3):397-401.

Lu TT, Lao O, Nothnagel M, Junge O, Freitag-Wolf S, Caliebe A, Balascakova M, Bertranpetit J, Bindoff LA, Comas D, Holmlund G, Kouvatsi A, Macek M, Mollet I, Nielsen F, Parson W, Palo J, Ploski R, Sajantila A, Tagliabracci A, Gether U, Werge T, Rivadeneira F, Hofman A, Uitterlinden AG, Gieger C, Wichmann HE, Ruether A, Schreiber S, Becker C, Nürnberg P, Nelson MR, Kayser M, Krawczak M. An evaluation of the genetic-matched pair study design using genome-wide SNP data from the European population. Eur J Hum Genet. 2009 Jul;17(7):967-75.

Lundmark PE, Liljedahl U, Boomsma DI, Mannila H, Martin NG, Palotie A, Peltonen L, Perola M, Spector TD, Syvänen AC. Evaluation of HapMap data in six populations of European descent. Eur J Hum Genet. 2008 Sep;16(9):1142-50.

Mallory JP. In search of the Indo-Europeans: language, archaeology and myth. Thames & Hudson, London. 1989; 288 p.

Malmström H, Gilbert MT, Thomas MG, Brandström M, Storå J, Molnar P, Andersen PK, Bendixen C, Holmlund G, Götherström A, Willerslev E. Ancient DNA reveals lack of continuity between neolithic hunter-gatherers and contemporary Scandinavians. Curr Biol. 2009 Nov 3;19(20):1758-62.

Malmström H, Linderholm A, Lidén K, Storå J, Molnar P, Holmlund G, Jakobsson M, Götherström A. High frequency of lactose intolerance in a prehistoric hunter-gatherer population in northern Europe. BMC Evol Biol. 2010 Mar 30;10:89.

Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM. Finding the missing heritability of complex diseases. Nature. 2009 Oct 8;461(7265):747-53.

Mäntylä I. Suurvaltakausi. In: Zetterberg S (ed). Suomen historian pikkujättiläinen. WSOY, Porvoo. 2003a; 192-275.

Mäntylä I. Kustavilainen aika. In: Zetterberg S (ed). Suomen historian pikkujättiläinen. WSOY, Porvoo. 2003b; 314-57.

Mao X, Bigham AW, Mei R, Gutierrez G, Weiss KM, Brutsaert TD, Leon-Velarde F, Moore LG, Vargas E, McKeigue PM, Shriver MD, Parra EJ. A genomewide admixture mapping panel for Hispanic/Latino populations. Am J Hum Genet. 2007 Jun;80(6):1171-8.

Martin ER. Linkage disequilibrium and association analysis. In: Haines JL, Pericak-Vance MA (eds). Genetic analysis of complex disease. John Wiley & Sons, Hoboken. 2006; 329-353.

McEvoy BP, Montgomery GW, McRae AF, Ripatti S, Perola M, Spector TD, Cherkas L, Ahmadi KR, Boomsma D, Willemsen G, Hottenga JJ, Pedersen NL, Magnusson PK, Kyvik KO, Christensen K, Kaprio J, Heikkilä K, Palotie A, Widen E, Muilu J, Syvänen AC, Liljedahl U, Hardiman O, Cronin S, Peltonen L, Martin NG, Visscher PM. Geographical structure and differential natural selection among North European populations. Genome Res. 2009 May;19(5):804-14.

McVean G. A genealogical interpretation of principal components analysis. PLoS Genet. 2009 Oct;5(10):e1000686.

Meigs JB, Shrader P, Sullivan LM, McAteer JB, Fox CS, Dupuis J, Manning AK, Florez JC, Wilson PW, D’Agostino RB Sr, Cupples LA. Genotype score in addition to common risk factors for prediction of type 2 diabetes. N Engl J Med. 2008 Nov 20;359(21):2208-19.

Meinander CF. Kivikautemme väestöhistoria. In: Gallén J (ed). Suomen väestön esihistorialliset juuret. Bidrag till kännedom av Finlands natur och folk 131. Finska Vetenskaps-Societeten, Helsinki. 1984; 21-48.

Meinilä M, Finnilä S, Majamaa K. Evidence for mtDNA admixture between the Finns and the Saami. Hum Hered. 2001;52(3):160-70.

Melin J, Johansson AW, Hedenborg S. Sveriges historia: koncentrerad uppslagsbok - fakta, årtal, kartor, tabeller. 3rd ed. Prisma, Stockholm. 2003; 508 p.

Mitchell D, Willerslev E, Hansen A. Damage and repair of ancient DNA. Mutat Res. 2005 Apr 1;571(1-2):265-76.

Moisio AL, Sistonen P, Weissenbach J, de la Chapelle A, Peltomäki P. Age and origin of two common MLH1 mutations predisposing to hereditary colon cancer. Am J Hum Genet. 1996 Dec;59(6):1243-51.

Mökkönen T. Studies on Stone Age Housepits in Fennoscandia (4000 - 2000 cal BC): Changes in ground plan, site location and degree of sedentism. PhD thesis. Helsinki University Print, Helsinki. 2011; 86 p.

Moorjani P, Patterson N, Hirschhorn JN, Keinan A, Hao L, Atzmon G, Burns E, Ostrer H, Price AL, Reich D. The history of African gene flow into Southern Europeans, Levantines, and Jews. PLoS Genet. 2011 Apr;7(4):e1001373.

Morin PA, Martien KK, Taylor BL. Assessing statistical power of SNPs for population structure and conservation studies. Mol Ecol Resour. 2009 Jan;9(1):66-73.

Morris RW, Whincup PH, Lampe FC, Walker M, Wannamethee SG, Shaper AG. Geographic variation in incidence of coronary heart disease in Britain: the contribution of established risk factors. Heart. 2001 Sep;86(3):277-83.

Morris RW, Whincup PH, Emberson JR, Lampe FC, Walker M, Shaper AG. North-south gradients in Britain for stroke and CHD: are they explained by the same factors? Stroke. 2003 Nov;34(11):2604-9.

Myers S, Bottolo L, Freeman C, McVean G, Donnelly P. A fine-scale map of recombination rates and hotspots across the human genome. Science. 2005 Oct 14;310(5746):321-4.

Mykkänen K, Savontaus ML, Juvonen V, Sistonen P, Tuisku S, Tuominen S, Penttinen M, Lundkvist J, Viitanen M, Kalimo H, Pöyhönen M. Detection of the founder effect in Finnish CADASIL families. Eur J Hum Genet. 2004 Oct;12(10):813-9.

Nachman MW, Crowell SL. Estimate of the mutation rate per nucleotide in humans. Genetics. 2000 Sep;156(1):297-304.

Nanovfszky G (ed). The Finno-Ugric world. Teleki László Foundation, Budapest. 2004; 548 p.

Näyhä S. Environmental temperature and mortality. Int J Circumpolar Health. 2005 Dec;64(5):451-8.

Nelis M, Esko T, Mägi R, Zimprich F, Zimprich A, Toncheva D, Karachanak S, Piskácková T, Balascák I, Peltonen L, Jakkula E, Rehnström K, Lathrop M, Heath S, Galan P, Schreiber S, Meitinger T, Pfeufer A, Wichmann HE, Melegh B, Polgár N, Toniolo D, Gasparini P, D’Adamo P, Klovins J, Nikitina-Zake L, Kucinskas V, Kasnauskiene J, Lubinski J, Debniak T, Limborska S, Khrunin A, Estivill X, Rabionet R, Marsal S, Julià A, Antonarakis SE, Deutsch S, Borel C, Attar H, Gagnebin M, Macek M, Krawczak M, Remm M, Metspalu A. Genetic structure of Europeans: a view from the North-East. PLoS One. 2009;4(5):e5472.

Nerbrand C, Svärdsudd K, Hörte LG, Tibblin G. Geographical variation of mortality from cardiovascular diseases. The Project ’Myocardial Infarction in mid-Sweden’. Eur Heart J. 1991 Jan;12(1):4-9.

Nerbrand C, Svärdsudd K, Ek J, Tibblin G. Cardiovascular mortality and morbidity in seven counties in Sweden in relation to water hardness and geological settings. The project: myocardial infarction in mid-Sweden. Eur Heart J. 1992 Jun;13(6):721-7.

Nevanlinna HR. The Finnish population structure. A genetic and genealogical study. Hereditas. 1972;71(2):195-236.

Nevanlinna HR. Suomalaisten juuret geneettisen merkkiominaisuustutkimuksen valossa. In: Gallén J (ed). Suomen väestön esihistorialliset juuret. Bidrag till kännedom av Finlands natur och folk 131. Finska Vetenskaps-Societeten, Helsinki. 1984; 157–74.

Nielsen R, Hubisz MJ, Clark AG. Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data. Genetics. 2004 Dec;168(4):2373-82.

Norio R, Nevanlinna HR, Perheentupa J. Hereditary diseases in Finland; rare flora in rare soil. Ann Clin Res. 1973 Jun;5(3):109-41.

Norio R. Finnish Disease Heritage I: characteristics, causes, background. Hum Genet. 2003a May;112(5-6):441-56.

Norio R. Finnish Disease Heritage II: population prehistory and genetic roots of Finns. Hum Genet. 2003b May;112(5-6):457-69.

Norio R. The Finnish Disease Heritage III: the individual diseases. Hum Genet. 2003c May;112(5-6):470-526.

Norman H. Emigrationen. In: Melin J, Johansson AW, Hedenborg S. Sveriges historia: koncentrerad uppslagsbok - fakta, årtal, kartor, tabeller. 3rd ed. Prisma, Stockholm. 2003; 277.

Novembre J, Stephens M. Interpreting principal component analyses of spatial population genetic variation. Nat Genet. 2008 May;40(5):646-9.

Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, Stephens M, Bustamante CD. Genes mirror geography within Europe. Nature. 2008 Nov 6;456(7218):98-101.

Nychka D. fields: Tools for spatial data. R package version 4.1. 2007.

Nygård T, Kallio V. Rajamaa. In: Zetterberg S (ed). Suomen historian pikkujättiläinen. WSOY, Porvoo. 2003; 534-93.

Nylander PO, Beckman L. Population studies in northern Sweden. XVII. Estimates of Finnish and Saamish influence. Hum Hered. 1991;41(3):157-67.

O’Dushlaine CT, Morris D, Moskvina V, Kirov G, Consortium IS, Gill M, Corvin A, Wilson JF, Cavalleri GL. Population structure and genome-wide patterns of variation in Ireland and Britain. Eur J Hum Genet. 2010a Nov;18(11):1248-54.

O’Dushlaine C, McQuillan R, Weale ME, Crouch DJ, Johansson A, Aulchenko Y, Franklin CS, Polašek O, Fuchsberger C, Corvin A, Hicks AA, Vitart V, Hayward C, Wild SH, Meitinger T, van Duijn CM, Gyllensten U, Wright AF, Campbell H, Pramstaller PP, Rudan I, Wilson JF. Genes predict of origin in rural Europe. Eur J Hum Genet. 2010b Nov;18(11):1269-70.

Official Statistics of Finland (OSF): Preliminary population statistics [e-publication]. ISSN=2243-3627. Helsinki: Statistics Finland. Accessed in August 2012.

Olkkonen T. Modernisoituva sääty-yhteiskunta. In: Zetterberg S (ed). Suomen historian pikkujättiläinen. WSOY, Porvoo. 2003; 462-533.

Orrman E. Suomen väestön ja asutuksen kehitys keskiajalla. In: Fogelberg P (ed). Pohjan poluilla: suomalaisten juuret nykytutkimuksen mukaan. Bidrag till kännedom av Finlands natur och folk 153. Finska Vetenskaps-Societeten, Helsinki. 1999; 375-84.

Pajunen P, Pääkkönen R, Juolevi A, Hämäläinen H, Keskimäki I, Laatikainen T, Moltchanov V, Niemi M, Rintanen H, Salomaa V. Trends in fatal and non-fatal coronary heart disease events in Finland during 1991-2001. Scand Cardiovasc J. 2004 Dec;38(6):340-4.

Palo JU, Hedman M, Ulmanen I, Lukka M, Sajantila A. High degree of Y-chromosomal divergence within Finland--forensic aspects. Forensic Sci Int Genet. 2007 Jun;1(2):120-4.

Palo JU, Pirttimaa M, Bengs A, Johnsson V, Ulmanen I, Lukka M, Udd B, Sajantila A. The effect of number of loci on geographical structuring and forensic applicability of Y-STR data in Finland. Int J Legal Med. 2008 Nov;122(6):449-56.

Palo JU, Ulmanen I, Lukka M, Ellonen P, Sajantila A. Genetic markers and population history: Finland revisited. Eur J Hum Genet. 2009 Oct;17(10):1336-46.

Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008 Dec;40(12):1413-5.

Parducci L, Jørgensen T, Tollefsrud MM, Elverland E, Alm T, Fontana SL, Bennett KD, Haile J, Matetovici I, Suyama Y, Edwards ME, Andersen K, Rasmussen M, Boessenkool S, Coissac E, Brochmann C, Taberlet P, Houmark-Nielsen M, Larsen NK, Orlando L, Gilbert MT, Kjær KH, Alsos IG, Willerslev E. Glacial survival of boreal trees in northern Scandinavia. Science. 2012 Mar 2;335(6072):1083-6.

Pastinen T, Perola M, Ignatius J, Sabatti C, Tainola P, Levander M, Syvänen AC, Peltonen L. Dissecting a population genome for targeted screening of disease mutations. Hum Mol Genet. 2001 Dec 15;10(26):2961-72.

Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, Oksenberg JR, Hauser SL, Smith MW, O’Brien SJ, Altshuler D, Daly MJ, Reich D. Methods for high-density admixture mapping of disease genes. Am J Hum Genet. 2004 May;74(5):979-1000.

Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006 Dec;2(12):e190.

Pekkanen J, Manton KG, Stallard E, Nissinen A, Karvonen MJ. Risk factor dynamics, mortality and life expectancy differences between eastern and western Finland: the Finnish Cohorts of the Seven Countries Study. Int J Epidemiol. 1992 Apr;21(2):406-19.

Peltonen L, Jalanko A, Varilo T. Molecular genetics of the Finnish disease heritage. Hum Mol Genet. 1999;8(10):1913-23.

Peltonen L. Positional cloning of disease genes: advantages of genetic isolates. Hum Hered. 2000 Jan-Feb;50(1):66-75.

Peltonen L, Palotie A, Lange K. Use of population isolates for mapping complex traits. Nat Rev Genet. 2000 Dec;1(3):182-90.

Pesonen E, Viikari J, Akerblom HK, Räsänen L, Louhivuori K, Sarna S. Geographic origin of the family as a determinant of serum levels of lipids in Finnish children. Circulation. 1986 Jun;73(6):1119-26.

Pesonen E, Norio R, Hirvonen J, Karkola K, Kuusela V, Laaksonen H, Möttönen M, Nikkari T, Raekallio J, Viikari J, et al. Intimal thickening in the coronary arteries of infants and children as an indicator of risk factors for coronary heart disease. Eur Heart J. 1990 Aug;11 Suppl E:53-60.

Pettersson G. Svenska språket under sjuhundra år: en historia om svenskan och dess utforskande. 2nd ed. Studentlitteratur, Lund. 2005; 290 p.

Pettitt P, Niskanen M. Neanderthals in Susiluola cave, Finland, during the last interglacial period? Fennoscandia archaeologica 2005;22:79–87.

Pickrell J, Pritchard J. Inference of population splits and mixtures from genome-wide allele frequency data. Available from Nature Precedings 2012 http://hdl.handle.net/10101/npre.2012.6956.1.

Pitkänen K. Suomen väestön historialliset kehityslinjat. In: Koskinen S, Martelin T, Notkola IL, Notkola V, Pitkänen K, Jalovaara M, Mäenpää E, Ruokolainen A, Ryynänen M, Söderling I (eds). Suomen väestö. Gaudeamus, Helsinki. 2007; 41-75.

Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006 Aug;38(8):904-9.

Price AL, Patterson N, Yu F, Cox DR, Waliszewska A, McDonald GJ, Tandon A, Schirmer C, Neubauer J, Bedoya G, Duque C, Villegas A, Bortolini MC, Salzano FM, Gallo C, Mazzotti G, Tello-Ruiz M, Riba L, Aguilar-Salinas CA, Canizales-Quinteros S, Menjivar M, Klitz W, Henderson B, Haiman CA, Winkler C, Tusie-Luna T, Ruiz-Linares A, Reich D. A genome-wide admixture map for Latino populations. Am J Hum Genet. 2007 Jun;80(6):1024-36.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007 Sep;81(3):559-75.

Puska P, Vartiainen E, Tuomilehto J, Salomaa V, Nissinen A. Changes in premature deaths in Finland: successful long-term prevention of cardiovascular diseases. Bull World Health Organ. 1998;76(4):419-25.

R Development Core Team. R: A language and environment for statistical computing 2.6.2. Available: http://www.R-project.org via the Internet. R Foundation for Statistical Computing, Vienna. 2008; 2673 p.

Raitio M, Lindroos K, Laukkanen M, Pastinen T, Sistonen P, Sajantila A, Syvänen AC. Y-chromosomal SNPs in Finno-Ugric-speaking populations analyzed by minisequencing on microarrays. Genome Res. 2001 Mar;11(3):471-82.

Rankama T, Kankaanpää J. Eastern arrivals in post-glacial Lapland: the Sujala Site 10000 cal BP. Antiquity. 2008;88(318):884–99.

Read A, Donnai D. New Clinical Genetics. 2nd ed. Scion Publishing Limited, Banbury. 2011; 442 p.

Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, González JR, Gratacòs M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME. Global variation in copy number in the human genome. Nature. 2006 Nov 23;444(7118):444-54.

Reich D, Patterson N, De Jager PL, McDonald GJ, Waliszewska A, Tandon A, Lincoln RR, DeLoa C, Fruhan SA, Cabre P, Bera O, Semana G, Kelly MA, Francis DA, Ardlie K, Khan O, Cree BA, Hauser SL, Oksenberg JR, Hafler DA. A whole-genome admixture scan finds a candidate locus for multiple sclerosis susceptibility. Nat Genet. 2005 Oct;37(10):1113-8.

Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996 Sep 13;273(5281):1516-7.

Roesdahl E, Sørensen PM. Viking culture. In: Helle K (ed). The Cambridge history of Scandinavia. Volume I: Prehistory to 1520. Cambridge University Press, Cambridge. 2003; 121-46.

Rootsi S, Magri C, Kivisild T, Benuzzi G, Help H, Bermisheva M, Kutuev I, Barać L, Pericić M, Balanovsky O, Pshenichnov A, Dion D, Grobei M, Zhivotovsky LA, Battaglia V, Achilli A, Al-Zahery N, Parik J, King R, Cinnioğlu C, Khusnutdinova E, Rudan P, Balanovska E, Scheffrahn W, Simonescu M, Brehm A, Goncalves R, Rosa A, Moisan JP, Chaventre A, Ferak V, Füredi S, Oefner PJ, Shen P, Beckman L, Mikerezi I, Terzić R, Primorac D, Cambon-Thomsen A, Krumina A, Torroni A, Underhill PA, Santachiara-Benerecetti AS, Villems R, Semino O. Phylogeography of Y-chromosome haplogroup I reveals distinct domains of prehistoric gene flow in Europe. Am J Hum Genet. 2004 Jul;75(1):128-37.

Rootsi S, Zhivotovsky LA, Baldovic M, Kayser M, Kutuev IA, Khusainova R, Bermisheva MA, Gubina M, Fedorova SA, Ilumäe AM, Khusnutdinova EK, Voevoda MI, Osipova LP, Stoneking M, Lin AA, Ferak V, Parik J, Kivisild T, Underhill PA, Villems R. A counter-clockwise northern route of the Y-chromosome haplogroup N from Southeast Asia towards Europe. Eur J Hum Genet. 2007 Feb;15(2):204-11.

Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW. Genetic structure of human populations. Science. 2002 Dec 20;298(5602):2381-5.

Rosenblum EB, Novembre J. Ascertainment bias in spatially structured populations: a case study in the eastern fence lizard. J Hered. 2007 Jul-Aug;98(4):331-6.

Rosengren A, Stegmayr B, Johansson I, Huhtasaari F, Wilhelmsen L. Coronary risk factors, diet and vitamins as possible explanatory factors of the Swedish north-south gradient in coronary disease: a comparison between two MONICA centres. J Intern Med. 1999 Dec;246(6):577-86.

Sajantila A, Lahermo P, Anttinen T, Lukka M, Sistonen P, Savontaus ML, Aula P, Beckman L, Tranebjaerg L, Gedde-Dahl T, Issel-Tarver L, DiRienzo A, Pääbo S. Genes and languages in Europe: an analysis of mitochondrial lineages. Genome Res. 1995 Aug;5(1):42-52.

Sajantila A, Pääbo S. Language replacement in Scandinavia. Nat Genet. 1995 Dec;11(4):359-60.

Sajantila A, Salem AH, Savolainen P, Bauer K, Gierig C, Pääbo S. Paternal and maternal DNA lineages reveal a bottleneck in the founding of the Finnish population. Proc Natl Acad Sci U S A. 1996 Oct 15;93(21):12035-9.

Salomaa V, Miettinen H, Kuulasmaa K, Niemelä M, Ketonen M, Vuorenmaa T, Lehto S, Palomäki P, Mähönen M, Immonen-Räihä P, Arstila M, Kaarsalo E, Mustaniemi H, Torppa J, Tuomilehto J, Puska P, Pyörälä K. Decline of coronary heart disease mortality in Finland during 1983 to 1992: roles of incidence, recurrence, and case-fatality. The FINMONICA MI Register Study. Circulation. 1996 Dec 15;94(12):3130-7.

Salomaa V, Rosamond W, Mähönen M. Decreasing mortality from acute myocardial infarction: effect of incidence and prognosis. J Cardiovasc Risk. 1999 Apr;6(2):69-75.

Salomaa V, Ketonen M, Koukkunen H, Immonen-Räihä P, Jerkkola T, Kärjä-Koskenkari P, Mähönen M, Niemelä M, Kuulasmaa K, Palomäki P, Mustonen J, Arstila M, Vuorenmaa T, Lehtonen A, Lehto S, Miettinen H, Torppa J, Tuomilehto J, Kesäniemi YA, Pyörälä K. Decline in out-of-hospital coronary heart disease deaths has contributed the main part to the overall decline in coronary heart disease mortality rates among persons 35 to 64 years of age in Finland: the FINAMI study. Circulation. 2003 Aug 12;108(6):691-6.

Salonen JT, Puska P, Mustaniemi H. Changes in morbidity and mortality during comprehensive community programme to control cardiovascular diseases during 1972-7 in North Karelia. Br Med J. 1979 Nov 10;2(6199):1178-83.

Samani NJ, Erdmann J, Hall AS, Hengstenberg C, Mangino M, Mayer B, Dixon RJ, Meitinger T, Braund P, Wichmann HE, Barrett JH, König IR, Stevens SE, Szymczak S, Tregouet DA, Iles MM, Pahlke F, Pollard H, Lieb W, Cambien F, Fischer M, Ouwehand W, Blankenberg S, Balmforth AJ, Baessler A, Ball SG, Strom TM, Braenne I, Gieger C, Deloukas P, Tobin MD, Ziegler A, Thompson JR, Schunkert H; WTCCC and the Cardiogenics Consortium. Genomewide association analysis of coronary artery disease. N Engl J Med. 2007 Aug 2;357(5):443-53.

Sammallahti P. Saamelaisten esihistoriallinen tausta kielitieteen valossa. In: Gallén J (ed). Suomen väestön esihistorialliset juuret. Bidrag till kännedom av Finlands natur och folk 131. Finska Vetenskaps-Societeten, Helsinki. 1984; 137-56.

Samuelsson J. Vasatiden 1521-1611. In: Larsson HA (ed). Boken om Sveriges historia. Forum, Stockholm. 1999; 101-32.

Sans S, Kesteloot H, Kromhout D. The burden of cardiovascular diseases mortality in Europe. Task Force of the European Society of Cardiology on Cardiovascular Mortality and Morbidity Statistics in Europe. Eur Heart J. 1997 Dec;18(12):1231-48.

Sarantaus L, Huusko P, Eerola H, Launonen V, Vehmanen P, Rapakko K, Gillanders E, Syrjäkoski K, Kainu T, Vahteristo P, Krahe R, Pääkkönen K, Hartikainen J, Blomqvist C, Löppönen T, Holli K, Ryynänen M, Bützow R, Borg A, Wasteson Arver B, Holmberg E, Mannermaa A, Kere J, Kallioniemi OP, Winqvist R, Nevanlinna H. Multiple founder effects and geographical clustering of BRCA1 and BRCA2 families in Finland. Eur J Hum Genet. 2000 Oct;8(10):757-63.

Sarti C, Vartiainen E, Torppa J, Tuomilehto J, Puska P. Trends in cerebrovascular mortality and in its risk factors in Finland during the last 20 years. Health Rep. 1994;6(1):196-206.

Sawyer B, Sawyer P. Scandinavia enters Christian Europe. In: Helle K (ed). The Cambridge history of Scandinavia. Volume I: Prehistory to 1520. Cambridge University Press, Cambridge. 2003; 147-59.

Sawyer P. The Viking expansion. In: Helle K (ed). The Cambridge history of Scandinavia. Volume I: Prehistory to 1520. Cambridge University Press, Cambridge. 2003; 105-20.

Schulz HP. The Susiluola Cave site in Western Finland - Evidence of the Northernmost Middle Palaeolithic Settlement in Europe. In: Burdukiewicz JM, Wisniewski A (eds). Middle Palaeolithic Human Activity and Palaeoecology: New Discoveries and Ideas. Studia Archeologiczne. 2010; 41:47–67.

Schunkert H, König IR, Kathiresan S, Reilly MP, Assimes TL, Holm H, Preuss M, Stewart AF, Barbalic M, Gieger C, Absher D, Aherrahrou Z, Allayee H, Altshuler D, Anand SS, Andersen K, Anderson JL, Ardissino D, Ball SG, Balmforth AJ, Barnes TA, Becker DM, Becker LC, Berger K, Bis JC, Boekholdt SM, Boerwinkle E, Braund PS, Brown MJ, Burnett MS, Buysschaert I; Cardiogenics, Carlquist JF, Chen L, Cichon S, Codd V, Davies RW, Dedoussis G, Dehghan A, Demissie S, Devaney JM, Diemert P, Do R, Doering A, Eifert S, Mokhtari NE, Ellis SG, Elosua R, Engert JC, Epstein SE, de Faire U, Fischer M, Folsom AR, Freyer J, Gigante B, Girelli D, Gretarsdottir S, Gudnason V, Gulcher JR, Halperin E, Hammond N, Hazen SL, Hofman A, Horne BD, Illig T, Iribarren C, Jones GT, Jukema JW, Kaiser MA, Kaplan LM, Kastelein JJ, Khaw KT, Knowles JW, Kolovou G, Kong A, Laaksonen R, Lambrechts D, Leander K, Lettre G, Li M, Lieb W, Loley C, Lotery AJ, Mannucci PM, Maouche S, Martinelli N, McKeown PP, Meisinger C, Meitinger T, Melander O, Merlini PA, Mooser V, Morgan T, Mühleisen TW, Muhlestein JB, Münzel T, Musunuru K, Nahrstaedt J, Nelson CP, Nöthen MM, Olivieri O, Patel RS, Patterson CC, Peters A, Peyvandi F, Qu L, Quyyumi AA, Rader DJ, Rallidis LS, Rice C, Rosendaal FR, Rubin D, Salomaa V, Sampietro ML, Sandhu MS, Schadt E, Schäfer A, Schillert A, Schreiber S, Schrezenmeir J, Schwartz SM, Siscovick DS, Sivananthan M, Sivapalaratnam S, Smith A, Smith TB, Snoep JD, Soranzo N, Spertus JA, Stark K, Stirrups K, Stoll M, Tang WH, Tennstedt S, Thorgeirsson G, Thorleifsson G, Tomaszewski M, Uitterlinden AG, van Rij AM, Voight BF, Wareham NJ, Wells GA, Wichmann HE, Wild PS, Willenborg C, Witteman JC, Wright BJ, Ye S, Zeller T, Ziegler A, Cambien F, Goodall AH, Cupples LA, Quertermous T, März W, Hengstenberg C, Blankenberg S, Ouwehand WH, Hall AS, Deloukas P, Thompson JR, Stefansson K, Roberts R, Thorsteinsdottir U, O’Donnell CJ, McPherson R, Erdmann J; CARDIoGRAM Consortium, Samani NJ. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nat Genet. 2011 Mar 6;43(4):333-8.

Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M. Large-scale copy number polymorphism in the human genome. Science. 2004 Jul 23;305(5683):525-8.

Seldin MF, Shigeta R, Villoslada P, Selmi C, Tuomilehto J, Silva G, Belmont JW, Klareskog L, Gregersen PK. European population substructure: clustering of northern and southern populations. PLoS Genet. 2006 Sep 15;2(9):e143.

Service S, DeYoung J, Karayiorgou M, Roos JL, Pretorious H, Bedoya G, Ospina J, Ruiz-Linares A, Macedo A, Palha JA, Heutink P, Aulchenko Y, Oostra B, van Duijn C, Jarvelin MR, Varilo T, Peddle L, Rahman P, Piras G, Monne M, Murray S, Galver L, Peltonen L, Sabatti C, Collins A, Freimer N. Magnitude and distribution of linkage disequilibrium in population isolates and implications for genome-wide association studies. Nat Genet. 2006 May;38(5):556-60.

Service S; International Collaborative Group on Isolated Populations, Sabatti C, Freimer N. Tag SNPs chosen from HapMap perform well in several population isolates. Genet Epidemiol. 2007 Apr;31(3):189-94.

Shaper AG, Elford J. Place of birth and adult cardiovascular disease: the British Regional Heart Study. Acta Paediatr Scand Suppl. 1991;373:73-81.

Siiriäinen A. The Stone and Bronze Ages. In: Helle K (ed). The Cambridge history of Scandinavia. Volume I: Prehistory to 1520. Cambridge University Press, Cambridge. 2003a; 43-59.

Siiriäinen A. Esihistoria. In: Zetterberg S (ed). Suomen historian pikkujättiläinen. WSOY, Porvoo. 2003b; 14-43.

Similä M, Taskinen O, Männistö S, Lahti-Koski M, Karvonen M, Laatikainen T, Valsta L. Terveyttä edistävä ruokavalio, lihavuus ja seerumin kolesteroli karttoina. Kansanterveyslaitoksen julkaisuja B20/2005. Kansanterveyslaitos, Helsinki. 2005; 36 p.

Sirén MK, Sareneva H, Lokki ML, Koskimies S. Unique HLA antigen frequencies in the Finnish population. Tissue Antigens. 1996 Dec;48(6):703-7.

Sistonen P, Virtaranta-Knowles K, Denisova R, Kucinskas V, Ambrasiene D, Beckman L. The LWb blood group as a marker of prehistoric Baltic migrations and admixture. Hum Hered. 1999 Jun;49(3):154-8.

Skoglund P, Malmström H, Raghavan M, Storå J, Hall P, Willerslev E, Gilbert MT, Götherström A, Jakobsson M. Origins and genetic legacy of Neolithic farmers and hunter-gatherers in Europe. Science. 2012 Apr 27;336(6080):466-9.

Slatkin M. Linkage disequilibrium in growing and stable populations. Genetics. 1994 May;137(1):331-6.

Smith MW, Patterson N, Lautenberger JA, Truelove AL, McDonald GJ, Waliszewska A, Kessing BD, Malasky MJ, Scafe C, Le E, De Jager PL, Mignault AA, Yi Z, De The G, Essex M, Sankale JL, Moore JH, Poku K, Phair JP, Goedert JJ, Vlahov D, Williams SM, Tishkoff SA, Winkler CA, De La Vega FM, Woodage T, Sninsky JJ, Hafler DA, Altshuler D, Gilbert DA, O’Brien SJ, Reich D. A high-density admixture map for disease gene discovery in African Americans. Am J Hum Genet. 2004 May;74(5):1001-13.

Smith MW, O’Brien SJ. Mapping by admixture linkage disequilibrium: advances, limitations and guidelines. Nat Rev Genet. 2005 Aug;6(8):623-32.

Stadin K. Stormaktstiden 1611-1718. In: Larsson HA (ed). Boken om Sveriges historia. Forum, Stockholm. 1999; 133-80.

Statistics Finland. http://www.stat.fi/ Accessed in April 2012.

Statistics Sweden. http://www.scb.se/ Accessed in May 2012.

Stephens JC, Briscoe D, O’Brien SJ. Mapping by admixture linkage disequilibrium in human populations: limits and guidelines. Am J Hum Genet. 1994 Oct;55(4):809-24.

Strachan T, Read A. Human Molecular Genetics. 4th ed. Garland Science, Taylor & Francis Group, New York. 2011; 781 p.

Sundell T, Heger M, Kammonen J, Onkamo P. Modelling a neolithic population bottleneck in Finland: a genetic simulation. Fennoscandia Archaeologica. 2010;27:3-19.

Takala H. The Ristola Site in Lahti and the Earliest Postglacial Settlement of South Finland. Lahti City Museum, Lahti. 2004; 205 p.

Tallavaara M, Pesonen P, Oinonen M. Prehistoric population history in eastern Fennoscandia. J Archaeological Sci. 2010 Feb;37(2):251-60.

Tambets K, Rootsi S, Kivisild T, Help H, Serk P, Loogväli EL, Tolk HV, Reidla M, Metspalu E, Pliss L, Balanovsky O, Pshenichnov A, Balanovska E, Gubina M, Zhadanov S, Osipova L, Damba L, Voevoda M, Kutuev I, Bermisheva M, Khusnutdinova E, Gusar V, Grechanina E, Parik J, Pennarun E, Richard C, Chaventre A, Moisan JP, Barác L, Pericić M, Rudan P, Terzić R, Mikerezi I, Krumina A, Baumanis V, Koziel S, Rickards O, De Stefano GF, Anagnou N, Pappa KI, Michalodimitrakis E, Ferák V, Füredi S, Komel R, Beckman L, Villems R. The western and eastern roots of the Saami--the story of genetic “outliers” told by mitochondrial DNA and Y chromosomes. Am J Hum Genet. 2004 Apr;74(4):661-82.

Teo SM, Ku CS, Naidoo N, Hall P, Chia KS, Salim A, Pawitan Y. A population-based study of copy number variants and regions of homozygosity in healthy Swedish individuals. J Hum Genet. 2011 Jul;56(7):524-33.

Terwilliger JD, Zöllner S, Laan M, Pääbo S. Mapping genes through the use of linkage disequilibrium generated by genetic drift: ‘drift mapping’ in small populations with no demographic expansion. Hum Hered. 1998 May-Jun;48(3):138-54.

Tian C, Hinds DA, Shigeta R, Adler SG, Lee A, Pahl MV, Silva G, Belmont JW, Hanson RL, Knowler WC, Gregersen PK, Ballinger DG, Seldin MF. A genomewide single-nucleotide-polymorphism panel for Mexican American admixture mapping. Am J Hum Genet. 2007 Jun;80(6):1014-23.

Tian C, Kosoy R, Nassir R, Lee A, Villoslada P, Klareskog L, Hammarström L, Garchon HJ, Pulver AE, Ransom M, Gregersen PK, Seldin MF. European population genetic substructure: further definition of ancestry informative markers for distinguishing among diverse European ethnic groups. Mol Med. 2009 Nov-Dec;15(11-12):371-83.

Tienari PJ, Sumelahti ML, Rantamäki T, Wikström J. Multiple sclerosis in western Finland: evidence for a founder effect. Clin Neurol Neurosurg. 2004 Jun;106(3):175-9.

Tillmar AO, Coble MD, Wallerström T, Holmlund G. Homogeneity in mitochondrial DNA control region sequences in Swedish subpopulations. Int J Legal Med. 2010 Mar;124(2):91-8.

Tillmar AO. Population genetic analysis of 12 X-STRs in Swedish population. Forensic Sci Int Genet. 2012 Mar;6(2):e80-1.

Toiviainen P, Eerola T. Musiikkitietokannat kansanmusiikin tutkimuksessa. Musiikin suunta. 2005;2:5-11.

Torroni A, Huoponen K, Francalacci P, Petrozzi M, Morelli L, Scozzari R, Obinu D, Savontaus ML, Wallace DC. Classification of European mtDNAs from an analysis of three European populations. Genetics. 1996 Dec;144(4):1835-50.

Torroni A, Achilli A, Macaulay V, Richards M, Bandelt HJ. Harvesting the fruit of the human mtDNA tree. Trends Genet. 2006 Jun;22(6):339-45.

Trégouët DA, König IR, Erdmann J, Munteanu A, Braund PS, Hall AS, Grosshennig A, Linsel-Nitschke P, Perret C, DeSuremain M, Meitinger T, Wright BJ, Preuss M, Balmforth AJ, Ball SG, Meisinger C, Germain C, Evans A, Arveiler D, Luc G, Ruidavets JB, Morrison C, van der Harst P, Schreiber S, Neureuther K, Schäfer A, Bugert P, El Mokhtari NE, Schrezenmeir J, Stark K, Rubin D, Wichmann HE, Hengstenberg C, Ouwehand W; Wellcome Trust Case Control Consortium; Cardiogenics Consortium, Ziegler A, Tiret L, Thompson JR, Cambien F, Schunkert H, Samani NJ. Genome-wide haplotype association study identifies the SLC22A3-LPAL2-LPA gene cluster as a risk locus for coronary artery disease. Nat Genet. 2009 Mar;41(3):283-5.

Tunstall-Pedoe H, Kuulasmaa K, Mähönen M, Tolonen H, Ruokokoski E, Amouyel P. Contribution of trends in survival and coronary-event rates to changes in coronary heart disease mortality: 10-year results from 37 WHO MONICA project populations. Monitoring trends and determinants in cardiovascular disease. Lancet. 1999 May 8;353(9164):1547-57.

Tunstall-Pedoe H, Vanuzzo D, Hobbs M, Mähönen M, Cepaitis Z, Kuulasmaa K, Keil U. Estimation of contribution of changes in coronary care to improving survival, event rates, and coronary heart disease mortality across the WHO MONICA Project populations. Lancet. 2000 Feb 26;355(9205):688-700.

Tuomilehto J, Arstila M, Kaarsalo E, Kankaanpää J, Ketonen M, Kuulasmaa K, Lehto S, Miettinen H, Mustaniemi H, Palomäki P, et al. Acute myocardial infarction (AMI) in Finland--baseline data from the FINMONICA AMI register in 1983-1985. Eur Heart J. 1992 May;13(5):577-87.

Turakulov R, Easteal S. Number of SNPS loci needed to detect population structure. Hum Hered. 2003;55(1):37-45.

Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, Hayden H, Albertson D, Pinkel D, Olson MV, Eichler EE. Fine-scale structural variation of the human genome. Nat Genet. 2005 Jul;37(7):727-32.

Tyynelä P, Goebeler S, Ilveskoski E, Mikkelsson J, Perola M, Löytönen M, Karhunen PJ. Birthplace predicts risk for prehospital sudden cardiac death in middle-aged men who migrated to metropolitan area: The Helsinki Sudden Death Study. Ann Med. 2009;41(1):57-65.

Tyynelä P, Goebeler S, Ilveskoski E, Mikkelsson J, Perola M, Löytönen M, Karhunen PJ. Birthplace in area with high coronary heart disease mortality predicts the severity of coronary atherosclerosis among middle-aged Finnish men who had migrated to capital area: the Helsinki sudden death study. Ann Med. 2010 May 6;42(4):286-95.

Uimari P, Kontkanen O, Visscher PM, Pirskanen M, Fuentes R, Salonen JT. Genome-wide linkage disequilibrium from 100,000 SNPs in the East Finland founder population. Twin Res Hum Genet. 2005 Jun;8(3):185-97.

Vahtola J. Keskiaika. In: Zetterberg S (ed). Suomen historian pikkujättiläinen. WSOY, Porvoo. 2003a; 44-137.

Vahtola J. Population and settlement. In: Helle K (ed). The Cambridge history of Scandinavia. Volume I: Prehistory to 1520. Cambridge University Press, Cambridge. 2003b; 559-80.

Varilo T, Savukoski M, Norio R, Santavuori P, Peltonen L, Järvelä I. The age of human mutation: genealogical and linkage disequilibrium analysis of the CLN5 mutation in the Finnish population. Am J Hum Genet. 1996a Mar;58(3):506-12.

Varilo T, Nikali K, Suomalainen A, Lönnqvist T, Peltonen L. Tracing an ancestral mutation: genealogical and haplotype analysis of the infantile onset spinocerebellar ataxia locus. Genome Res. 1996b Sep;6(9):870-5.

Varilo T. The age of the mutations in the Finnish disease heritage; a genealogical and linkage disequilibrium study. PhD thesis. Publications of the National Public Health Institute, KTL A21/1999. National Public Health Institute, Helsinki. 1999; 98 p.

Varilo T, Paunio T, Parker A, Perola M, Meyer J, Terwilliger JD, Peltonen L. The interval of linkage disequilibrium (LD) detected with microsatellite and SNP markers in chromosomes of Finnish populations with different histories. Hum Mol Genet. 2003 Jan 1;12(1):51-9.

Varilo T, Peltonen L. Isolates and their potential use in complex gene mapping efforts. Curr Opin Genet Dev. 2004 Jun;14(3):316-23.

Vartiainen E, Puska P, Pekkanen J, Tuomilehto J, Jousilahti P. Changes in risk factors explain changes in mortality from ischaemic heart disease in Finland. BMJ. 1994 Jul 2;309(6946):23-7.

Vartiainen E, Laatikainen T, Peltonen M, Juolevi A, Männistö S, Sundvall J, Jousilahti P, Salomaa V, Valsta L, Puska P. Thirty-five-year trends in cardiovascular risk factors in Finland. Int J Epidemiol. 2010 Apr;39(2):504-18.

Venken T, Claes S, Sluijs S, Paterson AD, van Duijn C, Adolfsson R, Del-Favero J, Van Broeckhoven C. Genomewide scan for affective disorder susceptibility loci in families of a northern Swedish isolated population. Am J Hum Genet. 2005 Feb;76(2):237-48.

Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L, Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R, Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z, Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun J, Wang Z, Wang A, Wang X, Wang J, Wei M, Wides R, Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao Q, Zheng L, Zhong F, Zhong W, Zhu S, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter C, Cravchik A, Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I, Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport L, Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart B, Haynes J, Haynes C, Heiner C, Hladun S, Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline L, Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T, McMullen I, Moy M, Moy L, Murphy B, Nelson K, Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers YH, Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E, Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M, Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigó R, Campbell MJ, Sjolander KV, Karlak B, Kejariwal A, Mi H, Lazareva B, 
Hatton T, Narechania A, Diemer K, Muruganujan A, Guo N, Sato S, Bafna V, Istrail S, Lippert R, Schwartz R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J, Blick L, Caminha M, Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A, Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K, Glodek A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J, Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X, Lopez J, Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T, Nguyen N, Nodell M, Pan S, Peck J, Peterson M, Rowe W, Sanders R, Scott J, Simpson M, Smith T, Sprague A, Stockwell T, Turner R, Venter E, Wang M, Wen M, Wu D, Wu M, Xia A, Zandieh A, Zhu X. The sequence of the human genome. Science. 2001 Feb 16;291(5507):1304-51.

Vihavainen T. Hyvinvointi-Suomi. In: Zetterberg S (ed). Suomen historian pikkujättiläinen. WSOY, Porvoo. 2003; 832-915.

Vikor LS. The Nordic language area and the languages in the north of Europe. In: Bandle O, Braunmüller K, Jahr EH, Karker A, Naumann HP, Teleman U, Elmevik L, Widmark G (eds). The Nordic languages: an international handbook of the history of the North Germanic languages. Walter de Gruyter, Berlin. 2002; 1-12.

Vilkki J, Savontaus ML, Nikoskelainen EK. Human mitochondrial DNA types in Finland. Hum Genet. 1988 Dec;80(4):317-21.

Virtaranta-Knowles K, Sistonen P, Nevanlinna HR. A population genetic study in Finland: comparison of the Finnish- and Swedish-speaking populations. Hum Hered. 1991;41(4):248-64.

Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012 Jan 13;90(1):7-24.

Vuorela I. Viljelytoiminnan alku Suomessa paleoekologisen tutkimuksen kohteena. In: Fogelberg P (ed). Pohjan poluilla: suomalaisten juuret nykytutkimuksen mukaan. Bidrag till kännedom av Finlands natur och folk 153. Finska Vetenskaps-Societeten, Helsinki. 1999; 143-51.

Wakeley J, Nielsen R, Liu-Cordero SN, Ardlie K. The discovery of single-nucleotide polymorphisms--and inferences about human demographic history. Am J Hum Genet. 2001 Dec;69(6):1332-47.

Weber JL, May PE. Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am J Hum Genet. 1989 Mar;44(3):388-96.

Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007 Jun 7;447(7145):661-78.

Wiik K. Eurooppalaisten juuret. Atena, Jyväskylä. 2002; 503 p.

Wild PS, Zeller T, Schillert A, Szymczak S, Sinning CR, Deiseroth A, Schnabel RB, Lubos E, Keller T, Eleftheriadis MS, Bickel C, Rupprecht HJ, Wilde S, Rossmann H, Diemert P, Cupples LA, Perret C, Erdmann J, Stark K, Kleber ME, Epstein SE, Voight BF, Kuulasmaa K, Li M, Schäfer AS, Klopp N, Braund PS, Sager HB, Demissie S, Proust C, König IR, Wichmann HE, Reinhard W, Hoffmann MM, Virtamo J, Burnett MS, Siscovick D, Wiklund PG, Qu L, El Mokthari NE, Thompson JR, Peters A, Smith AV, Yon E, Baumert J, Hengstenberg C, März W, Amouyel P, Devaney J, Schwartz SM, Saarela O, Mehta NN, Rubin D, Silander K, Hall AS, Ferrieres J, Harris TB, Melander O, Kee F, Hakonarson H, Schrezenmeir J, Gudnason V, Elosua R, Arveiler D, Evans A, Rader DJ, Illig T, Schreiber S, Bis JC, Altshuler D, Kavousi M, Witteman JC, Uitterlinden AG, Hofman A, Folsom AR, Barbalic M, Boerwinkle E, Kathiresan S, Reilly MP, O’Donnell CJ, Samani NJ, Schunkert H, Cambien F, Lackner KJ, Tiret L, Salomaa V, Munzel T, Ziegler A, Blankenberg S. A genome-wide association study identifies LIPA as a susceptibility gene for coronary artery disease. Circ Cardiovasc Genet. 2011 Aug 1;4(4):403-12.

Wilhelmsen L, Rosengren A, Johansson S, Lappas G. Coronary heart disease attack rate, incidence and mortality 1975-1994 in Göteborg, Sweden. Eur Heart J. 1997 Apr;18(4):572-81.

Willer CJ, Scott LJ, Bonnycastle LL, Jackson AU, Chines P, Pruim R, Bark CW, Tsai YY, Pugh EW, Doheny KF, Kinnunen L, Mohlke KL, Valle TT, Bergman RN, Tuomilehto J, Collins FS, Boehnke M. Tag SNP selection for Finnish individuals based on the CEPH Utah HapMap database. Genet Epidemiol. 2006 Feb;30(2):180-90.

Willerslev E, Cooper A. Ancient DNA. Proc Biol Sci. 2005 Jan 7;272(1558):3-16.

Winkler CA, Nelson GW, Smith MW. Admixture mapping comes of age. Annu Rev Genomics Hum Genet. 2010 Sep 22;11:65-89.

Wray NR. Allele frequencies and the r2 measure of linkage disequilibrium: impact on design and interpretation of association studies. Twin Res Hum Genet. 2005 Apr;8(2):87-94.

Xu S, Jin L. A genome-wide analysis of admixture in Uyghurs and a high-density admixture map for disease-gene discovery. Am J Hum Genet. 2008 Sep;83(3):322-36.

Zerjal T, Dashnyam B, Pandya A, Kayser M, Roewer L, Santos FR, Schiefenhövel W, Fretwell N, Jobling MA, Harihara S, Shimizu K, Semjidmaa D, Sajantila A, Salo P, Crawford MH, Ginter EK, Evgrafov OV, Tyler-Smith C. Genetic relationships of Asians and Northern Europeans, revealed by Y-chromosomal DNA analysis. Am J Hum Genet. 1997 May;60(5):1174-83.

Zerjal T, Beckman L, Beckman G, Mikelsaar AV, Krumina A, Kucinskas V, Hurles ME, Tyler-Smith C. Geographical, linguistic, and cultural influences on genetic diversity: Y-chromosomal distribution in Northern European populations. Mol Biol Evol. 2001 Jun;18(6):1077-87.

Zhang B, Kirov S, Snoddy J. WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W741-8.

Zhu X, Luke A, Cooper RS, Quertermous T, Hanis C, Mosley T, Gu CC, Tang H, Rao DC, Risch N, Weder A. Admixture mapping for hypertension loci with genome-scan markers. Nat Genet. 2005 Feb;37(2):177-81.

Zhu X, Tang H, Risch N. Admixture mapping and the role of population structure for localizing disease genes. Adv Genet. 2008;60:547-69.
