
Graduate Theses and Dissertations
Iowa State University Capstones, Theses and Dissertations

2011

The prediction of single nucleotide polymorphisms and their utilization in mapping traits and determining population structure in production animals

Danielle Gorbach
Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd Part of the Animal Sciences Commons

Recommended Citation Gorbach, Danielle, "The prediction of single nucleotide polymorphisms and their utilization in mapping traits and determining population structure in production animals" (2011). Graduate Theses and Dissertations. 10336. https://lib.dr.iastate.edu/etd/10336

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

The prediction of single nucleotide polymorphisms and their utilization in mapping traits and determining population structure in production animals

by

Danielle Marie Gorbach

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Genetics

Program of Study Committee: Max F. Rothschild, Major Professor Jack Dekkers Rohan Fernando Dorian Garrick Daniel Nettleton

Iowa State University

Ames, Iowa

2011

Copyright © Danielle Marie Gorbach, 2011. All rights reserved.

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ABSTRACT

CHAPTER 1. GENERAL INTRODUCTION
  Introduction
  Thesis Organization
  Literature Review
  References

CHAPTER 2. SNP DISCOVERY IN LITOPENAEUS VANNAMEI WITH A NEW COMPUTATIONAL PIPELINE
  Summary
  Article
  Acknowledgements
  References

CHAPTER 3. MINING ESTS TO DETERMINE THE USEFULNESS OF SNPS ACROSS SHRIMP SPECIES
  Abstract
  Introduction
  Materials and Methods
  Results and Discussion
  Acknowledgements
  References

CHAPTER 4. USE OF SNP GENOTYPING TO DETERMINE PEDIGREE AND BREED COMPOSITION OF DAIRY CATTLE IN KENYA
  Summary
  Introduction
  Materials and Methods
  Results
  Discussion
  Acknowledgements
  References

CHAPTER 5. POLYDACTYL INHERITANCE IN THE PIG
  Abstract
  Introduction
  Materials and Methods
  Results
  Discussion
  Funding
  Acknowledgements
  References

CHAPTER 6. WHOLE-GENOME ASSOCIATION ANALYSES FOR RESIDUAL FEED INTAKE AND RELATED TRAITS IN SWINE
  Abstract
  Introduction
  Materials and Methods
  Results
  Discussion
  Acknowledgements
  References

CHAPTER 7. PREDICTING PLEIOTROPY FOR FEED INTAKE AND GROWTH TRAITS IN SWINE USING RELATIONSHIPS BETWEEN SNP WINDOWS’ GENOMIC ESTIMATED BREEDING VALUES
  Abstract
  Introduction
  Materials and Methods
  Results
  Discussion
  Acknowledgements
  References

CHAPTER 8. GENERAL CONCLUSIONS

APPENDIX A. A GENE-BASED SNP LINKAGE MAP FOR PACIFIC WHITE SHRIMP, LITOPENAEUS VANNAMEI
  Summary
  Introduction
  Materials and Methods
  Results and Discussion
  Acknowledgements
  References

APPENDIX B. THE PIG GENOME PROJECT HAS PLENTY TO SQUEAL ABOUT
  Abstract
  Introduction
  Linkage Mapping and Gene Mapping Progress
  QTL and Candidate Genes
  Databases and Online Resources
  Functional Genomics
  Sequencing Progress
  SNP Chip
  A Possible Future for Swine Genomics
  Conclusions
  Acknowledgements
  References

APPENDIX C. DEVELOPMENT AND APPLICATION OF HIGH-DENSITY SNP ARRAYS IN GENOMIC STUDIES OF DOMESTIC ANIMALS
  Abstract
  Introduction
  Animal Genome Sequencing and HapMap Projects
  Design and Development of High-Density SNP Arrays
  Applications of the SNP Array
  Future Prospects
  Acknowledgements
  References

APPENDIX D. LARGE-SCALE SNP ASSOCIATION ANALYSES OF RESIDUAL FEED INTAKE AND ITS COMPONENT TRAITS IN PIGS
  Introduction
  Materials and Methods
  Results and Discussion
  Conclusions
  Acknowledgements
  References

ACKNOWLEDGEMENTS

LIST OF FIGURES

CHAPTER 2. SNP DISCOVERY IN LITOPENAEUS VANNAMEI WITH A NEW COMPUTATIONAL PIPELINE

FIGURE 1. - A flow chart for clustering ESTs and using SNPidentifier to predict SNPs from the resulting contigs.

CHAPTER 3. MINING ESTS TO DETERMINE THE USEFULNESS OF SNPS ACROSS SHRIMP SPECIES

FIGURE 1. - Distribution of predicted SNPs describing usefulness for intra- or interspecies determination or both.

CHAPTER 4. USE OF SNP GENOTYPING TO DETERMINE PEDIGREE AND BREED COMPOSITION OF DAIRY CATTLE IN KENYA

FIGURE 1. - Results from the Structure software for clustering the Kenyan animals with the HapMap cattle.

FIGURE 2. - A comparison of the distribution of inbreeding coefficients in the different groups of cattle based on Ritland’s pair-wise kinship coefficient.

CHAPTER 5. POLYDACTYL INHERITANCE IN THE PIG

FIGURE 1. - Original pigs affected with preaxial polydactyly in a breeding population of Yorkshire swine.

FIGURE 2. - Variation of the preaxial polydactyl phenotype seen in Yorkshire pigs.

FIGURE 3. - Pedigree of Yorkshire animals where polydactyl animals existed.

FIGURE 4. - Pedigree structure of Yorkshire population where polydactyl pigs appeared when females were mated to males outside the breeding population to test inheritance.

CHAPTER 6. WHOLE-GENOME ASSOCIATION ANALYSES FOR RESIDUAL FEED INTAKE AND RELATED TRAITS IN SWINE

FIGURE 1. - Proportion of genetic variance explained by each window of five consecutive markers in the genome for RFI, ADFI, ADG, BF, and LMA.

CHAPTER 7. PREDICTING PLEIOTROPY FOR FEED INTAKE AND GROWTH TRAITS IN SWINE USING RELATIONSHIPS BETWEEN SNP WINDOWS’ GENOMIC ESTIMATED BREEDING VALUES

FIGURE 1. - Distribution of five-SNP window GEBV correlations between ADFI and BF.

APPENDIX A. A GENE-BASED SNP LINKAGE MAP FOR PACIFIC WHITE SHRIMP, LITOPENAEUS VANNAMEI

FIGURE 1. - Sex-averaged linkage maps for Pacific white shrimp.

FIGURE 2. - Levels of LD between SNP pairs located on the same linkage group.

APPENDIX C. DEVELOPMENT AND APPLICATION OF HIGH-DENSITY SNP ARRAYS IN GENOMIC STUDIES OF DOMESTIC ANIMALS

FIGURE 1. - The strategy of reduced representation library sequencing for SNP discovery.

FIGURE 2. - An illustration of the Infinium II assay that was developed for SNP genotyping.

FIGURE 3. - Estimation of number of individuals required in a reference population for genomic selection.

APPENDIX D. LARGE-SCALE SNP ASSOCIATION ANALYSES OF RESIDUAL FEED INTAKE AND ITS COMPONENT TRAITS IN PIGS

FIGURE 1. - Generations and parities of pigs in the ISU selection project with genotyped parities depicted.

FIGURE 2. - Proportion of genetic variance explained by each marker or window of 5 consecutive markers in the genome for each trait.

LIST OF TABLES

CHAPTER 2. SNP DISCOVERY IN LITOPENAEUS VANNAMEI WITH A NEW COMPUTATIONAL PIPELINE

TABLE 1. - Primers used and their success in amplifying and confirming SNPs.

TABLE S1. - Positions of predicted SNPs and the homology of the surrounding sequence to known proteins.

TABLE S2. - Contigs sequenced to confirm predicted SNPs and the additional SNPs found within them.

CHAPTER 5. POLYDACTYL INHERITANCE IN THE PIG

TABLE 1. - Phenotypic results of all matings to produce polydactyl offspring.

TABLE 2. - Evidence against WNT16 from segregation data from matings of pigs producing affected offspring.

TABLE 3. - LOD scores for intervals on chromosome 18.

CHAPTER 6. WHOLE-GENOME ASSOCIATION ANALYSES FOR RESIDUAL FEED INTAKE AND RELATED TRAITS IN SWINE

TABLE 1. - Significant SNP window results for feed intake and growth traits based on a Bayesian analysis.

TABLE 2. - Significant haplotype blocks created from the top 150 SNP windows based on the Bayesian analyses for each trait.

CHAPTER 7. PREDICTING PLEIOTROPY FOR FEED INTAKE AND GROWTH TRAITS IN SWINE USING RELATIONSHIPS BETWEEN SNP WINDOWS’ GENOMIC ESTIMATED BREEDING VALUES

TABLE 1. - Expected genetic correlations between traits and mean and median correlations between 5-SNP wGEBVs between traits.

TABLE 2. - Mean and median covariances between 5-SNP wGEBVs between traits.

TABLE 3. - Frequency table of covariances between 5-SNP wGEBVs between RFI and ADFI on a log scale.

TABLE 4. - Frequency table of covariances between 5-SNP wGEBVs between RFI and ADG on a log scale.

TABLE 5. - Frequency table of covariances between 5-SNP wGEBVs between RFI and BF on a log scale.

TABLE 6. - Frequency table of covariances between 5-SNP wGEBVs between ADFI and ADG on a log scale.

TABLE 7. - Frequency table of covariances between 5-SNP wGEBVs between ADFI and BF on a log scale.

TABLE 8. - Frequency table of covariances between 5-SNP wGEBVs between ADG and BF on a log scale.

TABLE 9. - Genes and genomic regions implicated in antagonistic covariances between traits.

APPENDIX A. A GENE-BASED SNP LINKAGE MAP FOR PACIFIC WHITE SHRIMP, LITOPENAEUS VANNAMEI

TABLE 1. - Properties of linkage groups for Pacific white shrimp.

APPENDIX B. THE PIG GENOME PROJECT HAS PLENTY TO SQUEAL ABOUT

TABLE 1. - The online resources and databases useful for the pig genomics community.

APPENDIX C. DEVELOPMENT AND APPLICATION OF HIGH-DENSITY SNP ARRAYS IN GENOMIC STUDIES OF DOMESTIC ANIMALS

TABLE 1. - A summary of the sequenced whole genomes of important domestic animals.

TABLE 2. - Illumina’s BeadChips developed for important domestic animals.

TABLE 3. - Genome-wide association studies with reported candidate genes in domestic animals.

ABSTRACT

Single nucleotide polymorphisms (SNPs) are single base changes in the genome that can differ among populations and among individuals within populations. These genetic markers are widely distributed throughout the genome and relatively easy to genotype on a large scale. Several steps are involved in utilizing SNPs in genomic studies. First, SNPs must be discovered and mapped within the genome. Then, populations can be characterized based on their frequencies, and pedigrees can be traced. Whole genome association studies (WGAS) can then be carried out to discover the genetic underpinnings of phenotypes, and the results can be used for such objectives as improving animal production systems. The primary objective of the research reported herein was to assess the usefulness of SNPs in determining the biological bases of important phenotypes in production animals.

The development of a new software package is described for SNP discovery work from expressed sequence tags in Litopenaeus vannamei and related shrimp species. This program predicted 504 SNPs in L. vannamei and had a validation rate of 44% among SNPs genotyped in a specific population. When this same program was used to predict inter- and intraspecies SNPs from nine shrimp species, 4,597 SNPs were predicted, but only 18 of them segregated in multiple species. Consequently, it was concluded that cross-species SNP prediction was not successful for SNP discovery in L. vannamei.

A population of dairy cattle in Kenya was characterized using SNP genotypes from the 50K cattle SNP chip. Just over 10% of the putative relationships were determined to be inaccurate, and heavy use of one AI bull indicated that inbreeding could be a problem if not more closely monitored. Finally, breed composition was predicted via comparisons to the HapMap population using the software program Structure. Most animals were found to be of Holstein or Guernsey descent, with little native African blood.

WGAS were carried out for several traits in pigs. The 60K porcine SNP chip was used to complete genome scans for polydactyly, residual feed intake (RFI), average daily feed intake (ADFI), average daily gain (ADG), backfat, and loin muscle area in Yorkshire swine. Polydactyly was suspected to be a single gene recessive trait with incomplete penetrance, but a causative mutation was not identified. A 25 Mb region on chromosome 8 was implicated based on a long stretch of homozygosity in affected animals. The other traits examined were known to be polygenic, and as such, many regions across the genome were found to be associated with each trait. Both previously known genes, such as MC4R for ADFI and ADG, and novel genes, such as CRAT for RFI, were implicated. Although validation is needed, these results represent potential candidates for marker-assisted selection or drug targets for feed additives to improve animal production.

In conclusion, SNPs are useful for investigating the genetic cause of animal phenotypes, but they do not describe all of the components that are working together to create phenotypes. As sequencing methods become faster and cheaper, the assessment of additional types of polymorphisms will undoubtedly reveal more information about the genetic bases of observed phenotypes. The overall process of data analysis, however, may remain similar for several years to come.

CHAPTER 1. GENERAL INTRODUCTION

INTRODUCTION

Technology development and usage have rapidly advanced in the agricultural sector in recent decades. In an attempt to improve profitability, producers have quickly adopted new technologies as they have become available. Animal production is no exception. One way to increase profits from animal breeding is to improve the production levels of the offspring each generation. Traditionally, farmers have selected their best animals to mate to produce the next generation, but determining which animals are the best can be a complicated process. Prior to molecular techniques becoming available, advances in statistical methods and phenotype collection methods drove much of the improvement in determining which animals to use as parents. As more knowledge has been gained about the underlying genetics causing differences in phenotypes between individuals, researchers have put more emphasis on trying to utilize this genetic information to improve livestock breeding.

Early genetic markers were limited in their genomic distribution and were not practical on a large scale due to the slow speed and high cost of genotyping. As genotyping technologies have advanced, different types of genetic markers have gained popularity. Currently, single nucleotide polymorphisms (SNPs) are the markers of choice for genotyping. These markers are widely distributed across the genome, making it easy to assay much of the genetic variation present on all chromosomes. Other large advantages of SNPs are the ease with which large numbers of SNPs can be genotyped at once and the low cost of genotyping.

Determining animal merit based on genetics involves many stages of research preceding implementation in the industry. First, genetic markers, such as SNPs, must be identified and validated. Next, one or more populations of animals need to be selected for marker analyses, genotyped for the genetic markers, and phenotyped for traits of interest. Quality control on the resulting data includes using genotypes to check pedigrees for each animal and considering genotype call rates. Finally, association analyses between genotypes and traits and follow-up studies on the significant results can be conducted. Once significant genetic markers have been validated in the population(s) of interest, breeders have more confidence that marker-assisted selection (MAS) with those markers can improve the selection process in the population(s). Information about the relationships between genotypes and phenotypes is not only useful for MAS but could also be useful in making management decisions about a herd based on the biology underlying important traits.

For some species many SNP loci have been discovered and validated in multiple populations, but for other species, such as Pacific white shrimp (Litopenaeus vannamei), not much genetic information is available. When SNP information is not known, researchers can either use sequencing to discover SNPs with no a priori knowledge or utilize existing sequence databases to predict SNPs in silico. The latter method can create cost savings during validation by limiting sequencing or genotyping to regions predicted to contain SNPs.

The quality control steps that follow genotyping include using the SNP markers to trace pedigrees, to confirm that genotyped animals appear to be correctly identified, and to determine allele frequencies and genotyping call rates. Other characterizations of the population(s) can be completed with this same data, such as predicting population history, calculating inbreeding coefficients, and evaluating linkage disequilibrium (LD) between loci. These types of analyses are of particular importance in populations, such as African cattle, where written records are incomplete or inaccurate.
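As an illustration only, the quality control quantities named above can be sketched in a few lines of Python. The genotype coding (0, 1, or 2 copies of one allele, with None for a failed call), the function names, and the use of opposing homozygotes as a simple parent-offspring check are assumptions made for this sketch; they are not the procedures used in the later chapters.

```python
from typing import List, Optional

# Assumed coding: number of copies of allele B (0, 1, 2); None marks a failed call.
Genotype = Optional[int]


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of genotypes that were successfully called."""
    if not genotypes:
        return 0.0
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def minor_allele_frequency(genotypes: List[Genotype]) -> float:
    """Frequency of the rarer allele among called genotypes at one SNP."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq_b = sum(called) / (2 * len(called))  # each animal carries two alleles
    return min(freq_b, 1.0 - freq_b)


def opposing_homozygotes(parent: List[Genotype], offspring: List[Genotype]) -> int:
    """Count loci where a putative parent and offspring are opposite homozygotes.

    A true parent-offspring pair shares at least one allele at every locus,
    so a high count (beyond genotyping error) casts doubt on the recorded pedigree.
    """
    conflicts = 0
    for p, o in zip(parent, offspring):
        if p is None or o is None:
            continue  # skip loci with a missing call in either animal
        if {p, o} == {0, 2}:
            conflicts += 1
    return conflicts


if __name__ == "__main__":
    one_snp = [0, 0, 1, None]           # four animals genotyped at one SNP
    print(call_rate(one_snp))           # 0.75
    print(minor_allele_frequency(one_snp))
    sire = [0, 2, 1, None]              # one animal genotyped at four SNPs
    calf = [2, 2, 0, 0]
    print(opposing_homozygotes(sire, calf))  # 1
```

In practice, thresholds for acceptable call rate, minor allele frequency, and number of parent-offspring conflicts are chosen per study.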

Once the quality control steps have been completed, association studies can be carried out to determine the relationships between phenotype and genotype. Use of thousands of SNPs spread evenly across the genome can give accurate and fairly precise information about the location of mutations associated with various phenotypes. When association study results are used in practice, it is important to consider not only the impact they will have on the trait of interest but also the correlated responses that can occur on other traits that the breeder wants to improve.

Genetic markers allow for the examination of the genetic merit of an individual without relying exclusively on phenotypes observed on the individual and his relatives. This deeper information may enable breeders to distinguish between full-siblings at a young age and when phenotype information is expensive or impossible to collect. In this way, producers use MAS or genomic selection, which predicts breeding values based on genotypes at markers spread throughout the genome, to improve the accuracy of estimated breeding values of selection candidates.
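The idea of predicting a breeding value from marker genotypes can be sketched, under simplifying assumptions, as a sum of allele counts weighted by estimated marker effects. The additive model, the function name, and the marker effects below are hypothetical values chosen for illustration; in a real analysis the effects would be estimated from a genotyped and phenotyped reference population.

```python
from typing import Sequence


def genomic_breeding_value(genotypes: Sequence[int],
                           marker_effects: Sequence[float]) -> float:
    """Sum of allele counts (0, 1, 2) weighted by per-marker additive effects."""
    if len(genotypes) != len(marker_effects):
        raise ValueError("genotypes and marker_effects must have the same length")
    return sum(g * a for g, a in zip(genotypes, marker_effects))


if __name__ == "__main__":
    # Hypothetical additive effects for three markers (trait units per allele copy).
    effects = [0.5, -0.2, 0.1]
    # Two full-siblings with no phenotypes of their own can still be ranked.
    sibling_a = [2, 0, 1]
    sibling_b = [0, 1, 2]
    print(genomic_breeding_value(sibling_a, effects))
    print(genomic_breeding_value(sibling_b, effects))
```

Here the two siblings receive different predicted values even though they have identical pedigree-based expectations, which is the distinction described above.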

Knowledge of the biological bases for phenotypes could be used to improve the environment of production animals in addition to their genetics. Altering feed composition to improve digestibility or changing culling decisions to better fit the animals’ biology are examples of ways that understanding genes and gene pathways driving phenotypes can be used to improve herd management decisions.

The work presented herein focused on determining the usefulness of SNPs to map traits and determine population origin in production animals. Topics range from advances in methods for SNP discovery to applications of SNPs in association analyses. The SNP prediction methods described utilize expressed sequence tag (EST) data for which no sequence chromatogram files are available, which is not commonly done. Usually the chromatogram files are used to determine base-calling accuracy, but these methods utilize other quality control measures in their absence. The software package developed is freely available to other researchers, and the SNPs found in Pacific white shrimp via this method have already been used in creating a linkage map and carrying out association studies. Large-scale SNP studies are described from a population of Kenyan dairy cattle, a family of pigs with polydactyly, and Yorkshire pig selection lines for residual feed intake. From a practical standpoint, the ultimate goals of these research projects were to give producers a better understanding of the genetics of these populations to improve management decisions and begin to find genetic markers for trait improvement.

THESIS ORGANIZATION

This thesis utilizes the alternative format and includes six manuscripts (chapters 2-7) which have been published or prepared for publication. All figures and tables are included after the references in the relevant chapter. The remainder of this chapter contains a review of the literature relevant to the topic of SNP usage to improve animal production. Chapter 8 provides overall conclusions from the thesis. The appendices contain additional published manuscripts which Danielle Gorbach has co-authored.

Chapter 2, entitled “SNP discovery in Litopenaeus vannamei with a new computational pipeline,” and chapter 3, entitled “Mining ESTs to determine the usefulness of SNPs across shrimp species,” cover computational methods for SNP discovery developed and implemented in a new software package primarily created by Danielle Gorbach. She also composed both manuscripts.

As previously described, the next step in SNP-based analyses after SNP discovery is genotyping animals and comparing the genetic information to recorded population history. Consequently, chapter 4, entitled “Use of SNP genotyping to determine pedigree and breed composition of dairy cattle in Kenya”, describes how SNP genotyping was used to detect pedigree errors, measure inbreeding levels, and predict breed composition on cattle in Kenya, where pedigree records were not expected to be very accurate. Danielle Gorbach completed the majority of the statistical analyses and preparation of the manuscript with some assistance from Linky Makgahlela and feedback from other co-authors.

Once population structure has been determined and adjusted for in statistical analyses, association studies can be carried out to determine which markers appear to impact traits of economic interest. Chapter 5, entitled “Polydactyl inheritance in the pig”, and chapter 6, entitled “Whole-genome association analyses for residual feed intake and related traits in swine”, describe studies where SNPs located throughout the swine genome were analyzed for their associations to different trait phenotypes. Danielle Gorbach was responsible for the whole-genome association study for the polydactyl phenotype and significant manuscript revisions for chapter 5 after Dr. Benny Mote composed the initial version. For chapter 6, she assisted with selecting animals for genotyping, completed much of the DNA isolation, completed all of the statistical analyses for associations between SNPs and traits, and prepared the manuscript.

Finally, the impacts that selection on SNP markers has on other traits need to be examined before markers can be utilized in a breeding program. As an extension of chapter 6, the same SNP genotypes were used to identify genomic regions that might have particularly favorable effects on multiple traits. Danielle Gorbach carried out the statistical analyses and manuscript preparation for chapter 7, entitled “Predicting pleiotropy for feed intake and growth traits in swine using relationships between SNP windows’ genomic estimated breeding values”.

LITERATURE REVIEW

Genetic Markers

For many decades biologists have tried to understand and interpret the genome. Initially, few tools were available for mapping traits to genomic locations, although researchers had ideas for how to complete such maps (Haldane 1948). Through the years multiple types of markers were developed that allowed researchers to approximate the location of sequences coding for various phenotypes. These markers can act as precursors for targeted sequencing to determine more precise locations, markers for predicting status (e.g. Kan and Dozy 1978), or could be selected on themselves to predict phenotypes in future generations (Soller 1978; Soller and Beckmann 1983).

Restriction fragment length polymorphisms (RFLPs) were one of the earlier genetic sequence variation marker types. When they first became popular, researchers digested DNA with restriction enzymes and then identified the resulting fragments of interest using Southern blot (Botstein et al. 1980). Later, RFLP methods were combined with the polymerase chain reaction (PCR) to target specific DNA sequences (Saiki et al. 1985), which simplified the process. First used for mapping in adenovirus in 1974 (Grodzicker et al. 1974), RFLPs were later useful for genetic mapping in other species, including humans, at either the single gene level (Kan and Dozy 1978) or genome-wide (Botstein et al. 1980). Beckmann and Soller (1983) proposed many uses of RFLPs in plants and livestock, including group identification, quantitative trait loci (QTL) mapping, and MAS. While RFLPs are widely distributed in the genome, they are difficult to assay on a large scale due to the requirement for PCR or Southern blotting for each assayed fragment and the use of specific restriction enzymes for each polymorphism.

Microsatellites, repeated short sequences found in tandem in the genome, were discovered in the early 1980s (Miesfeld et al. 1981; Spritz 1981) and found to be more useful as genetic markers than RFLPs (Fredholm et al. 1993). Widely distributed within mammalian genomes (representing 6-14 bp per kbp; Tóth et al. 2000), microsatellites were used to create linkage maps during the early 1990s for many species (e.g. Ellegren et al. 1994; Weissenbach et al. 1992). Microsatellites tend to be highly polymorphic, with many alleles per locus (Fredholm et al. 1993), which reduces the number of families and/or loci that need to be assayed to obtain the same amount of distinctive information as bi-allelic markers can provide (Vignal et al. 2002). Although microsatellites can be found throughout the genome, they are less abundant in the exons than in the non-coding regions of eukaryotes (Hancock 1995) and thus are not overly influential on many eukaryotic protein sequences. This limitation, as well as the higher abundance of SNPs (Taillon-Miller et al. 1998), the difficulty of genotyping many microsatellites across the genome simultaneously, the ease of statistical analysis of SNPs (Wakeley et al. 2001), and the greater mutation rate of microsatellites, led to the replacement of microsatellites with SNPs as the marker of choice in the late 1990s and early 2000s (International SNP Map Working Group 2001).
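One common way to make the information content comparison concrete is expected heterozygosity, one minus the sum of squared allele frequencies. This measure is used here only as an illustration and is an assumption of this sketch rather than a statistic taken from the cited studies; the allele frequencies are hypothetical.

```python
from typing import Sequence


def expected_heterozygosity(allele_freqs: Sequence[float]) -> float:
    """Probability that two alleles drawn at random from the population differ."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


if __name__ == "__main__":
    # A bi-allelic SNP is most informative at equal allele frequencies.
    print(expected_heterozygosity([0.5, 0.5]))
    # A hypothetical microsatellite with ten equally frequent alleles.
    print(expected_heterozygosity([0.1] * 10))
```

Under this measure a bi-allelic marker cannot exceed 0.5, whereas a marker with k equally frequent alleles reaches 1 - 1/k, which is why fewer multi-allelic loci are needed for the same amount of distinguishing information.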

A SNP is a single base pair difference between homologous sequences and is usually bi-allelic (Brookes 1999). SNPs were not first genotyped in the late 1990s, as they were one of the mutations that could be assayed with RFLPs, but the methods for discovering and genotyping SNPs became much easier and higher-throughput. With RFLPs, a restriction enzyme needed to be available that would differentially cut depending on the base present at the SNP position. Improved sequencing and genotyping technologies made it cheaper and faster to discover and genotype SNPs compared to older gel-based sequencing methods (Collins et al. 1997; Meldrum 2000; Shuber et al. 1997) and did not depend on whether the SNPs could be cut with a restriction enzyme. The advent of the human genome sequencing project and an initiative from the National Human Genome Research Institute (NHGRI; Marshall 1997) sparked the development of new SNP genotyping methods and the discovery of millions of SNPs across the human genome (International Human Genome Sequencing Consortium 2001; International SNP Map Working Group 2001).

Other species piggy-backed off the advances in SNP technology from the human genome projects. Although SNP discovery was much slower in production animal species than in humans, over one thousand SNPs had been discovered in pigs (Fahrenkrug et al. 2002) and chickens (Kim et al. 2002) by early 2002. Sequencing prices have dropped dramatically since the sequencing of the human genome, making it more affordable to do large-scale sequencing of other animals. Most of the major livestock species have used some form of reduced representation libraries (RRLs) to discover SNPs in different lines or breeds (e.g. chicken: International Chicken Polymorphism Map Consortium 2004; pig: Ramos et al. 2009; Wiedmann et al. 2008; cattle: Van Tassell et al. 2008). Genome sequencing efforts in production animals have led to the discovery of hundreds of thousands of additional SNPs (e.g. Bovine HapMap Consortium 2009; Wade et al. 2009).

The availability of tens of thousands to millions of SNPs per species has led to the development of SNP chips that are capable of assaying large numbers of SNPs per animal simultaneously. Illumina, Inc. (San Diego, CA) currently offers SNP chips that can assay approximately the following maximum numbers of SNPs for each species: 2.5 million in human; 700,000 in cattle; 60,000 in swine; and 50,000 in sheep (www.illumina.com). Their technology for these large BeadArrays™ works by hybridizing the neighboring genomic DNA to locus-specific primers, genotyping the SNP by adding a complementary labeled nucleotide to the end of the primer, and then detecting the label with immunohistochemistry (Steemers and Gunderson 2007). The ability to genotype so many SNPs at once has drastically reduced the cost per individual per genotype to less than half a cent for the entire process from DNA isolation to genotyping with a SNP chip.

While SNPs have phenomenal advantages over other genetic markers based on the scalability of genotyping, a few drawbacks still exist. Not all causative mutations are in the form of SNPs, so in order to find causative loci, other types of polymorphisms such as insertions, deletions or translocations must be considered. SNPs can also be genotyped incorrectly if insertions or deletions cause copy number changes to the surrounding sequence (Fredman et al. 2004). Methods have recently been developed to assess copy number variations (CNVs) based on SNP chip results (Komura et al. 2006), which can be used to identify such genotyping errors. If the copy number is not changed when a sequence moves within the genome (e.g. translocations), then SNP genotypes alone cannot distinguish these genomic differences, and they could negatively impact association studies due to altered LD. Microsatellites, however, have similar disadvantages in addition to being harder to genotype on a large scale. Thus, SNPs are the best genetic markers available today for studying genetic variation.


SNP Discovery Methods

While the genome sequencing methods briefly discussed in the last section can be useful for finding thousands to millions of SNPs, the sequencing itself can be expensive and predicted SNPs do not always prove to be segregating in the desired population. Depending on the sequence data already available, cheaper methods may be used for SNP prediction. In silico methods for prediction using pre-existing sequence data have been around for more than a decade, beginning with data from the human genome sequencing project (Taillon-Miller et al. 1998). Several computer packages are available for SNP prediction (e.g. Marth et al. 1999; Nickerson et al. 1997), but significant amounts of sequence data are available for SNP mining that cannot be mined by commonly used programs because the data sets do not contain the necessary information.

Recognizing the costs of whole-genome sequencing, researchers have developed several methods for SNP discovery that do not require the complete genome sequencing of multiple individuals but still require some sequencing expense. Screening sequence-tagged sites (STSs), which are short, unique regions within the genome, for SNPs (Kwok et al. 1996) was one of the earliest methods for discovering SNPs from fragments of the genome. A few years later, Altshuler et al. (2000) described a method called reduced representation shotgun sequencing for SNP discovery in humans. Reduced representation libraries (RRLs) have been widely implemented (e.g. Ramos et al. 2009; Van Tassell et al. 2008; Wiedmann et al. 2008) and use restriction enzymes to digest the genome, followed by the selection of only certain fragment sizes for sequencing. RRLs attempt to target similar regions of the genome in multiple individuals to make SNP prediction more likely with less sequence data (Altshuler et al. 2000). Although next-generation sequencing techniques have reduced the cost of sequencing, these methods are still expensive, especially in species without a sequenced genome to which reads can be aligned (Hyten et al. 2010b). Van Tassell et al. (2008) estimated the costs of SNP discovery in cattle to be approximately $0.48 per putative SNP. Validation rates for SNPs discovered with this RRL approach in well-characterized genomes range from 79% to 93% (Barbazuk et al. 2007; Hyten et al. 2010a; Ramos et al. 2009; Van Tassell et al. 2008). In less studied genomes with recent duplications, such as rainbow trout, SNP validation rates were much lower (48%; Castaño-Sánchez et al. 2009).
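
The digestion and size-selection steps of an RRL can be illustrated in silico. The following is a minimal sketch under stated assumptions, not any published pipeline: the function names, the example recognition site (AAGCTT, the HindIII site) and the size window are chosen only for illustration, and the cut is simplified to the start of each recognition site.

```python
import re


def digest(sequence, site):
    """Split `sequence` at every occurrence of the recognition `site`.

    Simplification (assumption): the cut is placed at the first base of the
    site, whereas real enzymes cut at a fixed offset within or near it.
    """
    # A lookahead finds overlapping occurrences without consuming them.
    cuts = [m.start() for m in re.finditer(f"(?={re.escape(site)})", sequence)]
    bounds = [0] + [c for c in cuts if c > 0] + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, min_len, max_len):
    """Keep only fragments whose length falls inside the chosen window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Hypothetical toy genome containing two recognition sites.
genome = "TTTT" + "AAGCTT" + "CCCC" + "AAGCTT" + "GG"
fragments = digest(genome, "AAGCTT")
library = size_select(fragments, 8, 10)
```

Because every individual is digested with the same enzyme and the same size window is kept, roughly the same subset of fragments is sequenced in each individual, which is what allows reads from different animals to overlap.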

Expressed sequence tags (ESTs), short sequences from cDNAs, which are reverse-transcribed copies of mRNAs (Adams et al. 1991), are a form of reduced representation genomic sequence. First used in the early 1990s to obtain gene sequences (Adams et al. 1991), ESTs have been used in many species to mine SNPs (e.g. Kim et al. 2003; Pavy et al. 2006; Sunyaev et al. 1999). Researchers favor ESTs for SNP discovery because the putative SNPs that result are known to be located in or near expressed genes. Several conditions affect the success of SNP discovery from ESTs, including the number of ESTs available, their gene coverage (Sunyaev et al. 1999; Wang et al. 2008), the quality of sequences (Pavy et al. 2006), the number of different individuals sequenced versus the depth of coverage for each individual, the number of SNPs shared between the sequenced population and the validation population (Grapes et al. 2006), and the amount of recent duplication in the genome (Castaño-Sánchez et al. 2009; Pavy et al. 2006).

Litopenaeus vannamei, the Pacific white shrimp, is one species for which a reasonable amount of EST data is available (http://www.marinegenomics.org; McKillen et al. 2005; O’Leary et al. 2006), but only a small fraction of the genome has been studied (Ciobanu et al. 2010; Robalino et al. 2007). L. vannamei is the most commonly farmed species of shrimp in the world (Ciobanu et al. 2010), making it interesting to geneticists who want to improve its disease resistance (e.g. Alcivar-Warren et al. 2007b; Dhar et al. 2007; Gross et al. 2001). Researchers have focused on discovering and mapping microsatellite markers in Pacific white shrimp (Alcivar-Warren et al. 2007a, b; Meehan et al. 2003), instead of the easier-to-use SNPs commonly found across the genome. Without other genome resources readily available, such as a physical map, mining the thousands of available ESTs for SNP markers is the best method to begin to identify SNPs across the genome that could be useful for MAS and tracing pedigrees.

One strategy that has had some success in discovering SNPs in species with limited genomic information is to mine SNPs using EST data from genes that are known to have a high incidence of SNPs in related species (Grapes et al. 2006). The successful use of SNP chips in related species, such as the use of the bovine chip for bison (Pertoldi et al. 2010) and the porcine chip for African warthogs, Babyrousa, and others (Megens et al. 2010), shows that SNPs and the primers used to assay them are sometimes both conserved across species. Consequently, it makes sense to use the ESTs available from other shrimp species to predict SNPs in L. vannamei if not enough are discovered from Pacific white shrimp ESTs alone.

The primary computer packages used for predicting SNPs from sequence data are Phred, Phrap, PolyPhred, Polybayes, and Consed, which are usually run in sequence. Phred is used for base-calling from sequence trace files (Ewing and Green 1998; Ewing et al. 1998). Phrap is used to align the resulting sequences and create contigs (P. Green, unpublished data). Polymorphisms in the data can then be predicted using PolyPhred (Nickerson et al. 1997) or Polybayes (Marth et al. 1999). Finally, Consed is used to visualize the data to confirm the results or edit the alignments (Gordon 2004; Gordon et al. 1998). Many studies have shown the success of this pipeline for predicting SNPs (e.g. Kim et al. 2003; Pavy et al. 2006; Sunyaev et al. 1999), but these programs rely on sequence trace files as input, and trace files are not available for the large number of public ESTs in shrimp species. Thus, another method needs to be developed to discover SNPs in these ESTs. Although it may sound easy to predict SNPs from sequence data, EST sequences are notorious for their low quality, so special algorithms are needed to distinguish sequence errors from true SNPs, especially in the absence of data describing base-calling accuracy.
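
One common way to separate likely sequencing errors from candidate SNPs when no base-quality data exist is to require that the minor allele be observed in more than one read. The sketch below illustrates that general idea only; it is not the pipeline developed in this dissertation, and the function name, input format (pre-aligned reads of equal length) and thresholds are assumptions.

```python
from collections import Counter


def candidate_snps(aligned_reads, min_depth=4, min_minor_count=2):
    """Return (position, major_allele, minor_allele) for each candidate SNP.

    `aligned_reads` are equal-length strings from one alignment column set;
    gaps ('-') and ambiguous bases ('N') are ignored. A column is reported
    only if at least `min_depth` reads cover it and the second most common
    base appears at least `min_minor_count` times, so that a single
    miscalled base cannot create a SNP call by itself.
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    snps = []
    for pos in range(length):
        column = (read[pos].upper() for read in aligned_reads)
        counts = Counter(base for base in column if base in "ACGT")
        if sum(counts.values()) < min_depth:
            continue
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[1][1] >= min_minor_count:
            snps.append((pos, ranked[0][0], ranked[1][0]))
    return snps


# Hypothetical alignment: position 1 is C/T in several reads, whereas the
# lone C at position 4 is treated as a probable sequencing error.
reads = ["ACGTA", "ACGTA", "ATGTA", "ATGTA", "ACGTC"]
calls = candidate_snps(reads)
```

Raising `min_minor_count` trades sensitivity for specificity: fewer errors are called as SNPs, but rare alleles sampled only once are missed.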

Population Characterization with SNPs

Once SNPs have been discovered, they can be used for many purposes. Before starting on more complex uses, such as whole-genome association studies (WGAS), it is best to examine the genetics of the study population to detect genotyping, pedigree, and/or individual identification errors. These types of errors, as well as unknown population stratification, can have negative impacts on later statistical tests (Tian et al. 2008). In addition to being useful for improving data quality, many of these data evaluations are informative for understanding the population history (e.g. Bradley et al. 1996; Giuffra et al. 2000; Larson et al. 2005), which does not always exist in the written record. Tracing pedigrees, predicting population origin, examining genetic differences between groups, calculating inbreeding levels, and evaluating LD via SNP data can all provide information about population history.

Researchers have been using genetic marker data to trace pedigrees for more than a decade (Herbinger et al. 1995; Perez-Enriquez et al. 1999; Soller and Beckmann 1983) because obtaining accurate pedigree information is important for genetics studies and breeding value estimation (Israel and Weller 2000). In populations where ancestry is difficult to track, such as pasture-breeding systems where multiple sires are housed with dams throughout the breeding season, genetic markers can easily be used to choose amongst a list of potential sires. These methods work well for determining who cannot have sired a given offspring, but cannot be used to prove maternity or paternity unless all possible parents have been genotyped (Visscher et al. 2002). Traditionally, pedigree tracing used blood groups and protein polymorphisms (Stormont 1967) or, preferably, microsatellites due to their greater allelic diversity (Glowatzki-Mullis et al. 1995). As SNPs became cheaper and easier to genotype, however, the usage of SNPs to determine ancestry increased (Heaton et al. 2002; Rohrer et al. 2007). Identifying genotypes that do not match between individuals that are recognized to be parent-offspring based on overall genotypes also informs decisions about which markers to eliminate from further analysis because of inaccurate genotyping calls (Weller et al. 2010). Finally, genotypes can be used to detect animal identification errors by comparing to known relatives, determining gender from SNP genotypes on the sex chromosomes, or comparing to earlier samples from the same individual. Misidentification can negatively impact research results or cause problems in tracing animals through the production system.
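
The exclusion logic described above can be sketched with biallelic SNP genotypes. This is a minimal illustration under assumptions, not a published parentage program: genotypes are coded as the number of copies of one allele (0, 1 or 2, with None for a missing call), and the function names and the error tolerance are hypothetical.

```python
def mendelian_conflicts(offspring, candidate):
    """Count loci where offspring and candidate parent are opposite homozygotes.

    A true parent must share at least one allele with its offspring, so a
    0 versus 2 pairing is impossible without a genotyping error or mutation.
    """
    conflicts = 0
    for child_gt, parent_gt in zip(offspring, candidate):
        if child_gt is None or parent_gt is None:
            continue  # missing calls carry no information
        if {child_gt, parent_gt} == {0, 2}:
            conflicts += 1
    return conflicts


def non_excluded_sires(offspring, sires, max_conflicts=0):
    """Return the names of candidate sires that cannot be ruled out.

    `max_conflicts` (assumed tolerance) allows for a few genotyping errors.
    """
    return [
        name
        for name, genotypes in sires.items()
        if mendelian_conflicts(offspring, genotypes) <= max_conflicts
    ]


# Hypothetical calf and two candidate sires genotyped at five SNPs.
calf = [0, 1, 2, 2, None]
sires = {"sire_A": [0, 2, 2, 1, 0], "sire_B": [2, 1, 0, 2, 1]}
remaining = non_excluded_sires(calf, sires)
```

A sire that survives exclusion is only "not ruled out"; as noted above, paternity is established only if every possible sire has been genotyped and all but one are excluded.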

Phylogenetics and related evolutionary methods essentially seek to trace pedigrees on a longer-term scale; they predict how groups diverged, classify individuals into groups, and describe the crosses that have occurred between groups. For more than a decade, researchers have been using genetic markers to try to understand the origins of different species and populations (e.g. Bradley et al. 1996; Giuffra et al. 2000; Larson et al. 2005). Traditionally, microsatellite markers have been used for diversity studies due to their many alleles per locus, but use of genome-wide SNPs may more accurately capture the total genetic architecture because SNPs have lower mutation rates. Several recent studies have used thousands of SNPs to trace animal origins and to predict breed structure (e.g. Bovine HapMap Consortium 2009; McKay et al. 2008). These results can be useful for understanding population history, determining the breed origin of meat products (Ramos et al. 2011), and accounting for population stratification in genetics studies (Sham et al. 2009).

Identifying allele frequency differences between groups, be they breeds, lines, races, etc., can lead to the discovery of genome regions that have been under different selective pressures in the groups (Hudson et al. 1987). While some of these differences are the result of random genetic drift, others mark genomic regions that were subject to natural or artificial selection and are key to understanding the genetics underlying phenotypic differences. Genotype frequencies that differ significantly from Hardy-Weinberg equilibrium within an unselected population can also indicate markers with genotyping problems (Xu et al. 2002).
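
A Hardy-Weinberg check of the kind mentioned above can be sketched as a one-degree-of-freedom chi-square test for a biallelic SNP. The function name is an assumption, and the chi-square approximation is used here only for simplicity; exact tests are often preferred when genotype counts are small.

```python
import math


def hwe_chi2_test(n_AA, n_AB, n_BB):
    """Return (chi-square statistic, p-value) for Hardy-Weinberg equilibrium.

    Expected counts are n*p^2, 2*n*p*q and n*q^2, where p is the observed
    frequency of allele A. With one degree of freedom the upper-tail
    probability of the chi-square distribution equals erfc(sqrt(x / 2)).
    """
    n = n_AA + n_AB + n_BB
    p = (2 * n_AA + n_AB) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0, 1.0  # monomorphic marker: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_AB, n_BB)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical markers: one in perfect equilibrium, one with no
# heterozygotes at all (a pattern that can arise from genotyping problems).
chi2_ok, p_ok = hwe_chi2_test(25, 50, 25)
chi2_bad, p_bad = hwe_chi2_test(50, 0, 50)
```

In practice a marker would be flagged only if its p-value fell below a threshold chosen with the number of markers tested in mind.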

SNPs can also be used to assess inbreeding levels by computing the probabilities that loci are identical-by-descent (IBD; Leutenegger et al. 2003). When pedigree records are incomplete or inaccurate, marker data can be used to predict whether a population is being well-managed with respect to minimizing inbreeding for the purpose of preventing inbreeding depression and loss of genetic variability. Inbreeding depression reduces performance in a population due to increased levels of deleterious homozygous recessive loci (Wright 1922). Inbreeding levels should also be kept low to maximize the level of genetic variation that can be selected on in the population and reduce the probability of deleterious mutations becoming fixed.

Finally, information about population history can also be obtained by considering LD

levels based on SNP genotypes. LD is the non-random association of alleles at two loci.

Historical effective population size can be estimated from data on LD levels (Hayes et al.

2003; Hill 1981). Levels of LD can also inform decisions about necessary marker densities

for genomics studies (Ardlie et al. 2002).
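
To make the link between LD and effective population size concrete, the sketch below computes the r² measure of LD from haplotype counts and then inverts the widely used approximation E[r²] ≈ 1/(1 + 4Nc), usually attributed to Sved (1971), where c is the recombination distance in Morgans. That approximation is an assumption introduced here for illustration (it ignores mutation and sample-size corrections) and is not necessarily the estimator used in the studies cited above; the function names are hypothetical.

```python
def r_squared(n_AB, n_Ab, n_aB, n_ab):
    """Squared correlation between alleles at two biallelic loci.

    Arguments are counts of the four haplotypes; D = p_AB - p_A * p_B.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = n_AB / total - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


def effective_size(mean_r2, c_morgans):
    """Solve E[r^2] = 1 / (1 + 4 * N * c) for N (assumed approximation)."""
    return (1.0 / mean_r2 - 1.0) / (4.0 * c_morgans)


complete_ld = r_squared(50, 0, 0, 50)    # alleles always travel together
no_ld = r_squared(25, 25, 25, 25)        # alleles associate at random
ne_estimate = effective_size(0.2, 0.01)  # mean r^2 of 0.2 at 1 cM spacing
```

Because LD between closely spaced markers reflects population size further back in time than LD between distant markers, repeating the calculation across distance classes gives a rough view of how population size has changed.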

Many of these uses of SNP markers can be applied to remote breeding populations where herd management practices could use improvement. The Kenyan dairy cattle population, for example, has recently started introducing semen from international Holstein bulls via artificial insemination, but whether this semen has been used effectively to improve population performance has not been studied. The African records for pedigree and

performance are also likely to be questionable (Rege et al. 2001), making genetic data more

reliable for assessing this Kenyan population.

Trait Mapping

For several decades, researchers have been trying to map trait phenotypes to genomic positions. Originally, genetic marker knowledge was very limited and expensive to obtain, so few loci were considered in each study. Two approaches that have been used extensively to map loci are QTL analyses based on linkage and the candidate gene approach based on statistical association, both of which can use SNPs for mapping. QTL studies use markers throughout the genome and look for shared segregation patterns between the trait of interest and positions at or intermediate to the markers (Lander and Botstein 1989). The candidate gene approach, on the other hand, involves computing statistical associations between traits and markers within or near genes that are likely to be involved in those traits based on biological function or position under a known QTL (Lusis 1988). Both methods have had many successes (e.g. Kim et al. 2000; Malek et al. 2001; Maller et al. 2006; Rothschild et al. 1996; Stefansson et al. 2002), but as more genetic markers have become available and genotyping prices have dropped, whole genome association studies (WGAS) have become more popular.

A WGAS or genome-wide association study (GWAS) is essentially a candidate gene study where most or all genes in the genome are considered candidates (Risch and Merikangas 1996). This method, which relies on LD between markers and causative mutations that affect the trait being studied, has advantages over linkage studies because it can identify common mutations with smaller effects, maps associations with higher precision, and does not require genotyping of family members for analysis (Hirschhorn and Daly 2005). Unlike linkage studies, however, WGAS can give misleading results if population stratification is not accounted for during the statistical analysis. Another advantage of WGAS is that, unlike functional candidate gene studies, they do not require functional information about genes, which is still not available for many genes (Hirschhorn and Daly 2005). High-density SNP chips make WGAS cheaper, easier, and faster than they were with older genotyping technologies, thus making them a common tool for studying quantitative traits.

Many associations have been successfully identified using WGAS in recent years, including associations in humans, cattle, pigs, and dogs. In humans, hundreds of polymorphisms have been found to be associated with disease, including SNPs associated with Crohn’s disease (Barrett et al. 2008; Duerr et al. 2006; Franke et al. 2007; Libioulle et al. 2007; Raelson et al. 2007; Rioux et al. 2007; Wellcome Trust Case Control Consortium 2007; Yamazaki et al. 2005) and systemic lupus erythematosus (International Consortium for Systemic Lupus Erythematosus Genetics (SLEGEN) 2008). WGAS have also discovered many SNP-trait associations in cattle, including several for growth traits (Snelling et al. 2010), residual feed intake, body weight and hip height in beef cattle (Bolormaa et al. 2011), as well as for gestation length (Maltecca et al. 2011), milk production and fertility in dairy cattle (Pryce et al. 2010). The 60K SNP chip has not been available as long in pigs, so fewer WGAS have been completed. A few studies have shown good results so far for conformation and leg structure traits (Fan et al. 2011), reproduction traits (Onteru et al. 2011a, b), and androstenone levels (Duijvesteijn et al. 2010) in pigs. Single gene traits such as white coat color in boxers and hair ridge in Rhodesian ridgebacks have also been successfully mapped with WGAS in dogs (Karlsson et al. 2007), although the MITF gene that was implicated in boxers had been previously shown to affect white spotting in other breeds of dogs via a candidate gene approach (Rothschild et al. 2006).

Despite all of the successes of WGAS, limitations to the technology still exist. Most researchers use single-marker association analyses to test each SNP independently and then adjust the significance threshold for multiple testing. Much of the genetic variation underlying traits has not been identified with this approach (Maher 2008). This problem may be attributed to many rare genetic variants which are not in high LD with common SNPs but which each independently cause or contribute to the same phenotype (Maher 2008), as in the case of cystic fibrosis, where over 1500 causative polymorphisms have been detected (Cantor et al. 2010). Causative genetic variants that are not in LD with SNPs, including some copy number variations (CNVs), could be responsible for some of the missing genetic variance that is not attributable to SNPs in current WGAS (Maher 2008), although common CNVs have been found to be in LD with SNPs, which provides evidence against this hypothesis (Conrad et al. 2010). Current studies may also not be powerful enough to detect the many tiny effects that together account for the full genetic variance of a trait (Maher 2008); this hypothesis fits well with the infinitesimal model (Fisher 1930) that quantitative animal geneticists have used for decades.
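
The single-marker strategy described above, testing each SNP independently and then tightening the significance threshold, can be sketched as follows. For brevity the example uses a binary (case-control) trait with an allelic chi-square test and a Bonferroni correction; both choices, along with the function names and the counts, are assumptions made for illustration, and quantitative traits would normally be analysed with regression or mixed models instead.

```python
import math


def allelic_chi2_pvalue(case_a, case_b, ctrl_a, ctrl_b):
    """P-value of a 2x2 chi-square test on allele counts (1 degree of freedom)."""
    n = case_a + case_b + ctrl_a + ctrl_b
    denom = ((case_a + case_b) * (ctrl_a + ctrl_b)
             * (case_a + ctrl_a) * (case_b + ctrl_b))
    if denom == 0:
        return 1.0  # an empty row or column: the test is uninformative
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / denom
    # Upper tail of a chi-square distribution with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def bonferroni_hits(pvalues, alpha=0.05):
    """Return SNP names whose p-value is below alpha / number of tests."""
    threshold = alpha / len(pvalues)
    return [snp for snp, p in pvalues.items() if p < threshold]


# Hypothetical allele counts for two SNPs and made-up p-values for three.
p_null = allelic_chi2_pvalue(50, 50, 50, 50)
p_assoc = allelic_chi2_pvalue(90, 10, 10, 90)
hits = bonferroni_hits({"snp1": 1e-9, "snp2": 0.03, "snp3": 0.5})
```

With tens of thousands of SNPs the Bonferroni threshold becomes very strict, which is one reason small effects go undetected in studies of limited size.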

A newer method for WGAS that appears to account for more of the genetic variance of traits is the use of Bayesian genomic selection approaches (Fan et al. 2011; Onteru et al. 2011a, b). These methods were originally developed to predict marker effects for genomic selection (Meuwissen et al. 2001), and researchers have recently identified their potential for trait mapping (Heuven and Janss 2010; Sun et al. 2011; Veerkamp et al. 2010). These Bayesian methods fit SNPs across the genome simultaneously (Meuwissen et al. 2001), which gives the advantage of accounting for LD between markers to avoid overestimating the effect of each marker. These new genomic selection approaches hold much promise for improving WGAS results.

Use of SNP Markers in Industry

Many uses of SNP markers have been described above for discovering the genetic bases of phenotypes, but how does that knowledge translate into real-world benefit for production animal systems? The main use of this biological information is in improving selection programs through MAS (Soller 1978; Soller and Beckmann 1983) or, more recently, through genomic selection (Meuwissen et al. 2001). Improved breeding value estimation is not the only use for genetic information. Knowledge of the biology behind improving performance can also be used to drive herd management decisions and develop drugs or feed additives.

MAS seeks to add marker genotype information to an animal’s estimated breeding

value (EBV; Soller 1978). Fernando and Grossman (1989) demonstrated a method for

incorporating this marker information into the best linear unbiased prediction (BLUP)

methods that had been developed by C.R. Henderson for predicting breeding values based on

observed phenotypes. During the 1990s, big promises were made about the potential for

improvement from MAS, such as Meuwissen and Goddard’s (1996) claim that genetic gain

could be increased by 8-38% via MAS. Some genes of large effect have been used

successfully in selection programs, such as ESR (Rothschild et al. 1996) and MC4R (Kim et al. 2000), but in general MAS has fallen short of its promises (Dekkers 2004). Studies thus far have tended to identify markers for use in MAS with moderate to large effects, which

tend to be rare within the genome, and consequently, only a small fraction of the total genetic

variance has been accounted for by markers (Dekkers and Hospital 2002). Valid concerns

also arise that use of markers which improve one trait may have a detrimental effect on other

traits of economic importance unless studies are completed to determine the effects of each

potential marker on all traits of interest prior to implementation in MAS (Rothschild et al.

1996).

Certain types of traits are more likely to benefit from MAS than other types. These categories of traits include sex-limited traits (e.g. milk yield or cryptorchidism), traits measured late in life or after death (e.g. longevity or carcass yield), traits that are expensive to measure (e.g. feed efficiency), traits that producers do not want to measure on selection candidates (e.g. disease resistance) (Dekkers 2004), low-heritability traits (e.g. litter size) (Dekkers and Hospital 2002; Meuwissen and Goddard 1996), and traits caused by a single recessive mutation (e.g. chondrodysplasia in sheep). Examples of two traits in pigs that fit these parameters well but that are not well understood genetically, and are thus deserving of more research, are the polydactyl phenotype and feed efficiency.

Preaxial polydactyly, the growth of one or more extra digits on the inside of the leg,

has been described in pigs since 1931 (Curson 1931). Various forms of polydactyly have

been found in humans (e.g. Baala et al. 2007; Kang et al. 1997; Mykytyn et al. 2002), cats

(Danforth 1947), chickens (Huang et al. 2006), horses (Carstanjen et al. 2007), and mice

(both natural and mutagenesis-induced; Lettice et al. 2002). The defect is usually caused by a

single gene, making it easy to eliminate from the population once a causative mutation has

been identified. In pigs, this abnormality can lead to lameness and prevents registration of

Yorkshires in the National Swine Registry (National Swine Registry 2011), so breeders

would prefer to remove this genetic defect from their breeding herds.

Feed efficiency is a measure of how well feed intake is converted into growth and has been of growing concern to producers due to recent rises in feed prices. Several studies have looked at the genetic and physiological bases for feed efficiency in cattle, as measured by residual feed intake (RFI; Barendse et al. 2007; Bolormaa et al. 2011; Herd and Arthur 2009; Richardson and Herd 2004; Sherman et al. 2008). Much less work on the genetic bases of RFI has been completed in swine, although selection lines for RFI exist in both the United States (Cai et al. 2011) and France (Gilbert et al. 2007). RFI has been shown to be moderately heritable in pigs (h² estimates range from 0.14 to 0.40; Cai et al. 2008; Foster et al. 1983; Gilbert et al. 2007; Hoque et al. 2009; Johnson et al. 1999; Mrode and Kennedy 1993; Von Felde et al. 1996), indicating that genetic markers should be identifiable for improving this trait. The high cost of collecting feed intake data makes MAS a desirable approach for improving feed efficiency.
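
RFI is generally understood as the difference between an animal's observed feed intake and the intake expected for its level of production. The sketch below is a deliberately simplified illustration of that idea: expected intake is predicted from average daily gain alone by ordinary least squares, whereas published RFI models typically include further predictors such as body weight and backfat. The function name, the single-predictor model and the data are all assumptions.

```python
def residual_feed_intake(feed_intake, daily_gain):
    """Return each animal's residual from a regression of intake on gain.

    Negative values indicate animals that ate less than expected for their
    growth rate, that is, more feed-efficient animals.
    """
    n = len(feed_intake)
    mean_gain = sum(daily_gain) / n
    mean_intake = sum(feed_intake) / n
    sxx = sum((x - mean_gain) ** 2 for x in daily_gain)
    sxy = sum((x - mean_gain) * (y - mean_intake)
              for x, y in zip(daily_gain, feed_intake))
    slope = sxy / sxx
    intercept = mean_intake - slope * mean_gain
    return [y - (intercept + slope * x)
            for x, y in zip(daily_gain, feed_intake)]


# Hypothetical records for four pigs (kg/day).
gain = [0.8, 0.9, 1.0, 1.1]
intake = [2.0, 2.4, 2.3, 2.9]
rfi = residual_feed_intake(intake, gain)
most_efficient = min(range(len(rfi)), key=rfi.__getitem__)
```

Because the residuals are by construction uncorrelated with the predictors, selecting on RFI targets efficiency without directly selecting for slower growth.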

For quantitative traits, genomic selection has the potential to utilize SNPs in a more effective manner for breeding value estimation than MAS. First proposed by Meuwissen et al. (2001), genomic selection predicts effects of markers across the genome in a training population and uses genotypes of unphenotyped animals to predict their genomic estimated breeding values (GEBVs). This method has been implemented in American dairy cattle evaluations since January 2009 due to the improved reliabilities obtained when compared to parent average and limited progeny testing results (VanRaden et al. 2009). When a large training population is available, this method decreases the generation interval by increasing the accuracy of estimated breeding values at a young age, and thus is useful for significantly increasing the rate of trait improvement (Meuwissen et al. 2001).
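
The prediction step of genomic selection can be sketched as a sum of marker effects weighted by genotype. The example assumes that marker effects have already been estimated in a training population (the estimation itself, for example with the Bayesian methods of Meuwissen et al. 2001, is omitted); the function names, the 0/1/2 genotype coding and the effect values are hypothetical.

```python
def genomic_breeding_value(genotypes, marker_effects):
    """GEBV as the sum over markers of allele count times estimated effect."""
    if len(genotypes) != len(marker_effects):
        raise ValueError("one effect is required per genotyped marker")
    return sum(g * a for g, a in zip(genotypes, marker_effects))


def rank_candidates(candidates, marker_effects):
    """Return candidate names ordered from highest to lowest GEBV."""
    gebvs = {name: genomic_breeding_value(g, marker_effects)
             for name, g in candidates.items()}
    return sorted(gebvs, key=gebvs.get, reverse=True)


# Hypothetical effects for three SNPs and three unphenotyped young animals.
effects = [0.5, -0.2, 0.1]
young_animals = {"x": [2, 1, 0], "y": [0, 2, 2], "z": [1, 0, 2]}
ranking = rank_candidates(young_animals, effects)
```

Since only a genotype is needed, candidates can be ranked at birth, which is the source of the shorter generation interval discussed above.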

Understanding the biology driving phenotypes can be used for more than just selecting better animals. Many of the genes found to affect traits can also inform management decisions to increase performance by improving the animals’ environment. For example, use of genetics to determine at what age sows have reached the end of their productive life rather than culling them for old age at a predetermined parity could improve profitability (Mote et al. 2009). In the case of feed efficiency, drug targets could be identified in genetics studies which could be useful for producing feed additives to improve digestion efficiency. These practices require that causative genes, if not causative mutations, be identified rather than less precise genomic regions, and may thus require more follow-up studies than are needed for MAS or genomic selection.

Overall, the animal production industry has much to gain from genetics research. Significant gains in breeding value prediction accuracy, especially at a young age, from MAS or genomic selection can decrease generation intervals, improve accuracy of EBVs (Meuwissen et al. 2001) and decrease inbreeding by reducing the reliance on family data (Dekkers 2007). Increased understanding of the biology underlying traits can also be gained from genetics studies and used to improve animal production systems. SNP markers are currently the best genetic markers available for completing such research.

REFERENCES

Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR,

Wu A, Olde B, Moreno RF, et al., 1991. Complementary DNA sequencing: expressed

sequence tags and human genome project. Science. 252(5013):1651-1656.

Alcivar-Warren A, Meehan-Meola D, Park SW, Xu Z, Delaney M, Zuniga G, 2007a.

ShrimpMap: a low-density, microsatellite-based linkage map of the Pacific whiteleg

shrimp, Litopenaeus vannamei: identification of sex-linked markers in linkage group

4. J Shellfish Res. 26(4):1259-1277.

Alcivar-Warren A, Song L, Meehan-Meola D, Xu Z, Xiang J-H, Warren W, 2007b.

Characterization and mapping of expressed sequence tags isolated from a subtracted

cDNA library of Litopenaeus vannamei injected with white spot syndrome virus. J

Shellfish Res. 26(4):1247-1258.

Altshuler D, Pollara VJ, Cowles CR, Van Etten WJ, Baldwin J, Linton L, Lander ES, 2000.

An SNP map of the human genome generated by reduced representation shotgun

sequencing. Nature. 407:513-516.

Ardlie KG, Kruglyak L, Seielstad M, 2002. Patterns of linkage disequilibrium in the human

genome. Nat Rev Genet. 3:299-309.

Baala L, Audollent S, Martinovic J, Ozilou C, Babron M-C, Sivanandamoorthy S, Saunier S,

Salomon R, Gonzales M, Rattenberry E, et al., 2007. Pleiotropic effects of CEP290

(NPHP6) mutations extend to Meckel syndrome. Am J Hum Genet. 81(1):170-179.

Barbazuk WB, Emrich SJ, Chen HD, Li L, Schnable PS, 2007. SNP discovery via 454

transcriptome sequencing. Plant J. 51(5):910-918.

Barendse W, Reverter A, Bunch RJ, Harrison BE, Barris W, Thomas MB, 2007. A validated

whole-genome association study of efficient food conversion in cattle. Genetics.

176:1893-1905.

Barrett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, Rioux JD, Brant SR, Silverberg MS,

Taylor KD, Barmada MM, et al., 2008. Genome-wide association defines more than

30 distinct susceptibility loci for Crohn’s disease. Nat Genet. 40:955-962.

Beckmann JS, Soller M, 1983. Restriction fragment length polymorphisms in genetic

improvement: methodologies, mapping and costs. Theor Appl Genet. 67(1):35-43.

Bolormaa S, Hayes BJ, Savin K, Hawken R, Barendse W, Arthur PF, Herd RM, Goddard

ME, 2011. Genome-wide association studies for feedlot and growth traits in cattle. J

Anim Sci. epub ahead of print: doi:10.2527/jas.2010-3079.

Botstein D, White RL, Skolnick M, Davis RW, 1980. Construction of a genetic linkage map

in man using restriction fragment length polymorphisms. Am J Hum Genet. 32:314-

331.

Bovine HapMap Consortium, 2009. Genome-wide survey of SNP variation uncovers the

genetic structure of cattle breeds. Science. 324(5926):528-532.

Bradley DG, MacHugh DE, Cunningham P, Loftus RT, 1996. Mitochondrial diversity and

the origins of African and European cattle. Proc Natl Acad Sci (USA). 93:5131-5135.

Brookes AJ, 1999. The essence of SNPs. Gene. 234:177-186.

Cai W, Casey DS, Dekkers JCM, 2008. Selection response and genetic parameters for

residual feed intake in Yorkshire swine. J Anim Sci. 86:287-298.

Cai W, Kaiser MS, Dekkers JCM, 2011. Genetic analysis of longitudinal measurements of

performance traits in selection lines for residual feed intake in Yorkshire swine. J

Anim Sci. 89(5):1270-1280.

Cantor RM, Lange K, Sinsheimer JS, 2010. Prioritizing GWAS results: a review of statistical

methods and recommendations for their application. Am J Hum Genet. 86(1):6-22.

Carstanjen B, Abitbol M, Desbois C, 2007. Bilateral polydactyly in a foal. J Vet Sci.

8(2):201–203.

Castaño-Sánchez C, Smith TPL, Wiedmann RT, Vallejo RL, Salem M, Yao J, Rexroad III

CE, 2009. Single nucleotide polymorphism discovery in rainbow trout by deep

sequencing of a reduced representation library. BMC Genomics. 10:559.

Ciobanu DC, Bastiaansen JWM, Magrin J, Rocha JL, Jiang D-H, Yu N, Geiger B, Deeb N,

Rocha D, Gong H, et al., 2010. A major SNP resource for dissection of phenotypic

and genetic variation in Pacific white shrimp (Litopenaeus vannamei). Anim Genet.

41:39-47.

Collins FS, Guyer MS, Chakravarti A, 1997. Variations on a theme: cataloging human DNA

sequence variation. Science. 278(5343):1580-1581.

Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD,

Barnes C, Campbell P, et al., 2010. Origins and functional impact of copy number

variation in the human genome. Nature. 464:704-712.

Curson H, 1931. Polydactyly in a springbok and a pig. Anatomical Studies. 19:853.

Danforth CH, 1947. Heredity of polydactyly in the cat. J Hered. 38(4):107-112.

Dekkers JCM, 2004. Commercial application of marker- and gene-assisted selection in

livestock: strategies and lessons. J Anim Sci. 82:E313-328.

Dekkers JCM, 2007. Prediction of response to marker-assisted and genomic selection using

selection index theory. J Anim Breed Genet. 124(6):331-341.

Dekkers JCM, Hospital F, 2002. The use of molecular genetics in the improvement of

agricultural populations. Nat Rev Genet. 3:22-32.

Dhar AK, Meehan-Meola D, Breland V, Carr WH, Lotz JM, Alcivar-Warren A, 2007.

Linkage mapping of developmentally expressed cDNAs containing AT-rich elements

in shrimp (Litopenaeus vannamei)—differential expression after challenge with Taura

Syndrome Virus (TSV). J Shellfish Res. 26(4):1217-1224.

Duerr RH, Taylor KD, Brant SR, Rioux JD, Silverberg MS, Daly MJ, Steinhart AH,

Abraham C, Regueiro M, Griffiths A, et al., 2006. A genome-wide association study

identifies IL23R as an inflammatory bowel disease gene. Science. 314(5804):1461-

1463.

Duijvesteijn N, Knol EF, Merks JWM, Crooijmans RPMA, Groenen MAM, Bovenhuis H,

Harlizius B, 2010. A genome-wide association study on androstenone levels in pigs

reveals a cluster of candidate genes on chromosome 6. BMC Genet. 11:42.

Ellegren H, Chowdhary BP, Johansson M, Marklund L, Fredholm M, Gustavsson I,

Andersson L, 1994. A primary linkage map of the porcine genome reveals a low rate

of genetic recombination. Genetics. 137:1089-1100.

Ewing B, Green P, 1998. Base-calling of automated sequencer traces using Phred. II. error

probabilities. Genome Res. 8:186-194.

Ewing B, Hillier L, Wendl MC, Green P, 1998. Base-calling of automated sequencer traces

using Phred. I. accuracy assessment. Genome Res. 8:175-185.

Fahrenkrug SC, Freking BA, Smith TPL, Rohrer GA, Keele JW, 2002. Single nucleotide

polymorphism (SNP) discovery in porcine expressed genes. Anim Genet. 33(3):186-

195.

Fan B, Onteru SK, Du Z-Q, Garrick DJ, Stalder KJ, Rothschild MF, 2011. Genome-wide

association study identifies loci for body composition and structural soundness traits

in pigs. PLoS ONE. 6(2):e14726.

Fernando RL, Grossman M, 1989. Marker assisted selection using best linear unbiased

prediction. Genet Sel Evol. 21(4):467-477.

Fisher RA, 1930. The genetical theory of natural selection. Oxford, UK: Oxford University

Press.

Foster WH, Kilpatrick DJ, Heaney IH, 1983. Genetic variation in the efficiency of energy

utilization by the fattening pig. Anim Prod. 37:387-393.

Franke A, Hampe J, Rosenstiel P, Becker C, Wagner F, Häsler R, Little RD, Huse K, Ruether

A, Balschun T, et al., 2007. Systematic association mapping identifies NELL1 as a

novel IBD disease gene. PLoS ONE. 2(8):e691.

Fredholm M, Winterø AK, Christensen K, Kristensen B, Bräuner-Nielsen P, Davies W,

Archibald A, 1993. Characterization of 24 porcine (dA-dC)n-(dT-dG)n microsatellites:

genotyping of unrelated animals from four breeds and linkage studies. Mamm

Genome. 4(4):187-192.

Fredman D, White SJ, Potter S, Eichler EE, Den Dunnen JT, Brookes AJ, 2004. Complex

SNP-related sequence variation in segmental genome duplications. Nat Genet.

36(8):861-866.

Gilbert H, Bidanel JP, Gruand J, Caritez JC, Billon Y, Guillouet P, Lagant H, Noblet J,

Sellier P, 2007. Genetic parameters for residual feed intake in growing pigs, with

emphasis on genetic relationships with carcass and meat quality traits. J Anim Sci.

85:3182-3188.

Giuffra E, Kijas JM, Amarger V, Carlborg O, Jeon JT, Andersson L, 2000. The origin of the

domestic pig: independent domestication and subsequent introgression. Genetics.

154(4):1785-1791.

Glowatzki-Mullis M-L, Gaillard C, Wigger G, Fries R, 1995. Microsatellite-based parentage

control in cattle. Anim Genet. 26(1):7-12.

Gordon D, 2004. Viewing and editing assembled sequences using Consed. In: Baxevanis

AD, Davison DB, editors. Current Protocols in Bioinformatics. New York: John

Wiley & Co.; p. 11.2.1-11.2.43.

Gordon D, Abajian C, Green P, 1998. Consed: a graphical tool for sequence finishing.

Genome Res. 8:195-202.

Grapes L, Rudd S, Fernando RL, Megy K, Rocha D, Rothschild MF, 2006. Prospecting for

pig single nucleotide polymorphisms in the human genome: have we struck gold? J

Anim Breed Genet. 123(3):145-151.

Grodzicker T, Williams J, Sharp P, Sambrook J, 1974. Physical mapping of temperature-

sensitive mutations of adenoviruses. Cold Spring Harbor Symp Quant Biol. 39:439-

446.

Gross PS, Bartlett TC, Browdy CL, Chapman RW, Warr GW, 2001. Immune gene discovery

by expressed sequence tag analysis of hemocytes and hepatopancreas in the Pacific

White Shrimp, Litopenaeus vannamei, and the Atlantic White Shrimp, L. setiferus.

Dev Comp Immunol. 25:565-577.

Haldane JBS, 1948. Croonian lecture: the formal genetics of man. Proc R Soc Lond B.

135:147-170.

Hancock JM, 1995. The contribution of slippage-like processes to genome evolution. J Mol

Evol. 41(6):1038-1047.

Hayes BJ, Visscher PM, McPartlan HC, Goddard ME, 2003. Novel multilocus measure of

linkage disequilibrium to estimate past effective population size. Genome Res.

13:635-643.

Heaton MP, Harhay GP, Bennett GL, Stone RT, Grosse WM, Casas E, Keele JW, Smith

TPL, Chitko-McKown CG, Laegreid WW, 2002. Selection and use of SNP markers

for animal identification and paternity analysis in U.S. beef cattle. Mamm Genome.

13:272-281.

Herbinger CM, Doyle RW, Pitman ER, Paquet D, Mesa KA, Morris DB, Wright JM, Cook

D, 1995. DNA fingerprint based analysis of paternal and maternal effects on offspring

growth and survival in communally reared rainbow trout. Aquaculture. 137:245-256.

Herd RM, Arthur PF, 2009. Physiological basis for residual feed intake. J Anim Sci. (E.

Suppl.) 87:E64-E71.

Heuven HCM, Janss LLG, 2010. Bayesian multi-QTL mapping for growth curve parameters.

BMC Proc. 4(Suppl 1):S12.

Hill WG, 1981. Estimation of effective population size from data on linkage disequilibrium.

Genetical Res. 38:209-216.

Hirschhorn JN, Daly MJ, 2005. Genome-wide association studies for common diseases and

complex traits. Nat Rev Genet. 6:95-108.

Hoque MA, Kadowaki H, Shibata T, Oikawa T, Suzuki K, 2009. Genetic parameters for

measures of residual feed intake and growth traits in seven generations of Duroc pigs.

Livest Sci. 121:45-49.

Huang YQ, Deng XM, Du ZQ, Qiu X, Du X, Chen W, Morisson M, Leroux S, Ponce de

Léon FA, Da Y, et al., 2006. Single nucleotide polymorphisms in the chicken Lmbr1

gene are associated with chicken polydactyly. Gene. 374:10-18.

Hudson RR, Kreitman M, Aguadé M, 1987. A test of neutral molecular evolution based on

nucleotide data. Genetics. 116(1):153-159.

Hyten DL, Cannon SB, Song Q, Weeks N, Fickus EW, Shoemaker RC, Specht JE, Farmer

AD, May GD, Cregan PB, 2010a. High-throughput SNP discovery through deep

resequencing of a reduced representation library to anchor and orient scaffolds in the

soybean whole genome sequence. BMC Genomics. 11(1):38.

Hyten DL, Song Q, Fickus EW, Quigley CV, Lim J-S, Choi I-Y, Hwang E-Y, Pastor-

Corrales M, Cregan PB, 2010b. High-throughput SNP discovery and assay

development in common bean. BMC Genomics. 11:475.

International Chicken Polymorphism Map Consortium, 2004. A genetic variation map for

chicken with 2.8 million single-nucleotide polymorphisms. Nature. 432:717-722.

International Consortium for Systemic Lupus Erythematosus Genetics (SLEGEN), 2008.

Genome-wide association scan in women with systemic lupus erythematosus

identifies susceptibility variants in ITGAM, PXK, KIAA1542 and other loci. Nat

Genet. 40(2):204-210.

International Human Genome Sequencing Consortium, 2001. Initial sequencing and analysis

of the human genome. Nature. 409:860-921.

International SNP Map Working Group, 2001. A map of human genome sequence variation

containing 1.42 million single nucleotide polymorphisms. Nature. 409:928-933.

Israel C, Weller JI, 2000. Effect of misidentification on genetic gain and estimation of

breeding value in dairy cattle populations. J Dairy Sci. 83(1):181-187.

Johnson ZB, Chewning JJ, Nugent RA, 1999. Genetic parameters for production traits and

measures of residual feed intake in large white swine. J Anim Sci. 77:1679-1685.

Kan YW, Dozy AM, 1978. Antenatal diagnosis of sickle-cell anaemia by D.N.A. analysis of

amniotic-fluid cells. Lancet. 312(8096):910-912.

Kang S, Graham JM Jr., Olney AH, Biesecker LG, 1997. GLI3 frameshift mutations cause

autosomal dominant Pallister-Hall syndrome. Nat Genet. 15:266-268.

Karlsson EK, Baranowska I, Wade CM, Salmon Hillbertz NHC, Zody MC, Anderson N,

Biagi TM, Patterson N, Rosengren Pielber G, Kulbokas EJ III, et al., 2007. Efficient

mapping of mendelian traits in dogs through genome-wide association. Nat Genet.

39:1321-1328.

Kim H, Schmidt CJ, Decker KS, Emara MG, 2002. Chicken SNP discovery by EST data

mining. Plant, Animal, & Microbe Genomes X Conference; 2002 Jan 12-16, San

Diego, CA, USA. P246.

Kim H, Schmidt CJ, Decker KS, Emara MG, 2003. A double-screening method to identify

reliable candidate non-synonymous SNPs from chicken EST data. Anim Genet.

34(4):249-254.

Kim KS, Larsen N, Short T, Plastow G, Rothschild MF, 2000. A missense variant of the

porcine melanocortin-4 receptor (MC4R) gene is associated with fatness, growth, and

feed intake traits. Mamm Genome. 11:131-135.

Komura D, Shen F, Ishikawa S, Fitch KR, Chen W, Zhang J, Liu G, Ihara S, Nakamura H,

Hurles ME, et al., 2006. Genome-wide detection of human copy number variations

using high-density DNA oligonucleotide arrays. Genome Res. 16(12):1575-1584.

Kwok P-Y, Deng Q, Zakeri H, Taylor SL, Nickerson DA, 1996. Increasing the information

content of STS-based genome maps: identifying polymorphisms in mapped STSs.

Genomics. 31(1):123-126.

Lander ES, Botstein D, 1989. Mapping Mendelian factors underlying quantitative traits using

RFLP linkage maps. Genetics. 121:185-199.

Larson G, Dobney K, Albarella U, Fang M, Matisoo-Smith E, Robins J, Lowden S,

Finlayson H, Brand T, Willerslev E, et al., 2005. Worldwide phylogeography of wild

boar reveals multiple centers of pig domestication. Science. 307(5715):1618-1621.

Lettice LA, Horikoshi T, Heaney SJH, van Baren MJ, van der Linde HC, Breedveld GJ,

Joosse M, Akarsu N, Oostra BA, Endo N, et al., 2002. Disruption of a long-range cis-

acting regulator for Shh causes preaxial polydactyly. Proc Natl Acad Sci (USA).

99(11):7548-7553.

Leutenegger A-L, Prum B, Génin E, Verny C, Lemainque A, Clerget-Darpoux F, Thompson

EA, 2003. Estimation of the inbreeding coefficient through use of genomic data. Am

J Hum Genet. 73(3):516-523.

Libioulle C, Louis E, Hansoul S, Sandor C, Farnir F, Franchimont D, Vermeire S, Dewit O,

de Vos M, Dixon A, et al. 2007. Novel Crohn disease locus identified by genome-

wide association maps to a gene desert on 5p13.1 and modulates expression of

PTGER4. PLoS Genet. 3(4):e58.

Lusis AJ, 1988. Genetic factors affecting blood lipoproteins: the candidate gene approach. J

Lipid Res. 29:397-429.

Maher B, 2008. Personal genomes: the case of the missing heritability. Nature.

456(7218):18-21.

Malek M, Dekkers JCM, Lee HK, Baas TJ, Rothschild MF, 2001. A molecular genome scan

analysis to identify chromosomal regions influencing economic traits in the pig. I.

growth and body composition. Mamm Genome. 12(8):630-636.

Maller J, George S, Purcell S, Fagerness J, Altshuler D, Daly MJ, Seddon JM, 2006.

Common variation in three genes, including a noncoding variant in CFH, strongly

influences risk of age-related macular degeneration. Nat Genet. 38(9):1055-1059.

Maltecca C, Gray KA, Weigel KA, Cassady JP, Ashwell M, 2011. A genome-wide

association study of direct gestation length in US Holstein and Italian Brown

populations. Anim Genet. Epub ahead of print. doi: 10.1111/j.1365-

2052.2011.02188.x.

Marshall E, 1997. ‘Playing chicken’ over gene markers. Science. 278(5346):2046-2048.

Marth GT, Korf I, Yandell MD, Yeh RT, Gu Z, Zakeri H, Stitziel NO, Hillier L, Kwok P-Y,

Gish WR, 1999. A general approach to single-nucleotide polymorphism discovery.

Nat Genet. 23:452-456.

McKay SD, Schnabel RD, Murdoch BM, Matukumalli LK, Aerts J, Coppieters W, Crews D,

Dias Neto E, Gill CA, Gao C, et al., 2008. An assessment of population structure in

eight breeds of cattle using a whole genome SNP panel. BMC Genet. 9:37.

McKillen DJ, Chen YA, Chen C, Jenny MJ, Trent HF III, Robalino J, McLean Jr DC, Gross

PS, Chapman RW, Warr GW, et al., 2005. Marine genomics: a clearing-house for

genomic and transcriptomic data of marine organisms. BMC Genomics. 6:34.

Meehan D, Xu Z, Zuniga G, Alcivar-Warren A, 2003. High frequency and large number of

polymorphic microsatellites in cultured shrimp, Penaeus (Litopenaeus) vannamei

[Crustacea:Decapoda]. Mar Biotechnol. 5:311-330.

Megens HJ, Crooijmans RPMA, Larson G, Scandura M, Iacolina L, Apollonio M, Bertorelle

G, Triantafyllidis A, Alexandri P, Muir W, et al., 2010. The Porcine HapMap

projects: genome-wide analysis of pig, wild boar and suiforme diversity. European

Wild Boar 8th International Symposium on Wild Boars and Other Suids; 2010 Sept 1-

4, York, UK.

Meldrum D, 2000. Automation for genomics, part two: sequencers, microarrays, and future

trends. Genome Res. 10:1288-1303.

Meuwissen THE, Goddard ME, 1996. The use of marker haplotypes in animal breeding

schemes. Genet Sel Evol. 28:161-176.

Meuwissen THE, Hayes BJ, Goddard ME, 2001. Prediction of total genetic value using

genome-wide dense marker maps. Genetics. 157:1819-1829.

Miesfeld R, Krystal M, Arnheim N, 1981. A member of a new repeated sequence family

which is conserved throughout eucaryotic evolution is found between the human δ

and β globin genes. Nucleic Acids Res. 9(22):5931-5948.

Mote BE, Mabry JW, Stalder KJ, Rothschild MF, 2009. Evaluation of current reasons for

removal of sows from commercial farms. Prof Anim Scientist. 25:1-7.

Mrode RA, Kennedy BW, 1993. Genetic variation in measures of food efficiency in pigs and

their genetic relationships with growth rate and backfat. Anim Prod. 56:225-232.

Mykytyn K, Nishimura DY, Searby CC, Shastri M, Yen H, Beck JS, Braun T, Streb LM,

Cornier AS, Cox GF, et al., 2002. Identification of the gene (BBS1) most commonly

involved in Bardet-Biedl syndrome, a complex human obesity syndrome. Nat Genet.

31:435-438.

National Swine Registry, 2011. Yorkshire breed markings and registration requirements.

http://www.nationalswine.com/Home_pages/Breed_Pages/

Yorkshire_Breed_Page.html. Accessed May 19, 2011.

Nickerson DA, Tobe VO, Taylor SL, 1997. PolyPhred: automating the detection and

genotyping of single nucleotide substitutions using fluorescence-based resequencing.

Nucleic Acids Res. 25:2745-2751.

O’Leary NA, Trent HF III, Robalino J, Peck MET, McKillen DJ, Gross PS, 2006. Analysis

of multiple tissue-specific cDNA libraries from the Pacific whiteleg shrimp,

Litopenaeus vannamei. Integr Comp Biol. 46:931–939.

Onteru SK, Fan B, Du Z-Q, Garrick DJ, Stalder KJ, Rothschild MF, 2011a. A whole-genome

association study for pig reproductive traits. Anim Genet. (in press).

Onteru SK, Fan B, Nikkilä MT, Garrick DJ, Stalder KJ, Rothschild MF, 2011b. Whole-

genome association analyses for lifetime reproductive traits in the pig. J Anim Sci.

89(4):988-995.

Pavy N, Parsons LS, Paule C, MacKay J, Bousquet J, 2006. Automated SNP detection from a

large collection of white spruce expressed sequences: contributing factors and

approaches for the categorization of SNPs. BMC Genomics. 7:174.

Perez-Enriquez R, Takagi M, Taniguchi N, 1999. Genetic variability and pedigree tracing of

a hatchery-reared stock of red sea bream (Pagrus major) used for stock enhancement,

based on microsatellite DNA markers. Aquaculture. 173:413-423.

Pertoldi C, Wójcik JM, Tokarska M, Kawałko A, Kristensen TN, Loeschcke V, Gregersen

VR, Coltman D, Wilson GA, Randi E, et al., 2010. Genome variability in European

and American bison detected using the BovineSNP50 BeadChip. Conserv Genet.

11:627-634.

Pryce JE, Bolormaa S, Chamberlain AJ, Bowman PJ, Savin K, Goddard ME, Hayes BJ,

2010. A validated genome-wide association study in 2 dairy cattle breeds for milk

production and fertility traits using variable length haplotypes. J Dairy Sci.

93(7):3331-3345.

Raelson JV, Little RD, Ruether A, Fournier H, Paquin B, Van Eerdewegh P, Bradley WEC,

Croteau P, Nguyen-Huu Q, Segal J, et al., 2007. Genome-wide association study for

Crohn’s disease in the Quebec founder population identifies multiple validated

disease loci. Proc Nat Acad Sci (USA). 104(37):14747-14752.

Ramos AM, Crooijmans RPMA, Affara NA, Amaral AJ, Archibald AL, Beever JE,

Bendixen C, Churcher C, Clark R, Dehais P, et al., 2009. Design of a High Density

SNP Genotyping Assay in the Pig Using SNPs Identified and Characterized by Next

Generation Sequencing Technology. PLoS ONE. 4(8):e6524.

Ramos AM, Megens HJ, Crooijmans RPMA, Schook LB, Groenen MAM, 2011.

Identification of high utility SNPs for population assignment and traceability

purposes in the pig using high-throughput sequencing. Anim Genet. Epub ahead of

print. doi: 10.1111/j.1365-2052.2011.02198.x.

Rege JEO, Kahi A, Okomo-Adhiambo M, Mwacharo J, Hanotte O, 2001. Zebu cattle of

Kenya: uses, performance, farmer preferences and measures of genetic diversity.

International Livestock Research Institute, Nairobi, Kenya.

Richardson EC, Herd RM, 2004. Biological basis of variation in residual feed intake in beef

cattle. 2. Synthesis of results following divergent selection. Aust J Exp Agr. 44:431-

440.

Rioux JD, Xavier RJ, Taylor KD, Silverberg MS, Goyette P, Huett A, Green T, Kuballa P,

Barmada MM, Datta LW, et al., 2007. Genome-wide association study identifies new

susceptibility loci for Crohn disease and implicates autophagy in disease

pathogenesis. Nat Genet. 39:596-604.

Risch N, Merikangas K, 1996. The future of genetic studies of complex human diseases.

Science. 273(5281):1516-1517.

Robalino J, Almeida JS, McKillen D, Colglazier J, Trent HF III, Chen YA, Peck ME,

Browdy CL, Chapman RW, Warr GW, et al., 2007. Insights into the immune

transcriptome of the shrimp Litopenaeus vannamei: tissue-specific expression profiles

and transcriptomic responses to immune challenge. Physiol Genomics. 29:44-56.

Rohrer GA, Freking BA, Nonneman D, 2007. Single nucleotide polymorphisms for pig

identification and parentage exclusion. Anim Genet. 38:253-258.

Rothschild M, Jacobson C, Vaske D, Tuggle C, Wang L, Short T, Eckardt G, Sasaki S,

Vincent A, McLaren D, et al., 1996. The estrogen receptor locus is associated with a

major gene influencing litter size in pigs. Proc Nat Acad Sci (USA). 93:201-205.

Rothschild MF, Van Cleave PS, Glenn KL, Carlstrom LP, Ellinwood NM, 2006. Association

of MITF with white spotting in Beagle crosses and Newfoundland dogs. Anim Genet.

37(6):606-607.

Saiki RK, Scharf S, Faloona F, Mullis KB, Horn GT, Erlich HA, Arnheim N, 1985.

Enzymatic amplification of β-globin genomic sequences and restriction site analysis

for diagnosis of sickle cell anemia. Science. 230:1350-1354.

Sham PC, Cherny SS, Purcell S, 2009. Application of genome-wide SNP data for uncovering

pairwise relationships and quantitative trait loci. Genetica. 136:237-243.

Sherman EL, Nkrumah JD, Murdoch BM, Moore SS, 2008. Identification of polymorphisms

influencing feed intake and efficiency in beef cattle. Anim Genet. 39:225-231.

Shuber AP, Michalowsky LA, Nass GS, Skoletsky J, Hire LM, Kotsopoulos SK, Phipps MF,

Barberio DM, Klinger KW, 1997. High throughput parallel analysis of hundreds of

patient samples for more than 100 mutations in multiple disease genes. Hum Mol

Genet. 6(3):337-347.

Snelling WM, Allan MF, Keele JW, Kuehn LA, McDaneld T, Smith TPL, Sonstegard TS,

Thallman RM, Bennett GL, 2010. Genome-wide association study of growth in

crossbred beef cattle. J Anim Sci. 88(3):837-848.

Soller M, 1978. The use of loci associated with quantitative effects in dairy cattle

improvement. Anim Prod. 27:133-139.

Soller M, Beckmann JS, 1983. Genetic polymorphism in varietal identification and genetic

improvement. Theor Appl Genet. 67(1):25-33.

Spritz RA, 1981. Duplication/deletion polymorphism 5'- to the human β-globin gene. Nucl

Acids Res. 9(19):5037-5048.

Steemers FJ, Gunderson KL, 2007. Whole genome genotyping technologies on the

BeadArray™ platform. Biotechnol J. 2(1):41-49.

Stefansson H, Sigurdsson E, Steinthorsdottir V, Bjornsdottir S, Sigmundsson T, Ghosh S,

Brynjolfsson J, Gunnarsdottir S, Ivarsson O, Chou TT, et al., 2002. Neuregulin 1 and

susceptibility to schizophrenia. Am J Hum Genet. 71(4):877-892.

Stormont C, 1967. Contribution of blood typing to dairy science progress. J Dairy Sci.

50(2):253-260.

Sun X, Habier D, Fernando RL, Garrick DJ, Dekkers JCM, 2011. Genomic breeding value

prediction and QTL mapping of QTLMAS2010 data using Bayesian methods. BMC

Proc. 5(Suppl 3):S13.

Sunyaev S, Hanke J, Aydin A, Wirkner U, Zastrow I, Reich J, Bork P, 1999. Prediction of

nonsynonymous single nucleotide polymorphisms in human disease-associated genes.

J Mol Med. 77:754-760.

Taillon-Miller P, Gu Z, Li Q, Hillier L, Kwok P-Y, 1998. Overlapping genomic sequences: a

treasure trove of single-nucleotide polymorphisms. Genome Res. 8:748-754.

Tian C, Gregersen PK, Seldin MF, 2008. Accounting for ancestry: population substructure

and genome-wide association studies. Hum Mol Genet. 17(R2):R143-R150.

Tóth G, Gáspári Z, Jurka J, 2000. Microsatellites in different eukaryotic genomes: survey and

analysis. Genome Res. 10:967-981.

Van Tassell CP, Smith TPL, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT,

Haudenschild CD, Moore SS, Warren WC, Sonstegard TS, 2008. SNP discovery and

allele frequency estimation by deep sequencing of reduced representation libraries.

Nat Methods. 5(3):247-252.

VanRaden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF,

Schenkel FS, 2009. Reliability of genomic predictions for North American Holstein

bulls. J Dairy Sci. 92(1):16-24.

Veerkamp RF, Verbyla KL, Mulder HA, Calus MPL, 2010. Simultaneous QTL detection and

genomic breeding value estimation using high density SNP chips. BMC Proc.

4(Suppl 1):S9.

Vignal A, Milan D, SanCristobal M, Eggen A, 2002. A review on SNP and other types of

molecular markers and their use in animal genetics. Genet Sel Evol. 34:275-305.

Visscher PM, Woolliams JA, Smith D, Williams JL, 2002. Estimation of pedigree errors in

the UK dairy population using microsatellite markers and the impact on selection. J

Dairy Sci. 85(9):2368-2375.

Von Felde A, Roehe R, Looft H, Kalm E, 1996. Genetic association between feed intake and

feed intake behavior at different stages of growth of group-housed boars. Livest Prod

Sci. 47:11-12.

Wade CM, Giulotto E, Sigurdsson S, Zoli M, Gnerre S, Imsland F, Lear TL, Adelson DL,

Bailey E, Bellone RR, et al., 2009. Genome sequence, comparative analysis, and

population genetics of the domestic horse. Science. 326(5954):865-867.

Wakeley J, Nielsen R, Liu-Cordero SN, Ardlie K, 2001. The discovery of single-nucleotide

polymorphisms--and inferences about human demographic history. Am J Hum Genet.

69(6):1332-1347.

Wang S, Sha Z, Sonstegard TS, Liu H, Xu P, Somridhivej B, Peatman E, Kucuktas H, Liu Z,

2008. Quality assessment parameters for EST-derived SNPs from catfish. BMC

Genomics. 9:450.

Weissenbach J, Gyapay G, Dib C, Vignal A, Morissette J, Millasseau P, Vaysseix G, Lathrop

M, 1992. A second-generation linkage map of the human genome. Nature. 359:794-

801.

Wellcome Trust Case Control Consortium, 2007. Genome-wide association study of 14,000

cases of seven common diseases and 3,000 shared controls. Nature. 447:661-678.

Weller JI, Glick G, Ezra E, Zeron Y, Seroussi E, Ron M, 2010. Paternity validation and

estimation of genotyping error rate for the BovineSNP50 BeadChip. Anim Genet.

41:551-553.

Wiedmann RT, Smith TPL, Nonneman DJ, 2008. SNP discovery in swine by reduced

representation and high throughput pyrosequencing. BMC Genet. 9:81.

Wright S, 1922. Coefficients of inbreeding and relationship. Am Nat. 56(645):330-338.

Xu J, Turner A, Little J, Bleecker ER, Meyers DA, 2002. Positive results in association

studies are associated with departure from Hardy-Weinberg equilibrium: hint for

genotyping error? Hum Genet. 111(6):573-574.

Yamazaki K, McGovern D, Ragoussis J, Paolucci M, Butler H, Jewell D, Cardon L, Takazoe

M, Tanaka T, Ichimori T, et al., 2005. Single nucleotide polymorphisms in TNFSF15

confer susceptibility to Crohn’s disease. Hum Mol Genet. 14(22):3499-3506.

CHAPTER 2. SNP DISCOVERY IN LITOPENAEUS VANNAMEI WITH A NEW COMPUTATIONAL PIPELINE

A paper published in Animal Genetics. Volume 40(1):106-109.

D. M. Gorbach, Z.-L. Hu, Z.-Q. Du, M. F. Rothschild

Summary

Litopenaeus vannamei (Pacific white shrimp) have been farmed in the Americas for many years and are growing in popularity in Asia with the development of specific pathogen-free stocks. The full genomic sequence of this species might not be available in the near future, so other tools are needed to discover the location of polymorphic sites for quantitative trait loci mapping, association studies and subsequent marker-assisted selection. Currently, 25,937 L. vannamei expressed sequence tags (ESTs) are publicly available. These sequences were manually screened, masked for tandem repeats and input into CAP3 for clustering. The resulting 3,532 contigs were analysed for possible single nucleotide polymorphisms (SNPs) with SNPidentifier, a newly developed computer program for predicting SNPs. SNPidentifier is designed for ESTs without accompanying chromatogram sequence quality information, and therefore it performs quality control checks on all data. SNPidentifier sets a threshold such that the sequences used have a poor-quality nucleotide (N) frequency <0.1, and it trims off the first 10 bases of every sequence to ensure higher sequence quality. For a base to be predicted as a SNP, the minor nucleotide (allele) frequency must be >0.1, it must be observed at least four times, and the 15 bases on either side must exactly match the consensus sequence. Using these conservative parameters, 504 SNPs were predicted from 141 contigs for L. vannamei. A small sample of 18 individuals from three lines has been sequenced to verify prediction results, and 17 of the 39 tested SNPs (44%) have been confirmed.

Article

In recent years, most of the research on shrimp has focused on the penaeid species because they are the most commonly farmed. Litopenaeus vannamei (Pacific white shrimp) are commonly bred in captivity in the Americas and are becoming more popular for farming in Asia (Moss et al. 2006). The primary advantages of L. vannamei include the availability of specific pathogen-free stocks and the ease of breeding them in captivity (Moss et al. 2006). To increase the success of captive breeding, a better understanding of the underlying molecular and genetic mechanisms for important agronomic traits is needed to allow marker-assisted selection. Single nucleotide polymorphisms (SNPs) have been useful as markers for quantitative trait loci identification, and these have been employed in the large-scale whole-genome association analyses currently ongoing in human and livestock disease genetics. However, the locations of very few SNPs are currently known in L. vannamei.

Expressed sequence tags (ESTs) have been used successfully for predicting SNPs in other species (Buetow et al. 1999, Guryev et al. 2004, Hayes et al. 2007). One of the main limitations of using ESTs for SNP prediction is the lack of ESTs available in some species. Penaeus monodon, for example, only has about 8,000 ESTs publicly available and these do not provide enough homologous sequences to predict many SNPs with a high degree of accuracy (data not shown). When more ESTs are available for a species, we need to take advantage of that information for SNP mining and mapping, particularly when a whole-genome map is not available for that species. Such a species is L. vannamei, for which 25,937 ESTs are publicly available at http://www.marinegenomics.org (O’Leary et al. 2006).

Although attempts were made to find and use chromatogram files and information related to sequence quality (e.g. Q score), this information was not available to us for the majority of the L. vannamei EST data. Consequently, the quality of these EST sequences is in question. To cope with this problem, a new software package, SNPidentifier, was developed to perform quality control checks and predict SNP locations from ESTs. The ESTs from L. vannamei were checked manually for usability and removed if they showed any of the following problems: (i) apparent ‘junk’ sequences, such as those 5-10 bp in length, (ii) apparent primer-only sequences (16-20 bp in length) or (iii) sequences with mixed non-ACGT letters apparently introduced by bioinformatic programs at the sequencing facility. A total of 3,253 sequences were removed based on these criteria. The ESTs were then masked for low-quality tandem repeats using RepeatMasker to avoid low-specificity matches in clustering. Other techniques, such as increasing the BLAST stringency and BLASTing the consensus sequences against known peptide sequences with BLASTX to make sure the translation did not have gaps, were also used to minimize possible problems with chimeric sequences.
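The three manual screening criteria can be expressed as a small filter. This is only an illustrative Python sketch: the screening in this study was done by hand, and the function name `screen_est`, the use of the quoted length ranges as hard cut-offs, and the decision to tolerate N (which SNPidentifier handles at a later step) are all assumptions.

```python
def screen_est(seq):
    """Return a reason for removing an EST, or None if it is kept.

    Hypothetical helper; the cut-offs mirror the length ranges quoted
    in the text and are assumptions, not a published specification.
    """
    s = seq.strip().upper()
    if len(s) <= 10:
        # (i) apparent 'junk' sequences, e.g. those 5-10 bp long
        return "junk"
    if 16 <= len(s) <= 20:
        # (ii) apparent primer-only sequences (16-20 bp)
        return "primer-only"
    if set(s) - set("ACGTN"):
        # (iii) letters other than A, C, G, T (N is tolerated here)
        return "non-ACGT letters"
    return None


# Made-up example sequences, one per outcome.
ests = ["ACGTA", "ACGT" * 4 + "AC", "ACGT" * 10 + "XX", "ACGT" * 10]
kept = [s for s in ests if screen_est(s) is None]
print(len(kept), "of", len(ests), "example sequences kept")
```

Sequences of 11-15 bp are not covered by the criteria in the text, so this sketch leaves them to the letter check alone.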

The remaining ESTs were input into CAP3 to be arranged into contigs (Fig. 1). CAP3 produced 3,532 contigs containing 8,628 sequences, with 5,096 singlets remaining unclustered. The output from CAP3 was input into SNPidentifier, which then (i) removed all sequences in which more than 10% of the bases were Ns; (ii) removed the first 10 bases of every sequence to reduce the problem of inaccurate vector clipping; and (iii) searched for bases that did not match the consensus sequence. Whenever a base did not match, the program checked the 15 bases on either side to ensure that they exactly matched the consensus sequence. The program predicted a base position to contain an SNP only if the minor nucleotide (allele) frequency was at least 0.1 and at least four sequences contained each allele. The output score was based on the number of sequences containing the minor nucleotide (allele). These parameters were designed to be conservative, but can be easily altered to fit the user’s needs. SNPidentifier software is available for download from http://www.animalgenome.org/bioinfo/tools/share/SNPidentifier.
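The checks described in this paragraph can be sketched as follows. This is not the SNPidentifier source code; it is a minimal Python illustration that assumes ungapped read-to-consensus alignments. The function name `predict_snps`, the `(start, sequence)` read representation and the constant names are hypothetical, and applying the N filter before trimming is also an assumption.

```python
from collections import Counter, defaultdict

TRIM = 10              # leading bases removed from every read
MAX_N_FRACTION = 0.10  # reads with a larger share of Ns are discarded
FLANK = 15             # bases on each side that must match the consensus
MIN_MINOR_FREQ = 0.10  # minimum minor nucleotide (allele) frequency
MIN_ALLELE_COUNT = 4   # minimum number of reads carrying each allele


def predict_snps(consensus, reads):
    """Return (column, major, minor, minor_count, minor_freq) tuples.

    `reads` is a list of (start, sequence) pairs, where `start` is the
    0-based consensus column of the read's first base (assumed layout).
    """
    counts = defaultdict(Counter)
    for start, seq in reads:
        seq = seq.upper()
        if not seq or seq.count("N") / len(seq) > MAX_N_FRACTION:
            continue  # too many poor-quality bases
        seq, start = seq[TRIM:], start + TRIM
        for i, base in enumerate(seq):
            col = start + i
            if base not in "ACGT" or col >= len(consensus):
                continue
            if base != consensus[col]:
                if i < FLANK or i + FLANK >= len(seq):
                    continue  # not enough flanking sequence to check
                left_ok = seq[i - FLANK:i] == consensus[col - FLANK:col]
                right_ok = (seq[i + 1:i + 1 + FLANK]
                            == consensus[col + 1:col + 1 + FLANK])
                if not (left_ok and right_ok):
                    continue  # mismatch sits in a poorly matching region
            counts[col][base] += 1

    snps = []
    for col in sorted(counts):
        ranked = counts[col].most_common()
        if len(ranked) < 2:
            continue
        (major, _), (minor, minor_n) = ranked[0], ranked[1]
        freq = minor_n / sum(counts[col].values())
        # The major allele count is >= the minor count, so one test
        # covers "at least four sequences contained each allele".
        if minor_n >= MIN_ALLELE_COUNT and freq >= MIN_MINOR_FREQ:
            snps.append((col, major, minor, minor_n, freq))
    return snps


# Toy contig: five reads match the consensus, four carry an A at column 30.
consensus = "ACGT" * 15
variant = consensus[:30] + "A" + consensus[31:]
reads = [(0, consensus)] * 5 + [(0, variant)] * 4
print(predict_snps(consensus, reads))
```

Gap handling and the minimum contig depth shown in Fig. 1 are left out to keep the sketch short. The thresholds are module-level constants so that they can be relaxed, in line with the statement that the parameters can be altered.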

Using the conservative parameters outlined above, SNPidentifier predicted 504 SNPs

from 141 contigs in L. vannamei (Table S1). The sequences containing SNPs were BLASTed

against , and primer sets were designed for the nine best matches

using Primer3 (http://frodo.wi.mit.edu/). An additional 10 primer sets were designed in the

same manner for randomly selected sequences. The primers for the randomly selected

sequences were designed to amplify 110-250 bp fragments to reduce the probability of containing introns. PCR was carried out with GoTaq (Promega) using standard procedures.

The primer sets are listed in Table 1. Out of the 19 primer sets designed, 15 consistently amplified DNA. Sequence data were collected for 18 individuals from three L. vannamei lines and analysed using Sequencher software (Gene Codes Corporation). To date, 17 of 39

SNPs (44%) have been confirmed (dbSNP accession numbers ss8632910-ss86352926).

Possible reasons for the low validation rate include the fact that a small set of animals from only three shrimp lines was used for validation. Consequently, the variation observed across the multiple lines of L. vannamei used to generate ESTs may be much larger than that of our validation population. In addition, EST sequences do not provide information on the allele frequencies of the population, so the frequencies of some alleles may have been too low in our population to observe and validate. Furthermore, EST sequences are known to contain inaccuracies. Although SNPidentifier was designed to identify and remove potential inaccuracies, not all of these problems can be correctly identified. In addition, 47 SNPs for which there was no evidence from the deposited ESTs were located via sequencing in our population (Table 1). Note that these SNPs have not been confirmed using any other method. These additional 47 SNPs and the sequence surrounding them are given in Table S2. The relatively large number of SNPs identified provides further evidence for our previous belief that L. vannamei has a highly polymorphic genome.

A chi-squared test was performed by dividing the SNPs into two categories based on whether more than 15 sequences contributed to the SNP being predicted or 15 or fewer sequences contributed. The number of contributing sequences did not have a significant effect on prediction accuracy (p-value = 0.45). Two categories of SNPs were also created based on whether the minor nucleotide (allele) frequency was >0.333 or not. A chi-squared test on whether these SNP predictions were confirmed or not did not show a significant effect of minor nucleotide (allele) frequency on prediction accuracy (p-value = 0.65).
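As an illustration of the kind of test described here, a Pearson chi-squared statistic for a 2 × 2 table (SNP category by confirmed or not confirmed) can be computed as below. The counts in the usage line are hypothetical, because the per-category counts are not reported in this chapter. Whether a continuity correction was applied is also not stated, and none is used in this sketch.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# HYPOTHETICAL counts (not from the study): rows are ">15 sequences" and
# "15 or fewer"; columns are "confirmed" and "not confirmed".
chi2, p = chi_squared_2x2(10, 10, 7, 12)
print(f"chi-squared = {chi2:.3f}, p-value = {p:.3f}")
```

With only 39 tested SNPs, some expected cell counts may be small, so the chi-squared approximation should be read with that in mind.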

Acknowledgements

Funding was provided by the USDA NRSP8 National Animal Genome Program, the

Iowa Agricultural Home Economics Experiment Stations, State of Iowa and Hatch funding.

D. M. Gorbach is supported by a USDA-CSREES National Needs fellowship under Grant no. 2007-38420-17767.

References

Buetow K.H., Edmonson M.N., & Cassidy A.B. (1999) Reliable identification of large

numbers of candidate SNPs from public EST data. Nature Genetics 21, 323-325.

Guryev V., Berezikov E., Malik R., Plasterk R.H.A., & Cuppen E. (2004) Single nucleotide

polymorphisms associated with rat expressed sequences. Genome Research 14:1438-1443.

Hayes B.J., Nilsen K., Berg P.R., Grindflek E., & Lien S. (2007) SNP detection exploiting

multiple sources of redundancy in large EST collections improves validation rates.

Bioinformatics 23(13):1692-1693.

Moss S. M., Arce S.M. & Moss D.R. (2006) Use of domesticated specific pathogen free

Pacific white shrimp Litopenaeus vannamei in Asia. Opportunities and challenges.

World Aquaculture Society Conference, Florence, Italy.

O’Leary N.A., Trent H.F., Robalino J., Peck M.E.T., McKillen D.J. & Gross P.S. (2006).

Analysis of multiple tissue-specific cDNA libraries from the Pacific whiteleg shrimp,

Litopenaeus vannamei. Integ. Comp. Biol. 46:931-939.


Raw EST sequences

Sequences masked for tandem repeats

Sequences grouped into contigs with CAP3

Contigs’ alignments were evaluated for SNPs with SNPidentifier

Number of sequences in contig alignments ≥ 8? No, then cannot find any SNPs in contig. Yes, then remove the first 10 bases of each sequence.

Are more than 10% of the nucleotides in the sequence poor quality (Ns)? Yes, then remove this sequence from the pipeline. No, then search for differences between this sequence and the consensus sequence.

If difference results from a gap, then increase gap percent. If the gap percentage exceeds 30%, then not a true SNP.

For each difference containing an A, C, G, or T, do the 15 bases on either side exactly match the consensus sequence? No, then remove this difference from the SNP prediction. Yes, then compute minor allele frequency and count.

If one or both is below the threshold, then not predicted to be a SNP. If both values exceed the thresholds, then base position is predicted to contain a SNP.

Figure 1. A flow chart for clustering ESTs and using SNPidentifier to predict SNPs from the resulting contigs.

Table 1. Primers used and their success in amplifying and confirming SNPs.

| Contig Number¹ | Forward Primer (5’-3’) | Reverse Primer (5’-3’) | Amplifies Well? | Amplified EST Fragment Size (bp) | Average Primer TM (°C) | Gene Name | No. of Confirmed SNPs² | No. of Additional SNPs |
|---|---|---|---|---|---|---|---|---|
| 282 | CCTACGTCCTCCTGGCTAAA | GACACGTTCTGGTCTGCTCA | Yes | 140 | 59.7 | unknown | 0 / 1 | 7 |
| 372 | CACCCGTGATCCTAGCTTCT | GCCTTCAATTGCTACGCTGT | Yes | 139 | 59.9 | HCA | 1 / 2 | 0 |
| 995 | CTCGAAAGCCAAGAGGTTTG | TTGCCAATAACAGCAGCATC | Yes | 123 | 59.9 | RPL13 | - | |
| 1015 | ATCCACATTTCCATCGTGGT | CACGATCAGCTGCTTCACAC | Yes | 435 | 60.3 | Elongation factor 1α100E | 2 / 5 | 3 |
| 1417 | TCAACAGCTCGACTTTGCTG | TGCCTACGTCCATTTCACAA | Yes | 196 | 55.0 | unknown | 0 / 1 | 0 |
| 1464 | AAGGTATGGAGCCTGACCAA | GAAGCACAGCGAGTTGATGA | No | 201 | 55.7 | Rack1 | n/a | |
| 1659 | AAGGCTCAGTGCCCCATT | TCTCCAGTGAGGAGGTGGAT | Yes | 131 | 60.1 | RPS5 | 0 / 1 | 1 |
| 2313 | TGTGTTTGCCATGTTTGACC | ATCTGTGCCCTGAAGACGAT | Yes | 232 | 55.2 | sqh | 0 / 2 | 2 |
| 2669 | AACCTCGCTTGACAACCTGT | ATCCATTCCACAAGCCAGAG | No | 110 | 55.9 | unknown | n/a | |
| 2814 | GCCAATGGTGCATCAGTCTA | CTCCTTGTCTTTGCCCTGAG | No | 138 | 55.4 | RPL26 | n/a | |
| 2826 | GCCGTCTTCCCCTCCATC | ACCGTCGGGAAGCTCGTAG | Yes | 651 | 62.8 | Actin α2 | 6 / 12 | 13 |
| 2842 | TGGATATGTTTCATGCCCATT | TTTTCCATAGAAACCAAGAACACA | No | 238 | 52.6 | RPS30 | n/a | |
| 2856 | ACTGGGACGACATGGAGAAG | AGGAAGGAAGGCTGGAAAAG | Yes | 568 | 60.0 | Actin 57B | 2 / 4 | 19 |
| 3038 | CCCTGGTCACGTTGATTTCT | GTGGCAATGATGACGTTCAC | Yes | 258 | 60.0 | P82216 | 0 / 1 | 0 |
| 3077 | TTGCCGATGCTGTTAAGTTG | TTTGAAGCTCACCCTGCTCT | Yes | 226 | 60.0 | mt:ND1 | 1 / 2 | 0 |
| 3154 | GAATGGAGGAGCTTGTGGAG | GGATAAGGACCCAGCCTCTC | Yes | 190 | 56.1 | unknown | 0 / 1 | 0 |
| 3172 | GCATATCCTGCGTTTGATGA | CGGCCTTCTTGAGGACAAT | Yes | 116 | 59.9 | RPS18 | 1 / 1 | 0 |
| 3505 | ACTGGGACGACATGGAAAAG | AGGAAGGAAGGCTGGAACAT | Yes | 568 | 60.0 | Actin 42A | 3 / 3 | 1 |
| 3511 | GATCGTAAGGTTCCCGATGA | GGCAATCACATTTTATTTGCTTT | Yes | 195 | 59.6 | HCY6_ANDAU | 1 / 3 | 1 |

¹The following cycling conditions were used for all contigs except 2856 and 3511: one step of 94°C for 3 min; 35 cycles of 94°C for 30 s, 61°C for 30 s and 70°C for 30 s; and one step of 72°C for 5 min. For contig 2856, the extension temperature was 68°C; for contig 3511, the annealing temperature was 55°C.

²For each contig, the following is indicated: the number of SNPs confirmed/the number of SNPs predicted in the fragment. For contig 995, the fragment contained a large repeat region that made the sequence undeterminable; n/a indicates that sequencing was not possible because the primers did not amplify a fragment.

Table S1. Positions of predicted SNPs and the homology of the surrounding sequence to known proteins

| Contig Number | SNP position | Minor Nucleotide Frequency | Sequence count | SNPidentifier Score¹ | Matching Protein from BLAST / Its Function or Location | BLAST e-value² | Confirmed?³ |
|---|---|---|---|---|---|---|---|
| 14 | 478 | 0.4 | 10 | 4 | | | |
| 53 | 44 | 0.25 | 16 | 4 | | | |
| 59 | 658 | 0.5 | 8 | 4 | | | |
| 99 | 134 | 0.2 | 20 | 4 | | | |
| 99 | 300 | 0.435 | 23 | 10 | | | |
| 99 | 372 | 0.348 | 23 | 8 | | | |
| 100 | 121 | 0.174 | 23 | 4 | | | |
| 103 | 264 | 0.5 | 12 | 6 | | | |
| 103 | 313 | 0.462 | 13 | 6 | | | |
| 103 | 337 | 0.385 | 13 | 5 | | | |
| 103 | 392 | 0.462 | 13 | 6 | | | |
| 115 | 64 | 0.25 | 24 | 6 | CG8562 / metallocarboxypeptidase activity | 1e-12 | |
| 115 | 297 | 0.4 | 10 | 4 | CG8562 / metallocarboxypeptidase activity | 1e-12 | |
| 117 | 258 | 0.5 | 8 | 4 | | | |
| 129 | 80 | 0.222 | 18 | 4 | CG2973 / structural constituent of chitin-based cuticle | 1e-07 | |
| 129 | 95 | 0.2 | 20 | 4 | CG2973 / structural constituent of chitin-based cuticle | 1e-07 | |
| 129 | 123 | 0.154 | 26 | 4 | CG2973 / structural constituent of chitin-based cuticle | 1e-07 | |
| 129 | 198 | 0.182 | 22 | 4 | CG2973 / structural constituent of chitin-based cuticle | 1e-07 | |
| 133 | 201 | 0.5 | 8 | 4 | | | |
| 157 | 170 | 0.455 | 11 | 5 | PSAP: Proactivator polypeptide precursor | 1e-12 | |
| 157 | 210 | 0.364 | 11 | 4 | PSAP: Proactivator polypeptide precursor | 1e-12 | |
| 176 | 179 | 0.364 | 22 | 8 | | | |
| 195 | 63 | 0.4 | 10 | 4 | | | |

¹ SNPidentifier score is based on the number of sequences containing the minor nucleotide (allele)
² GO terms [e-value < 1e-6]
³ Has entries only if attempts were made to confirm the SNP position

| 212 | 371 | 0.25 | 16 | 4 | | | |
| 221 | 104 | 0.333 | 60 | 20 | | | |
| 221 | 247 | 0.462 | 52 | 24 | | | |
| 221 | 302 | 0.308 | 52 | 16 | | | |
| 221 | 388 | 0.32 | 50 | 16 | | | |
| 248 | 135 | 0.5 | 8 | 4 | | | |
| 251 | 291 | 0.5 | 8 | 4 | CHI3L2: Chitinase 3-like protein 2 precursor | 2e-11 | |
| 260 | 55 | 0.5 | 8 | 4 | | | |
| 262 | 358 | 0.5 | 8 | 4 | CPA2: Carboxypeptidase A2 precursor | 2e-10 | |
| 275 | 139 | 0.462 | 26 | 12 | CG9299 / structural constituent of chitin-based cuticle | 2e-07 | |
| 275 | 220 | 0.364 | 22 | 8 | CG9299 / structural constituent of chitin-based cuticle | 2e-07 | |
| 275 | 362 | 0.222 | 18 | 4 | CG9299 / structural constituent of chitin-based cuticle | 2e-07 | |
| 282 | 495 | 0.429 | 14 | 6 | | | |
| 282 | 589 | 0.429 | 14 | 6 | | | no |
| 282 | 707 | 0.4 | 10 | 4 | | | |
| 295 | 172 | 0.4 | 10 | 4 | | | |
| 298 | 265 | 0.267 | 15 | 4 | CG9299 / structural constituent of chitin-based cuticle | 9e-08 | |
| 298 | 347 | 0.308 | 13 | 4 | CG9299 / structural constituent of chitin-based cuticle | 9e-08 | |
| 307 | 475 | 0.5 | 8 | 4 | | | |
| 315 | 197 | 0.333 | 12 | 4 | | | |
| 322 | 335 | 0.4 | 10 | 4 | | | |
| 338 | 434 | 0.5 | 8 | 4 | Hexosaminidase 1/beta-N-acetylhexosaminidase activity | 3e-37 | |
| 338 | 555 | 0.333 | 12 | 4 | Hexosaminidase 1/beta-N-acetylhexosaminidase activity | 3e-37 | |
| 338 | 612 | 0.308 | 13 | 4 | Hexosaminidase 1/beta-N-acetylhexosaminidase activity | 3e-37 | |
| 363 | 641 | 0.118 | 34 | 4 | Carbonic anhydrase 1/ carbonate dehydratase activity | 3e-32 | |
| 363 | 734 | 0.133 | 30 | 4 | Carbonic anhydrase 1/ carbonate dehydratase activity | 3e-32 | |

| 363 | 774 | 0.143 | 28 | 4 | Carbonic anhydrase 1/ carbonate dehydratase activity | 3e-32 | |
| 372 | 98 | 0.421 | 76 | 32 | HCA, HC1: Hemocyanin subunit A precursor | 1e-30 | no |
| 372 | 122 | 0.15 | 80 | 12 | HCA, HC1: Hemocyanin subunit A precursor | 1e-30 | yes |
| 372 | 166 | 0.152 | 92 | 14 | HCA, HC1: Hemocyanin subunit A precursor | 1e-30 | |
| 372 | 262 | 0.149 | 94 | 14 | HCA, HC1: Hemocyanin subunit A precursor | 1e-30 | |
| 372 | 303 | 0.142 | 106 | 15 | HCA, HC1: Hemocyanin subunit A precursor | 1e-30 | |
| 381 | 664 | 0.5 | 8 | 4 | | | |
| 383 | 59 | 0.5 | 8 | 4 | Minute (2) 21AB/methionine adenosyltransferase activity | 3e-87 | |
| 383 | 233 | 0.5 | 8 | 4 | Minute (2) 21AB/methionine adenosyltransferase activity | 3e-87 | |
| 383 | 254 | 0.5 | 8 | 4 | Minute (2) 21AB/methionine adenosyltransferase activity | 3e-87 | |
| 421 | 463 | 0.4 | 10 | 4 | | | |
| 421 | 536 | 0.4 | 10 | 4 | | | |
| 500 | 64 | 0.4 | 10 | 4 | lacI, b0345, JW0336: Lactose operon repressor | 3e-08 | |
| 541 | 192 | 0.333 | 12 | 4 | Actin (Fragment) / protein binding | 1e-33 | |
| 541 | 304 | 0.4 | 10 | 4 | Actin (Fragment) / protein binding | 1e-33 | |
| 546 | 148 | 0.45 | 20 | 9 | mitochondrial Cytochrome b/ubiquinol-cytochrome-c reductase | 2e-125 | |
| 555 | 283 | 0.5 | 8 | 4 | | | |
| 555 | 345 | 0.5 | 8 | 4 | | | |
| 598 | 140 | 0.4 | 10 | 4 | | | |
| 763 | 453 | 0.286 | 14 | 4 | mitochondrial NADH-ubiquinone oxidoreductase chain 6 | 2e-17 | |
| 763 | 569 | 0.286 | 14 | 4 | mitochondrial NADH-ubiquinone oxidoreductase chain 6 | 2e-17 | |
| 951 | 35 | 0.5 | 8 | 4 | | | |
| 951 | 150 | 0.5 | 8 | 4 | | | |

| 951 | 186 | 0.5 | 8 | 4 | | | |
| 961 | 95 | 0.5 | 8 | 4 | Mitochondrial phosphate carrier protein | 4e-26 | |
| 976 | 144 | 0.4 | 10 | 4 | cytochrome c oxidase, subunit VIIc / integral to membrane | 3e-11 | |
| 976 | 195 | 0.5 | 10 | 5 | cytochrome c oxidase, subunit VIIc / integral to membrane | 3e-11 | |
| 976 | 249 | 0.5 | 10 | 5 | cytochrome c oxidase, subunit VIIc / integral to membrane | 3e-11 | |
| 978 | 207 | 0.333 | 18 | 6 | | | |
| 995 | 81 | 0.122 | 41 | 5 | RPL13, BBC1, OK/SW-cl.46: 60S ribosomal protein L13 | 4e-54 | |
| 995 | 127 | 0.237 | 38 | 9 | RPL13, BBC1, OK/SW-cl.46: 60S ribosomal protein L13 | 4e-54 | |
| 995 | 219 | 0.237 | 38 | 9 | RPL13, BBC1, OK/SW-cl.46: 60S ribosomal protein L13 | 4e-54 | |
| 995 | 318 | 0.372 | 43 | 16 | RPL13, BBC1, OK/SW-cl.46: 60S ribosomal protein L13 | 4e-54 | |
| 995 | 391 | 0.333 | 39 | 13 | RPL13, BBC1, OK/SW-cl.46: 60S ribosomal protein L13 | 4e-54 | |
| 995 | 546 | 0.125 | 32 | 4 | RPL13, BBC1, OK/SW-cl.46: 60S ribosomal protein L13 | 4e-54 | |
| 995 | 625 | 0.167 | 24 | 4 | RPL13, BBC1, OK/SW-cl.46: 60S ribosomal protein L13 | 4e-54 | |
| 1006 | 59 | 0.105 | 38 | 4 | zgc:56493 / cellular component | 5e-28 | |
| 1006 | 62 | 0.189 | 37 | 7 | zgc:56493 / cellular component | 5e-28 | |
| 1006 | 87 | 0.154 | 39 | 6 | zgc:56493 / cellular component | 5e-28 | |
| 1006 | 329 | 0.5 | 34 | 17 | zgc:56493 / cellular component | 5e-28 | |
| 1010 | 272 | 0.143 | 28 | 4 | Ribosomal protein L32 / cytosolic large ribosomal subunit | 4e-49 | |
| 1015 | 186 | 0.333 | 12 | 4 | Elongation factor 1alpha100E / translational elongation | 4e-103 | no |

| 1015 | 267 | 0.364 | 11 | 4 | Elongation factor 1alpha100E / translational elongation | 4e-103 | no |
| 1015 | 396 | 0.444 | 9 | 4 | Elongation factor 1alpha100E / translational elongation | 4e-103 | yes |
| 1015 | 432 | 0.444 | 9 | 4 | Elongation factor 1alpha100E / translational elongation | 4e-103 | no |
| 1015 | 453 | 0.444 | 9 | 4 | Elongation factor 1alpha100E / translational elongation | 4e-103 | yes |
| 1018 | 220 | 0.256 | 39 | 10 | Mitochondrial Cytochrome c oxidase subunit III | 4e-95 | |
| 1018 | 529 | 0.211 | 38 | 8 | Mitochondrial Cytochrome c oxidase subunit III | 4e-95 | |
| 1061 | 319 | 0.222 | 18 | 4 | | | |
| 1067 | 205 | 0.5 | 12 | 6 | rpl-3, F13B10.2: 60S ribosomal protein L3 | 1e-120 | |
| 1073 | 275 | 0.364 | 11 | 4 | ribosomal protein S29 / cytosolic small ribosomal subunit | 6e-19 | |
| 1131 | 584 | 0.283 | 53 | 15 | mitochondrial ATPase subunit 6 / hydrogen-exporting ATPase | 2e-53 | |
| 1164 | 58 | 0.25 | 16 | 4 | | | |
| 1174 | 98 | 0.455 | 11 | 5 | vha-10, CBG14904: Probable vacuolar ATP synthase subunit G | 2e-18 | |
| 1180 | 150 | 0.129 | 31 | 4 | ribosomal protein S13 / RNA binding | 1e-71 | |
| 1180 | 234 | 0.469 | 32 | 15 | ribosomal protein S13 / RNA binding | 1e-71 | |
| 1180 | 287 | 0.286 | 28 | 8 | ribosomal protein S13 / RNA binding | 1e-71 | |
| 1184 | 133 | 0.333 | 27 | 9 | zgc:92066 / cellular component | 2e-57 | |
| 1184 | 439 | 0.486 | 35 | 17 | zgc:92066 / cellular component | 2e-57 | |
| 1184 | 539 | 0.438 | 32 | 14 | zgc:92066 / cellular component | 2e-57 | |
| 1184 | 625 | 0.121 | 33 | 4 | zgc:92066 / cellular component | 2e-57 | |
| 1184 | 666 | 0.455 | 22 | 10 | zgc:92066 / cellular component | 2e-57 | |
| 1184 | 720 | 0.211 | 19 | 4 | zgc:92066 / cellular component | 2e-57 | |
| 1185 | 613 | 0.4 | 10 | 4 | Ribosomal protein L23A / cytosolic large ribosomal subunit | 6e-48 | |

| 1191 | 273 | 0.429 | 21 | 9 | Prss2: protease, serine, 2 / multifunctional | 9e-55 | |
| 1191 | 328 | 0.19 | 21 | 4 | Prss2: protease, serine, 2 / multifunctional | 9e-55 | |
| 1191 | 382 | 0.238 | 21 | 5 | Prss2: protease, serine, 2 / multifunctional | 9e-55 | |
| 1191 | 502 | 0.25 | 20 | 5 | Prss2: protease, serine, 2 / multifunctional | 9e-55 | |
| 1191 | 538 | 0.238 | 21 | 5 | Prss2: protease, serine, 2 / multifunctional | 9e-55 | |
| 1191 | 755 | 0.185 | 27 | 5 | Prss2: protease, serine, 2 / multifunctional | 9e-55 | |
| 1191 | 779 | 0.231 | 26 | 6 | Prss2: protease, serine, 2 / multifunctional | 9e-55 | |
| 1218 | 427 | 0.364 | 11 | 4 | | | |
| 1226 | 352 | 0.429 | 21 | 9 | Nme1: Nucleoside diphosphate kinase A / protein binding | 7e-66 | |
| 1226 | 489 | 0.429 | 21 | 9 | Nme1: Nucleoside diphosphate kinase A / protein binding | 7e-66 | |
| 1245 | 251 | 0.128 | 39 | 5 | RPL27: 60S ribosomal protein L27 | 6e-41 | |
| 1245 | 322 | 0.105 | 38 | 4 | RPL27: 60S ribosomal protein L27 | 6e-41 | |
| 1245 | 375 | 0.105 | 38 | 4 | RPL27: 60S ribosomal protein L27 | 6e-41 | |
| 1245 | 566 | 0.308 | 13 | 4 | RPL27: 60S ribosomal protein L27 | 6e-41 | |
| 1248 | 14 | 0.5 | 8 | 4 | | | |
| 1248 | 100 | 0.333 | 15 | 5 | | | |
| 1262 | 37 | 0.138 | 29 | 4 | Ribosomal protein S9 / cytosolic small ribosomal subunit | 2e-85 | |
| 1262 | 82 | 0.138 | 29 | 4 | Ribosomal protein S9 / cytosolic small ribosomal subunit | 2e-85 | |
| 1262 | 281 | 0.148 | 27 | 4 | Ribosomal protein S9 / cytosolic small ribosomal subunit | 2e-85 | |
| 1262 | 317 | 0.188 | 32 | 6 | Ribosomal protein S9 / cytosolic small ribosomal subunit | 2e-85 | |
| 1271 | 564 | 0.308 | 13 | 4 | CG17824 / structural constituent of peritrophic membrane | 1e-10 | |
| 1281 | 145 | 0.25 | 16 | 4 | TRY2_BOVIN: Anionic trypsin precursor | 3e-47 | |

| 1281 | 169 | 0.294 | 17 | 5 | TRY2_BOVIN: Anionic trypsin precursor | 3e-47 | |
| 1281 | 359 | 0.278 | 18 | 5 | TRY2_BOVIN: Anionic trypsin precursor | 3e-47 | |
| 1281 | 410 | 0.278 | 18 | 5 | TRY2_BOVIN: Anionic trypsin precursor | 3e-47 | |
| 1281 | 498 | 0.471 | 17 | 8 | TRY2_BOVIN: Anionic trypsin precursor | 3e-47 | |
| 1284 | 305 | 0.5 | 10 | 5 | | | |
| 1284 | 373 | 0.4 | 10 | 4 | | | |
| 1284 | 534 | 0.5 | 10 | 5 | | | |
| 1284 | 535 | 0.5 | 10 | 5 | | | |
| 1299 | 720 | 0.5 | 10 | 5 | HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl ion binding | 1e-50 | |
| 1299 | 1054 | 0.375 | 16 | 6 | HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl ion binding | 1e-50 | |
| 1301 | 33 | 0.167 | 30 | 5 | RpL28: Ribosomal protein L28 / cytosolic large ribosomal subunit | 5e-24 | |
| 1301 | 145 | 0.333 | 36 | 12 | RpL28: Ribosomal protein L28 / cytosolic large ribosomal subunit | 5e-24 | |
| 1301 | 193 | 0.406 | 32 | 13 | RpL28: Ribosomal protein L28 / cytosolic large ribosomal subunit | 5e-24 | |
| 1301 | 217 | 0.156 | 32 | 5 | RpL28: Ribosomal protein L28 / cytosolic large ribosomal subunit | 5e-24 | |
| 1301 | 235 | 0.278 | 36 | 10 | RpL28: Ribosomal protein L28 / cytosolic large ribosomal subunit | 5e-24 | |
| 1301 | 301 | 0.304 | 23 | 7 | RpL28: Ribosomal protein L28 / cytosolic large ribosomal subunit | 5e-24 | |
| 1301 | 304 | 0.385 | 26 | 10 | RpL28: Ribosomal protein L28 / cytosolic large ribosomal subunit | 5e-24 | |
| 1301 | 374 | 0.435 | 23 | 10 | RpL28: Ribosomal protein L28 / cytosolic large ribosomal subunit | 5e-24 | |
| 1301 | 375 | 0.458 | 24 | 11 | RpL28: Ribosomal protein L28 / cytosolic large ribosomal subunit | 5e-24 | |
| 1310 | 325 | 0.222 | 18 | 4 | RPS3, OK/SW-cl.26: 40S ribosomal protein S3 | 4e-107 | |

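| 1325 | 436 | 0.25 | 20 | 5 | zgc:77734 / cellular component | 2e-28 | |
| 1385 | 39 | 0.333 | 24 | 8 | RPL30: 60S ribosomal protein L30 / large ribosomal subunit | 3e-46 | |
| 1391 | 144 | 0.254 | 63 | 16 | | | |
| 1391 | 219 | 0.134 | 67 | 9 | | | |
| 1391 | 252 | 0.113 | 62 | 7 | | | |
| 1391 | 257 | 0.224 | 49 | 11 | | | |
| 1391 | 267 | 0.149 | 47 | 7 | | | |
| 1391 | 302 | 0.345 | 55 | 19 | | | |
| 1391 | 304 | 0.185 | 54 | 10 | | | |
| 1391 | 329 | 0.109 | 55 | 6 | | | |
| 1391 | 350 | 0.136 | 59 | 8 | | | |
| 1391 | 539 | 0.157 | 51 | 8 | | | |
| 1391 | 614 | 0.167 | 42 | 7 | | | |
| 1391 | 629 | 0.25 | 16 | 4 | | | |
| 1400 | 231 | 0.37 | 27 | 10 | | | |
| 1400 | 324 | 0.407 | 27 | 11 | | | |
| 1400 | 428 | 0.348 | 23 | 8 | | | |
| 1400 | 457 | 0.286 | 21 | 6 | | | |
| 1400 | 569 | 0.312 | 16 | 5 | | | |
| 1400 | 689 | 0.5 | 12 | 6 | | | |
| 1417 | 142 | 0.267 | 15 | 4 | | | |
| 1417 | 363 | 0.308 | 13 | 4 | | | no |
| 1417 | 638 | 0.5 | 8 | 4 | | | |
| 1446 | 297 | 0.286 | 14 | 4 | RpL18: Ribosomal protein L18 / cytosolic large ribosomal subunit | 9e-70 | |
| 1447 | 570 | 0.211 | 19 | 4 | Ef1alpha48D: Elongation factor 1alpha48D/translation elongation | 0.0 | |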

| 1447 | 605 | 0.231 | 26 | 4 | Ef1alpha48D: Elongation factor 1alpha48D/translation elongation | 0.0 | |
| 1447 | 1174 | 0.25 | 16 | 6 | Ef1alpha48D: Elongation factor 1alpha48D/translation elongation | 0.0 | |
| 1447 | 1404 | 0.143 | 28 | 4 | Ef1alpha48D: Elongation factor 1alpha48D/translation elongation | 0.0 | |
| 1456 | 238 | 0.4 | 10 | 4 | ZDB-GENE-050320-41 / cellular component | 6e-33 | |
| 1457 | 433 | 0.174 | 23 | 4 | RpL9: Ribosomal protein L9 / large ribosomal subunit | 3e-66 | |
| 1457 | 525 | 0.19 | 21 | 4 | RpL9: Ribosomal protein L9 / large ribosomal subunit | 3e-66 | |
| 1457 | 667 | 0.308 | 13 | 4 | RpL9: Ribosomal protein L9 / large ribosomal subunit | 3e-66 | |
| 1464 | 692 | 0.364 | 11 | 4 | Rack1: Receptor of activated protein kinase C 1 | 3e-156 | |
| 1464 | 859 | 0.286 | 14 | 4 | Rack1: Receptor of activated protein kinase C 1 | 3e-156 | |
| 1467 | 441 | 0.467 | 15 | 7 | RPL6: 60S ribosomal protein L6 / large ribosomal subunit | 8e-31 | |
| 1467 | 565 | 0.308 | 13 | 4 | RPL6: 60S ribosomal protein L6 / large ribosomal subunit | 8e-31 | |
| 1487 | 663 | 0.286 | 14 | 4 | | | |
| 1523 | 41 | 0.4 | 10 | 4 | | | |
| 1523 | 72 | 0.4 | 10 | 4 | | | |
| 1586 | 164 | 0.308 | 13 | 4 | HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl ion binding | 2e-33 | |
| 1586 | 464 | 0.267 | 15 | 4 | HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl ion binding | 2e-33 | |
| 1599 | 495 | 0.4 | 10 | 4 | | | |
| 1634 | 142 | 0.4 | 10 | 4 | HCD: Hemocyanin D chain / Cu, Cl ion binding | 6e-13 | |
| 1659 | 167 | 0.357 | 14 | 5 | RPS5: 40S ribosomal protein S5 / small ribosomal subunit | 1e-97 | |
| 1659 | 378 | 0.292 | 24 | 7 | RPS5: 40S ribosomal protein S5 / small ribosomal subunit | 1e-97 | |

| 1659 | 433 | 0.227 | 22 | 5 | RPS5: 40S ribosomal protein S5 / small ribosomal subunit | 1e-97 | no |
| 1803 | 332 | 0.5 | 8 | 4 | TGM1, KTG: Protein-glutamine gamma-glutamyltransferase K | 1e-15 | |
| 1803 | 697 | 0.143 | 28 | 4 | TGM1, KTG: Protein-glutamine gamma-glutamyltransferase K | 1e-15 | |
| 1803 | 720 | 0.121 | 33 | 4 | TGM1, KTG: Protein-glutamine gamma-glutamyltransferase K | 1e-15 | |
| 1803 | 745 | 0.333 | 33 | 11 | TGM1, KTG: Protein-glutamine gamma-glutamyltransferase K | 1e-15 | |
| 1803 | 769 | 0.424 | 33 | 14 | TGM1, KTG: Protein-glutamine gamma-glutamyltransferase K | 1e-15 | |
| 1803 | 802 | 0.148 | 27 | 4 | TGM1, KTG: Protein-glutamine gamma-glutamyltransferase K | 1e-15 | |
| 1803 | 873 | 0.133 | 30 | 4 | TGM1, KTG: Protein-glutamine gamma-glutamyltransferase K | 1e-15 | |
| 1803 | 901 | 0.118 | 34 | 4 | TGM1, KTG: Protein-glutamine gamma-glutamyltransferase K | 1e-15 | |
| 1803 | 936 | 0.217 | 46 | 10 | TGM1, KTG: Protein-glutamine gamma-glutamyltransferase K | 1e-15 | |
| 1803 | 1016 | 0.152 | 33 | 5 | TGM1, KTG: Protein-glutamine gamma-glutamyltransferase K | 1e-15 | |
| 1803 | 1081 | 0.24 | 25 | 6 | TGM1, KTG: Protein-glutamine gamma-glutamyltransferase K | 1e-15 | |
| 1803 | 1096 | 0.161 | 31 | 5 | TGM1, KTG: Protein-glutamine gamma-glutamyltransferase K | 1e-15 | |
| 1803 | 1097 | 0.174 | 23 | 4 | TGM1, KTG: Protein-glutamine gamma-glutamyltransferase K | 1e-15 | |
| 2158 | 506 | 0.286 | 14 | 4 | Nap1l1: nucleosome assembly protein 1-like 1 | 6e-12 | |
| 2201 | 490 | 0.351 | 37 | 13 | | | |
| 2201 | 1189 | 0.467 | 15 | 7 | | | |

| 2283 | 288 | 0.179 | 28 | 5 | | | |
| 2283 | 467 | 0.154 | 26 | 4 | | | |
| 2283 | 516 | 0.217 | 23 | 5 | | | |
| 2283 | 614 | 0.357 | 14 | 5 | | | |
| 2294 | 418 | 0.5 | 8 | 4 | CG2145 / serine-type peptidase activity | 1e-33 | |
| 2294 | 442 | 0.5 | 8 | 4 | CG2145 / serine-type peptidase activity | 1e-33 | |
| 2294 | 463 | 0.5 | 8 | 4 | CG2145 / serine-type peptidase activity | 1e-33 | |
| 2313 | 108 | 0.364 | 11 | 4 | sqh: spaghetti squash / ATPase activity (coupled), cellularization | 1e-75 | |
| 2313 | 204 | 0.364 | 11 | 4 | sqh: spaghetti squash / ATPase activity (coupled), cellularization | 1e-75 | no |
| 2313 | 255 | 0.364 | 11 | 4 | sqh: spaghetti squash / ATPase activity (coupled), cellularization | 1e-75 | no |
| 2313 | 431 | 0.364 | 11 | 4 | sqh: spaghetti squash / ATPase activity (coupled), cellularization | 1e-75 | |
| 2313 | 545 | 0.211 | 19 | 4 | sqh: spaghetti squash / ATPase activity (coupled), cellularization | 1e-75 | |
| 2313 | 658 | 0.286 | 21 | 6 | sqh: spaghetti squash / ATPase activity (coupled), cellularization | 1e-75 | |
| 2329 | 149 | 0.14 | 50 | 7 | Qm / cytosolic large ribosomal subunit | 2e-105 | |
| 2329 | 285 | 0.185 | 54 | 10 | Qm / cytosolic large ribosomal subunit | 2e-105 | |
| 2332 | 220 | 0.444 | 9 | 4 | Rps6: 40S ribosomal protein S6 / glucose homeostasis | 3e-79 | |
| 2332 | 422 | 0.211 | 19 | 4 | Rps6: 40S ribosomal protein S6 / glucose homeostasis | 3e-79 | |
| 2352 | 216 | 0.343 | 108 | 37 | | | |
| 2352 | 421 | 0.317 | 104 | 33 | | | |
| 2352 | 565 | 0.286 | 91 | 26 | | | |
| 2352 | 636 | 0.163 | 98 | 16 | | | |
| 2352 | 813 | 0.212 | 33 | 7 | | | |
| 2464 | 493 | 0.429 | 14 | 6 | | | |

| 2464 | 804 | 0.429 | 14 | 6 | | | |
| 2464 | 930 | 0.308 | 13 | 4 | | | |
| 2474 | 329 | 0.385 | 13 | 5 | Rpl31: ribosomal protein L31 / cytosolic large ribosomal subunit | 2e-29 | |
| 2478 | 444 | 0.429 | 14 | 6 | ANXA11, ANX11: Annexin A11 / nuclear envelope | 6e-09 | |
| 2478 | 466 | 0.286 | 14 | 4 | ANXA11, ANX11: Annexin A11 / nuclear envelope | 6e-09 | |
| 2478 | 556 | 0.429 | 14 | 6 | ANXA11, ANX11: Annexin A11 / nuclear envelope | 6e-09 | |
| 2478 | 877 | 0.5 | 12 | 6 | ANXA11, ANX11: Annexin A11 / nuclear envelope | 6e-09 | |
| 2478 | 912 | 0.333 | 12 | 4 | ANXA11, ANX11: Annexin A11 / nuclear envelope | 6e-09 | |
| 2585 | 93 | 0.222 | 27 | 6 | mt:ND5: mitochondrial NADH-ubiquinone oxidoreductase chain 5 | 2e-110 | |
| 2585 | 405 | 0.16 | 25 | 4 | mt:ND5: mitochondrial NADH-ubiquinone oxidoreductase chain 5 | 2e-110 | |
| 2631 | 594 | 0.5 | 8 | 4 | | | |
| 2655 | 121 | 0.174 | 23 | 4 | Rps20: ribosomal protein S20 / cytosolic small ribosomal subunit | 3e-47 | |
| 2669 | 274 | 0.236 | 157 | 37 | | | |
| 2669 | 373 | 0.101 | 139 | 14 | | | |
| 2669 | 442 | 0.17 | 141 | 24 | | | |
| 2669 | 465 | 0.323 | 124 | 40 | | | |
| 2669 | 468 | 0.138 | 123 | 17 | | | |
| 2669 | 477 | 0.142 | 127 | 18 | | | |
| 2669 | 576 | 0.324 | 105 | 34 | | | |
| 2669 | 608 | 0.383 | 107 | 41 | | | |
| 2669 | 627 | 0.2 | 90 | 18 | | | |
| 2669 | 758 | 0.179 | 67 | 12 | | | |
| 2669 | 805 | 0.105 | 76 | 8 | | | |
| 2754 | 904 | 0.389 | 18 | 7 | RPL38: 60S ribosomal protein L38 / large ribosomal subunit | 4e-27 | |

| 2761 | 265 | 0.375 | 16 | 6 | RpL35A: Ribosomal protein L35A / large ribosomal subunit | 6e-38 | |
| 2764 | 450 | 0.267 | 15 | 4 | Lcp65Ag2 / structural constituent of chitin-based larval cuticle | 2e-13 | |
| 2769 | 129 | 0.235 | 17 | 4 | | | |
| 2769 | 162 | 0.455 | 11 | 5 | | | |
| 2770 | 234 | 0.286 | 14 | 4 | Rps11: ribosomal protein S11 / ribosomal structural constituent | 7e-63 | |
| 2782 | 545 | 0.312 | 16 | 5 | | | |
| 2782 | 602 | 0.333 | 12 | 4 | | | |
| 2788 | 202 | 0.286 | 14 | 4 | CG12330 / structural constituent of chitin-based cuticle | 5e-08 | |
| 2808 | 151 | 0.346 | 26 | 9 | RpL37a: Ribosomal protein L37a / large ribosomal subunit | 1e-37 | |
| 2814 | 401 | 0.376 | 93 | 35 | RpL26: Ribosomal protein L26 / large ribosomal subunit | 1e-52 | |
| 2814 | 547 | 0.203 | 64 | 13 | RpL26: Ribosomal protein L26 / large ribosomal subunit | 1e-52 | |
| 2814 | 550 | 0.2 | 20 | 4 | RpL26: Ribosomal protein L26 / large ribosomal subunit | 1e-52 | |
| 2816 | 708 | 0.121 | 33 | 4 | | | |
| 2817 | 223 | 0.4 | 15 | 6 | CG31997 / cellular component | 5e-18 | |
| 2817 | 375 | 0.429 | 14 | 6 | CG31997 / cellular component | 5e-18 | |
| 2817 | 379 | 0.5 | 16 | 8 | CG31997 / cellular component | 5e-18 | |
| 2820 | 207 | 0.118 | 34 | 4 | RpL29: Ribosomal protein L29 / large ribosomal subunit | 1e-16 | |
| 2820 | 324 | 0.394 | 33 | 13 | RpL29: Ribosomal protein L29 / large ribosomal subunit | 1e-16 | |
| 2820 | 416 | 0.455 | 11 | 5 | RpL29: Ribosomal protein L29 / large ribosomal subunit | 1e-16 | |
| 2826 | 54 | 0.438 | 16 | 7 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | |

| 2826 | 122 | 0.318 | 22 | 7 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | |
| 2826 | 146 | 0.25 | 24 | 6 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | |
| 2826 | 170 | 0.348 | 23 | 8 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | no |
| 2826 | 227 | 0.174 | 23 | 4 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | no |
| 2826 | 275 | 0.3 | 20 | 6 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | yes |
| 2826 | 278 | 0.409 | 22 | 9 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | yes |
| 2826 | 302 | 0.304 | 23 | 7 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | yes |
| 2826 | 305 | 0.304 | 23 | 7 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | no |
| 2826 | 359 | 0.435 | 23 | 10 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | yes |
| 2826 | 486 | 0.273 | 22 | 6 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | no |
| 2826 | 510 | 0.222 | 18 | 4 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | no |
| 2826 | 546 | 0.318 | 22 | 7 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | no |
| 2826 | 591 | 0.364 | 22 | 8 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | yes |
| 2826 | 618 | 0.25 | 16 | 4 | acta2: actin, alpha 2, smooth muscle, aorta | 1e-126 | yes |
| 2842 | 595 | 0.4 | 10 | 4 | RpS30: Ribosomal protein S30 / cytosolic small ribosomal subunit | 9e-23 | |
| 2843 | 1023 | 0.286 | 14 | 4 | Argk: Arginine kinase | 2e-169 | |
| 2851 | 613 | 0.5 | 10 | 5 | RPL21: 60S ribosomal protein L21 / large ribosomal subunit | 9e-55 | |
| 2852 | 31 | 0.154 | 26 | 4 | RPL27A: 60S ribosomal protein L27a / large ribosomal subunit | 2e-50 | |
| 2852 | 194 | 0.25 | 16 | 4 | RPL27A: 60S ribosomal protein L27a / large ribosomal subunit | 2e-50 | |
| 2852 | 537 | 0.417 | 12 | 5 | RPL27A: 60S ribosomal protein L27a / large ribosomal subunit | 2e-50 | |
| 2856 | 527 | 0.5 | 10 | 5 | Actin 57B / cytoskeleton organization and biogenesis | 1e-150 | yes |
| 2856 | 552 | 0.4 | 10 | 4 | Actin 57B / cytoskeleton organization and biogenesis | 1e-150 | yes |
| 2856 | 634 | 0.4 | 10 | 4 | Actin 57B / cytoskeleton organization and biogenesis | 1e-150 | no |
| 2856 | 657 | 0.333 | 12 | 4 | Actin 57B / cytoskeleton organization and biogenesis | 1e-150 | no |

| 2875 | 116 | 0.222 | 18 | 4 | Apod: apolipoprotein D / extracellular space | 1e-14 | |
| 2875 | 135 | 0.286 | 21 | 6 | Apod: apolipoprotein D / extracellular space | 1e-14 | |
| 2875 | 144 | 0.312 | 16 | 5 | Apod: apolipoprotein D / extracellular space | 1e-14 | |
| 2875 | 169 | 0.263 | 19 | 5 | Apod: apolipoprotein D / extracellular space | 1e-14 | |
| 2875 | 225 | 0.25 | 16 | 4 | Apod: apolipoprotein D / extracellular space | 1e-14 | |
| 2875 | 260 | 0.333 | 18 | 6 | Apod: apolipoprotein D / extracellular space | 1e-14 | |
| 2875 | 332 | 0.333 | 18 | 6 | Apod: apolipoprotein D / extracellular space | 1e-14 | |
| 2875 | 452 | 0.278 | 18 | 5 | Apod: apolipoprotein D / extracellular space | 1e-14 | |
| 2875 | 474 | 0.353 | 17 | 6 | Apod: apolipoprotein D / extracellular space | 1e-14 | |
| 2875 | 484 | 0.357 | 14 | 5 | Apod: apolipoprotein D / extracellular space | 1e-14 | |
| 2875 | 517 | 0.222 | 18 | 4 | Apod: apolipoprotein D / extracellular space | 1e-14 | |
| 2875 | 562 | 0.368 | 19 | 7 | Apod: apolipoprotein D / extracellular space | 1e-14 | |
| 2886 | 145 | 0.207 | 29 | 6 | | | |
| 2886 | 215 | 0.138 | 29 | 4 | | | |
| 2886 | 252 | 0.167 | 30 | 5 | | | |
| 2886 | 428 | 0.2 | 30 | 6 | | | |
| 2886 | 598 | 0.273 | 22 | 6 | | | |
| 2918 | 85 | 0.19 | 21 | 4 | | | |
| 2918 | 103 | 0.2 | 20 | 4 | | | |
| 2918 | 124 | 0.238 | 21 | 5 | | | |
| 2918 | 203 | 0.3 | 20 | 6 | | | |
| 2918 | 617 | 0.368 | 19 | 7 | | | |
| 2922 | 546 | 0.235 | 17 | 4 | | | |
| 2929 | 29 | 0.444 | 9 | 4 | CG32230 / mitochondrial electron transport, NADH to ubiquinone | 5e-27 | |
| 2929 | 348 | 0.385 | 13 | 5 | CG32230 / mitochondrial electron transport, NADH to ubiquinone | 5e-27 | |

| 2929 | 439 | 0.333 | 12 | 4 | CG32230 / mitochondrial electron transport, NADH to ubiquinone | 5e-27 | |
| 2933 | 529 | 0.174 | 23 | 4 | Lyz: lysozyme / cell wall catabolic process | 1e-29 | |
| 2935 | 55 | 0.333 | 30 | 10 | APLP2, APPL2: Amyloid-like protein 2 precursor | 8e-13 | |
| 2935 | 82 | 0.361 | 36 | 13 | APLP2, APPL2: Amyloid-like protein 2 precursor | 8e-13 | |
| 2935 | 112 | 0.119 | 42 | 5 | APLP2, APPL2: Amyloid-like protein 2 precursor | 8e-13 | |
| 2935 | 136 | 0.195 | 41 | 8 | APLP2, APPL2: Amyloid-like protein 2 precursor | 8e-13 | |
| 2935 | 149 | 0.209 | 43 | 9 | APLP2, APPL2: Amyloid-like protein 2 precursor | 8e-13 | |
| 2935 | 195 | 0.455 | 44 | 20 | APLP2, APPL2: Amyloid-like protein 2 precursor | 8e-13 | |
| 2935 | 256 | 0.432 | 44 | 19 | APLP2, APPL2: Amyloid-like protein 2 precursor | 8e-13 | |
| 2935 | 298 | 0.227 | 44 | 10 | APLP2, APPL2: Amyloid-like protein 2 precursor | 8e-13 | |
| 2935 | 320 | 0.239 | 46 | 11 | APLP2, APPL2: Amyloid-like protein 2 precursor | 8e-13 | |
| 2935 | 392 | 0.111 | 45 | 5 | APLP2, APPL2: Amyloid-like protein 2 precursor | 8e-13 | |
| 2935 | 422 | 0.171 | 35 | 6 | APLP2, APPL2: Amyloid-like protein 2 precursor | 8e-13 | |
| 2935 | 429 | 0.364 | 44 | 16 | APLP2, APPL2: Amyloid-like protein 2 precursor | 8e-13 | |
| 2939 | 74 | 0.1 | 80 | 8 | | | |
| 2939 | 268 | 0.191 | 94 | 18 | | | |
| 2939 | 340 | 0.223 | 94 | 21 | | | |
| 2939 | 359 | 0.192 | 99 | 19 | | | |
| 2939 | 531 | 0.221 | 86 | 19 | | | |
| 2939 | 614 | 0.176 | 68 | 12 | | | |
| 2939 | 680 | 0.152 | 33 | 5 | | | |
| 2975 | 450 | 0.208 | 53 | 11 | mt:CoII: mitochondrial Cytochrome c oxidase subunit II | 3e-77 | |
| 2975 | 725 | 0.316 | 19 | 6 | mt:CoII: mitochondrial Cytochrome c oxidase subunit II | 3e-77 | |
| 2995 | 265 | 0.387 | 31 | 12 | mt:CoI: mitochondrial Cytochrome c oxidase subunit I | 8e-126 | |
| 2995 | 308 | 0.353 | 34 | 12 | mt:CoI: mitochondrial Cytochrome c oxidase subunit I | 8e-126 | |

| 2995 | 327 | 0.353 | 34 | 12 | mt:CoI: mitochondrial Cytochrome c oxidase subunit I | 8e-126 | |
| 2995 | 463 | 0.5 | 46 | 23 | mt:CoI: mitochondrial Cytochrome c oxidase subunit I | 8e-126 | |
| 2995 | 980 | 0.414 | 29 | 12 | mt:CoI: mitochondrial Cytochrome c oxidase subunit I | 8e-126 | |
| 3014 | 129 | 0.138 | 29 | 4 | RPL5, MSTP030: 60S ribosomal protein L5/large ribosomal subunit | 1e-83 | |
| 3014 | 174 | 0.31 | 29 | 9 | RPL5, MSTP030: 60S ribosomal protein L5/large ribosomal subunit | 1e-83 | |
| 3014 | 536 | 0.294 | 17 | 5 | RPL5, MSTP030: 60S ribosomal protein L5/large ribosomal subunit | 1e-83 | |
| 3014 | 582 | 0.267 | 15 | 4 | RPL5, MSTP030: 60S ribosomal protein L5/large ribosomal subunit | 1e-83 | |
| 3032 | 243 | 0.164 | 55 | 9 | RpS25: Ribosomal protein S25 / small ribosomal subunit | 1e-28 | |
| 3032 | 261 | 0.179 | 56 | 10 | RpS25: Ribosomal protein S25 / small ribosomal subunit | 1e-28 | |
| 3032 | 351 | 0.125 | 56 | 7 | RpS25: Ribosomal protein S25 / small ribosomal subunit | 1e-28 | |
| 3032 | 401 | 0.222 | 36 | 8 | RpS25: Ribosomal protein S25 / small ribosomal subunit | 1e-28 | |
| 3032 | 402 | 0.161 | 31 | 5 | RpS25: Ribosomal protein S25 / small ribosomal subunit | 1e-28 | |
| 3038 | 625 | 0.5 | 8 | 4 | P82216: Unknown protein from 2D-page (Fragment) | 9e-107 | |
| 3038 | 662 | 0.286 | 14 | 4 | P82216: Unknown protein from 2D-page (Fragment) | 9e-107 | |
| 3040 | 43 | 0.364 | 11 | 4 | Cd63 antigen / endosome membrane | 4e-26 | |
| 3040 | 478 | 0.308 | 13 | 4 | Cd63 antigen / endosome membrane | 4e-26 | |
| 3040 | 626 | 0.417 | 12 | 5 | Cd63 antigen / endosome membrane | 4e-26 | |
| 3047 | 577 | 0.417 | 12 | 5 | Clec4e: C-type lectin domain family 4, member e / sugar binding | 2e-08 | |
| 3050 | 468 | 0.263 | 19 | 5 | RPL18A: 60S ribosomal protein L18a / large ribosomal subunit | 6e-66 | |

| 3071 | 354 | 0.375 | 16 | 6 | mt:ND2: mitochondrial NADH-ubiquinone oxidoreductase chain 2 | 1e-26 | |
| 3071 | 682 | 0.267 | 15 | 4 | mt:ND2: mitochondrial NADH-ubiquinone oxidoreductase chain 2 | 1e-26 | |
| 3071 | 712 | 0.286 | 14 | 4 | mt:ND2: mitochondrial NADH-ubiquinone oxidoreductase chain 2 | 1e-26 | |
| 3075 | 104 | 0.26 | 73 | 19 | lysoz2: Lysozyme 2 precursor / cytolysis | 1e-13 | |
| 3075 | 569 | 0.125 | 32 | 4 | lysoz2: Lysozyme 2 precursor / cytolysis | 1e-13 | |
| 3075 | 593 | 0.167 | 24 | 4 | lysoz2: Lysozyme 2 precursor / cytolysis | 1e-13 | |
| 3076 | 382 | 0.364 | 11 | 4 | eIF-2beta: Eukaryotic initiation factor 2beta / translation initiation | 3e-75 | |
| 3077 | 136 | 0.422 | 45 | 19 | mt:ND1: mitochondrial NADH-ubiquinone oxidoreductase chain 1 | 1e-103 | |
| 3077 | 205 | 0.409 | 44 | 18 | mt:ND1: mitochondrial NADH-ubiquinone oxidoreductase chain 1 | 1e-103 | no |
| 3077 | 276 | 0.409 | 44 | 18 | mt:ND1: mitochondrial NADH-ubiquinone oxidoreductase chain 1 | 1e-103 | yes |
| 3077 | 485 | 0.214 | 42 | 9 | mt:ND1: mitochondrial NADH-ubiquinone oxidoreductase chain 1 | 1e-103 | |
| 3077 | 677 | 0.353 | 17 | 6 | mt:ND1: mitochondrial NADH-ubiquinone oxidoreductase chain 1 | 1e-103 | |
| 3077 | 810 | 0.5 | 8 | 4 | mt:ND1: mitochondrial NADH-ubiquinone oxidoreductase chain 1 | 1e-103 | |
| 3082 | 136 | 0.231 | 26 | 6 | | | |
| 3082 | 175 | 0.429 | 28 | 12 | | | |
| 3082 | 420 | 0.273 | 33 | 9 | | | |
| 3082 | 422 | 0.121 | 33 | 4 | | | |
| 3082 | 579 | 0.375 | 24 | 9 | | | |
| 3082 | 602 | 0.5 | 8 | 4 | | | |
| 3084 | 81 | 0.429 | 21 | 9 | RpL7A: Ribosomal protein L7A / large ribosomal subunit | 7e-71 | |

| 3084 | 107 | 0.476 | 21 | 10 | RpL7A: Ribosomal protein L7A / large ribosomal subunit | 7e-71 | |
| 3084 | 108 | 0.304 | 23 | 7 | RpL7A: Ribosomal protein L7A / large ribosomal subunit | 7e-71 | |
| 3084 | 240 | 0.182 | 22 | 4 | RpL7A: Ribosomal protein L7A / large ribosomal subunit | 7e-71 | |
| 3087 | 410 | 0.286 | 14 | 4 | Rps3a: ribosomal protein S3a / induction of apoptosis | 4e-96 | |
| 3102 | 73 | 0.14 | 57 | 8 | Rpl14: ribosomal protein L14 / ribosomal structural constituent | 5e-31 | |
| 3102 | 196 | 0.172 | 64 | 11 | Rpl14: ribosomal protein L14 / ribosomal structural constituent | 5e-31 | |
| 3107 | 519 | 0.114 | 35 | 4 | Rps15: ribosomal protein S15 / ribosomal structural constituent | 8e-52 | |
| 3107 | 520 | 0.185 | 27 | 5 | Rps15: ribosomal protein S15 / ribosomal structural constituent | 8e-52 | |
| 3117 | 146 | 0.308 | 26 | 8 | RPL12: 60S ribosomal protein L12 / large ribosomal subunit | 2e-68 | |
| 3120 | 144 | 0.444 | 9 | 4 | PRDX6, AOP2, KIAA0106: Peroxiredoxin-6 / phospholipase A2 | 3e-83 | |
| 3124 | 87 | 0.5 | 8 | 4 | | | |
| 3124 | 118 | 0.5 | 8 | 4 | | | |
| 3124 | 295 | 0.4 | 10 | 4 | | | |
| 3124 | 485 | 0.5 | 10 | 5 | | | |
| 3139 | 351 | 0.235 | 17 | 4 | RpLP0: Ribosomal protein LP0 / DNA lyase activity | 8e-119 | |
| 3139 | 459 | 0.263 | 19 | 5 | RpLP0: Ribosomal protein LP0 / DNA lyase activity | 8e-119 | |
| 3154 | 357 | 0.4 | 10 | 4 | | | no |
| 3154 | 524 | 0.4 | 10 | 4 | | | |
| 3161 | 291 | 0.5 | 8 | 4 | | | |
| 3165 | 59 | 0.182 | 22 | 4 | CG9299 / structural constituent of chitin-based cuticle | 8e-08 | |
| 3165 | 120 | 0.273 | 22 | 6 | CG9299 / structural constituent of chitin-based cuticle | 8e-08 | |

| 3165 | 185 | 0.333 | 24 | 8 | CG9299 / structural constituent of chitin-based cuticle | 8e-08 | |
| 3165 | 205 | 0.375 | 24 | 9 | CG9299 / structural constituent of chitin-based cuticle | 8e-08 | |
| 3165 | 230 | 0.25 | 24 | 6 | CG9299 / structural constituent of chitin-based cuticle | 8e-08 | |
| 3165 | 275 | 0.391 | 23 | 9 | CG9299 / structural constituent of chitin-based cuticle | 8e-08 | |
| 3165 | 329 | 0.28 | 25 | 7 | CG9299 / structural constituent of chitin-based cuticle | 8e-08 | |
| 3165 | 411 | 0.28 | 25 | 7 | CG9299 / structural constituent of chitin-based cuticle | 8e-08 | |
| 3165 | 472 | 0.217 | 23 | 5 | CG9299 / structural constituent of chitin-based cuticle | 8e-08 | |
| 3172 | 147 | 0.462 | 13 | 6 | RPS18: 40S ribosomal protein S18 / small ribosomal subunit | 1e-63 | yes |
| 3174 | 443 | 0.4 | 10 | 4 | RpL13A: Ribosomal protein L13A / large ribosomal subunit | 3e-56 | |
| 3181 | 170 | 0.444 | 9 | 4 | tsr: twinstar / actin binding, female gonad development | 3e-54 | |
| 3181 | 1151 | 0.267 | 15 | 4 | tsr: twinstar / actin binding, female gonad development | 3e-54 | |
| 3181 | 1206 | 0.5 | 8 | 4 | tsr: twinstar / actin binding, female gonad development | 3e-54 | |
| 3187 | 534 | 0.286 | 14 | 4 | Cp1: Cysteine proteinase-1 / cathepsin L activity | 2e-26 | |
| 3190 | 97 | 0.27 | 63 | 17 | RpLP1: Ribosomal protein LP1 / large ribosomal subunit | 4e-23 | |
| 3192 | 552 | 0.316 | 19 | 6 | | | |
| 3192 | 619 | 0.267 | 15 | 4 | | | |
| 3192 | 677 | 0.308 | 13 | 4 | | | |
| 3192 | 859 | 0.385 | 13 | 5 | | | |
| 3192 | 952 | 0.333 | 15 | 5 | | | |
| 3192 | 975 | 0.357 | 14 | 5 | | | |
| 3192 | 1054 | 0.364 | 11 | 4 | | | |
| 3205 | 195 | 0.5 | 8 | 4 | | | |
| 3205 | 244 | 0.5 | 8 | 4 | | | |

| 3205 | 308 | 0.5 | 8 | 4 | | | |
| 3205 | 355 | 0.5 | 8 | 4 | | | |
| 3205 | 417 | 0.5 | 8 | 4 | | | |
| 3205 | 506 | 0.5 | 8 | 4 | | | |
| 3223 | 64 | 0.286 | 21 | 6 | RpL11: Ribosomal protein L11 / large ribosomal subunit | 2e-77 | |
| 3223 | 190 | 0.182 | 22 | 4 | RpL11: Ribosomal protein L11 / large ribosomal subunit | 2e-77 | |
| 3223 | 340 | 0.2 | 20 | 4 | RpL11: Ribosomal protein L11 / large ribosomal subunit | 2e-77 | |
| 3223 | 497 | 0.273 | 22 | 6 | RpL11: Ribosomal protein L11 / large ribosomal subunit | 2e-77 | |
| 3241 | 408 | 0.312 | 16 | 5 | mt:ND4: mitochondrial NADH-ubiquinone oxidoreductase chain 4 | 2e-107 | |
| 3241 | 529 | 0.312 | 16 | 5 | mt:ND4: mitochondrial NADH-ubiquinone oxidoreductase chain 4 | 2e-107 | |
| 3241 | 570 | 0.5 | 12 | 6 | mt:ND4: mitochondrial NADH-ubiquinone oxidoreductase chain 4 | 2e-107 | |
| 3243 | 121 | 0.286 | 14 | 4 | COTL1, CLP: Coactosin-like protein / actin binding | 3e-32 | |
| 3243 | 378 | 0.182 | 22 | 4 | COTL1, CLP: Coactosin-like protein / actin binding | 3e-32 | |
| 3301 | 107 | 0.426 | 47 | 20 | Rps27: ribosomal protein S27 / cell proliferation, zinc ion binding | 2e-38 | |
| 3312 | 749 | 0.364 | 11 | 4 | Mlc-c: Myosin light chain cytoplasmic / ATPase activity | 9e-62 | |
| 3371 | 248 | 0.5 | 12 | 6 | | | |
| 3371 | 283 | 0.364 | 11 | 4 | | | |
| 3371 | 488 | 0.417 | 12 | 5 | | | |
| 3371 | 807 | 0.455 | 11 | 5 | | | |
| 3383 | 520 | 0.222 | 18 | 4 | | | |

Table S1. (continued)

Contig SNP Minor Sequence SNP Matching Protein from BLAST/ BLAST Confirmed?3 Number position Nucleotide count identifier Its Function or Location e-value2 Frequency Score1

3387 1254 0.222 72 16 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 7e-99 ion binding 3387 1933 0.188 48 9 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 7e-99 ion binding 3387 2010 0.174 46 8 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 7e-99 ion binding 3387 2065 0.174 46 8 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 7e-99 ion binding 3387 2146 0.143 28 4 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 7e-99 ion binding 3387 2147 0.273 22 6 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 7e-99

ion binding 71

3393 666 0.167 24 4 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 5e-52 ion binding 3393 729 0.222 18 4 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 5e-52 ion binding 3393 1014 0.4 10 4 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 5e-52 ion binding 3393 1019 0.4 10 4 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 5e-52 ion binding 3397 319 0.368 19 7 RPL35: 60S ribosomal protein L35 / large ribosomal 3e-30 subunit 3419 128 0.235 17 4 3419 260 0.294 17 5 3419 334 0.267 15 4 3431 1014 0.333 12 4 3431 1181 0.4 10 4 3434 490 0.4 15 6 mt:CoI: mitochondrial Cytochrome c oxidase subunit I 5e-148 3434 634 0.368 19 7 mt:CoI: mitochondrial Cytochrome c oxidase subunit I 5e-148 3434 923 0.391 23 9 mt:CoI: mitochondrial Cytochrome c oxidase subunit I 5e-148 3434 1063 0.267 15 4 mt:CoI: mitochondrial Cytochrome c oxidase subunit I 5e-148

Table S1. (continued)

Contig SNP Minor Sequence SNP Matching Protein from BLAST/ BLAST Confirmed?3 Number position Nucleotide count identifier Its Function or Location e-value2 Frequency Score1

3441 409 0.174 23 4 3441 1304 0.274 376 103 3441 1390 0.118 204 24 3441 1397 0.233 103 24 3441 1410 0.143 63 9 3441 1411 0.29 31 9 3482 598 0.5 8 4 CYCS, CYC: Cytochrome c / electron transporter 3e-34 3486 49 0.4 10 4 zgc:56134 / cellular component 3e-59 3486 90 0.4 10 4 zgc:56134 / cellular component 3e-59 3505 349 0.444 9 4 Act42A: Actin 42A / structural constituent of 2e-173 yes cytoskeleton 72 3505 560 0.412 17 7 Act42A: Actin 42A / structural constituent of 2e-173 yes cytoskeleton 3505 641 0.286 14 4 Act42A: Actin 42A / structural constituent of 2e-173 yes cytoskeleton 3509 565 0.267 15 4 fabp1b: fatty acid binding protein 1b 5e-11 3509 586 0.412 17 7 fabp1b: fatty acid binding protein 1b 5e-11 3511 619 0.286 14 4 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 7e-40 ion binding 3511 887 0.375 16 6 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 7e-40 ion binding 3511 944 0.467 15 7 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 7e-40 no ion binding 3511 956 0.286 14 4 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 7e-40 no ion binding 3511 972 0.333 15 5 HCY6_ANDAU: Hemocyanin AA6 chain / Cu, Cl 7e-40 yes ion binding 3527 460 0.364 11 4 Jon65Aiv: Jonah 65Aiv / proteolysis, chymotrypsin, 8e-51 endopeptidase

Table S1. (continued)

Contig SNP Minor Sequence SNP Matching Protein from BLAST/ BLAST Confirmed?3 Number position Nucleotide count identifier Its Function or Location e-value2 Frequency Score1

3527 938 0.5 8 4 Jon65Aiv: Jonah 65Aiv / proteolysis, chymotrypsin, 8e-51 endopeptidase 3532 327 0.381 21 8 Atp5g2: ATP synthase, H+ transporting, 4e-17 mitochondrial F0 complex

73

Table S2. Contigs sequenced to confirm predicted SNPs and the additional SNPs found within

Contig number   New SNPs   Sequenceᵃ

282 7 TATCTATCACACCTACCTGTTATTAGCTGTAGCAGATGGCTAGC[G/T]CCTGT[A/G]ACTCCTTAGC AAATATGGC[A/T]G[A/T][C/G][A/C]TTTGT[A/G]TGAGCAGACCAGAACGTGTCAC

372 0 TCACCCGTGATCCTAGCTTCTTTAGGCTGCATAAATACATGGATAACATCTTAAGGAACACAAGG ACAGTCTCCCTCCCTACCAGTGGAAGAACTRACATTTGCCGGTGTAAGTGTAGACAGCGTAGCAA TTGAAGGCTACA

1015 3 CCATCGTGGTGGTGGGCCACGTAGACTCTGGCAAGTCCACCACTACCGGTCATCTCATCTACAAA TGCGGTGGCATCGACAAGCGAACCATCGAGAAGTTCGAGAAGGAGTCCTCTGAGATGGGCAAGG GCTCCTTCAAGTACGCCTGGGTGCTGGACAAGCTGAAGGCCGAGCGCGAGCGCGG[C/T]AT[C/T] ACCATCGACATCGCCCTGTGGAAGTTCGAGACCAACAGGTTCTACGTGACCATCATCGATGCCCC

AGGCCATCGYGATTTCATCAAGAACATGATCACCGGAAC[A/G]TCCCAGGCCGACTGCGCCGTGC

TGATYGTGGCTGCCGGTACCGGCGAGTTCGAAGCTGGTATCTCGAAGAACGGGCAGACCCGCGA GCACGTGTTGCTGTGCTTCACCCTGGGTGTGAAGCA

1417 0 TCATCATTCAAGGTCCCAAAGCATGGCCTTGGCATCCTGCAAAAAAGAAATTAAATGAAAACTTG ATTTTTTATATTCATTAAATTAAATTCACACAGCCAAAAAAAAAAATATATATATATATATACAC AAAAAAAATAATAATAATAATTTACCTTGCCACCAGAAGCACAGAGGGAGCCGTACAGGGGAGA CGGTAACTGTGTTACAGGTAACC

1659 1 GGCTCAGTGCCCCATTGTGGAGCGTCTGACCAACTCCCTCATGATGCACGGTCG[C/T]AACAACG GCAAGAAGCTCATGGCCGTCCGCATTGTCAAGCACTCCTTCGAGATCATCCACCTCCTCACTGGA G

ᵃ There are four possible outcomes. Sequences contain: 1) no SNPs (contigs 1417, 3038, 3154); 2) only SNPs that were predicted by SNPidentifier (identified by a single-letter abbreviation) (contigs 372, 3077, 3172); 3) only SNPs that were not predicted (identified by the opposing alleles in brackets) (contigs 282, 1659, 2313); or 4) both predicted and unpredicted SNPs (contigs 1015, 2826, 2856, 3505, 3511).


2313 2 TTGTGTTTGCCATGTTTGACCAAGTTCAGATTCAGGAGTTCAAGGAAGCGTTCAACATGATTGAC CAGAATCGTGATGGATTCATTGACAAGGATGATCTGCATGATATGTTAGCTTCTCTAGGTGAATA TACATATTGATTTGAATTTGAAGTGTATATATATGTATATATAG[A/G]TATTATAGATGAAGATAT CATTAGAATTATAGATTTATATGTATAAAGCTGTACCTTTTTATACTTGGGATATTGTATGTATTTT TCTTTTCCTTTTTT[A/-]TGATTTCTGTTCAAATTTATAGATAAATTTTGAATGTTCTTACCTTGTATC ATAGACCTGTCT[A/G]TATCATAGTGATTTCTTATTTAGTAATATCACTGACTGCAATCATTATTTT TCCAGGCAAGAACCCAACTGATGACTACTTGGAAGGAATGATGAATGAGGCTCCAGGACCCATC AACTTCACCATGTTCCTCACTCTTTTCGGTGATCGTCTTCAGGGCACAGATAA

2826 13 TGCCGTTTTTTTCCCTCCATCGTCGGCCGTGCCCGTCACCAGGGTGTGATGGTCGGTATGGGTCAG AAGGACGCCTACGTCGGTGATGAGGCCCAGAGCAAGCGTGGTATCCTCAC[C/T]CTCAAGTACCC CATYGARCACGGTATCATCACCAACTGGGAYGACATGGAGAAGATCTGGTACCACACCTTCTAC AATGA[A/G]CTCCGTGTTGCCCCTGARGAGTCCCCCACACTTCTCACTGAGGCTCCCCTCAACCCC

AAGGCCAACCGTGAGAAGATGACTCAGATCATGTTCGAGTCCTTCAA[C/T]GTGCCTGCCAC[G/T] TA[C/T]GTTACCATCCAGGCTGT[C/G]CTGTCCCTGTACGCCTC[C/T]GGTCGTAC[C/T]ACTGGT[C/G]AGGTTTG[C/T]GACTCTGGTGACGGTGTGACTCAC[A/T]T[C/G]GTCCCCGTCTATGAAGGTTTC GCTCTTCCTCATGCTATCCTYCGTCTCGA[C/T]TTGGCTGGTCGTGACCTKACCCACTACCTGATG AAGATCATGACTGAGCGTGGCTACTCCTTCACCACCACCGCTGAACGTGAAATCGTTCGTGACAT CAAGGAGAAGCTTTGCTACATCGCCCTTGACTTCGAGAGTGAGATGAACGTTGCTGCTGCTTCCT CCTCCTTGGACAAGTCCTACGAGCTCCCCGACGGTA

2856 19 GACATGGAGAAGATCTGGCATCACTCCTTCTACAACGAACTCCGTGTTGCTCCTGAGGAATCTCC CGTCCTCCTCACTGAGGCTCCCCTCAACCCCAAGGCCAACCG[A/T]GAGAAGATGACCCAGATCA TGTTCGAGACCTTCAACACACCCGCCATGTACGT[A/G]GC[C/T]ATCCAGGCCGTGCTGTCCCT[C/G]TACGCCTC[A/T]GGTCGTAC[C/T]ACTGGTAT[C/T]GT[C/G]CTCGAY[A/T]CTGGTGA[C/T]GGTG TG[A/T]CCCACACYGTCCCCATCTACGAAGGTTATGCTCTTCCTCATGCTATCCTTCGTCTCGACTT GGCTGGTCGTGACCT[C/G]AC[C/T]GCTTACCTGATGAAGATCATGACTGAGCGTGGCTACTCCTT CACCACCACCGCTGAACGAGAAATCGTTCGTGACATCAAGGAGAAGCTTTGCTACGTCGCCCTTG ACTTCGAGAG[C/T]GAGATGAA[C/T]GTCGCTGCTGCTTC[C/T]TCATC[C/T]CTCGAGAAGTCCTA TGAACTTCCCGACGGTCAGGTCATCACCAT[C/T]GGCAACGAGCG[C/T]TTCCGTTGCCCCGAGTC TCTTTTCCAGCCTTTCCTTCCTA


3038 0 TCGAGGTACTGATGGTGCCCTCGTCGTGGTAGATTGCGTCTCTGGTGAGTATGCAAAAGATTCCA CATGATAAAAGTTCATCAGTGCTATAGCTACCACAATTTAGTTTCATCTAAGCAAAAGTAACATT TCAGGTGTGTAAGCAAAAGTAACATTTTCAGGTGTGTGTGTGCAGACCGAGACTGTACTGCGTCA GGCTATTGCTGAGCGCATCAAGCCAGTTCTGTTCATGAACAAGATGGACCGTGCTCTCCTTGAAT TGCAGCTCGAGCAGGAGGAATTGTACCAGACATTCCAGGTTTGTCTAGTGTCCTGTATATCTTGA CATTTGATTTGTTGCACTGGATCTTTTGAATTTGAATGATTTGGAGCATTTTAAATATAAA

3077 0 TTTGCCGATGCTGTTAAGTTGTTCACTAAAGAACAGACTCTACCAGTTATATCAAATTTTCTCCCT TATTACTTATCTCCTGTTTTTAGTCTTTTTGTGTCTTTAATTGTATGGYTAGTTATGCCGTATGAAC TGGGGTTAATAAATTTTAGTATGAGAACTTTGTTTTTTCTATGTTGTACAAGTCTAGGAGTGTATA CGACTATAAGAGCAGGGTGAGCTTCAAAAA

3154 0 GAATGGAGGAGCTTGTGGAGGAGGTGGATAGTAACCCTGCAGCTGTTGTAGATACAGTTGTTGCT

GCTGCTGTACATACAAGGACTGTTGCTGCTGCTGGTATGCTGCCTGCTGTGAGTAGGCAAGTTTTT

GAGCTTGCTGGGCAGCTGCCAAGTTCTGATAGTATGCAAGAGAGGCTGGGTCCTTATCCA

3172 0 TTAGCATATCTCTGCGTTTGATGAACACCAACATCGATGGCAGGCGTAAGGTTATGTTCGCCATG ACTTCCATCAGGGGTGTTGGTCGCCGTTACTCSAACATTGTCCTCAAGAAAGGCCGA

3505 1 TACTGG[G/T]ACCGACATGGAAAAGATCTGGCACCACACCTTCTACAACGAGCTCCGTGTSGCTC CCGAGGAACACCCTGTCCTGCTCACCGAGGCTCCCCTCAACCCCAAGGCCAACCGTGAGAAGAT GACACAGATTATGTTCGAGACCTTCAACAGCCCCGCCATGTACGTAGCCATCCAGGCTGTGCTTT CTCTGTACGCCTCCGGTCGTACCACGGGTATCGTGCTTGACTCTGGCGACGGCGTGTCCCACACA GTGCCCATCTAYGAGGGATATGCACTCCCTCACGCCATCCAGCGCCTCGACCTCGCTGGACGTGA CCTTACAGACTACCTGATGAAGATCCTRACCGAGCGCGGCCACACCTTCACGACCACCGCTGAGC GAGAAATCGTTCGTGACATCAAGGAGAAGCTGTGCTATGTTGCACTTGACTTCGAAGAGGAAAT GGCCACTTCCACACAGTCTTCTTCTCTTGAGAAATCTTACGAACTCCCCGACGGCCAGGTGATCA CCATCGGCAACGAGAGGTTCCGCTGCCCCGAGGCCATGTTCCAGCCTTTCCATCCTA

3511 1 CACTTCGGCCCATCCAGTTAAGGTCTTCAATCATGGTGAACATATCCATCACCATTAACTTGCAA[C/T]ATGAACATTCGCCACGAATAARTAGTACCATTCTAGTGATACCTGTTATTTGACCATGATGA GCAGCTAATAAGATGTGATT
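The notation described in the Table S2 footnote can be tallied mechanically. The following is a minimal sketch, not part of the original analysis: the function name `count_snps` is hypothetical, and treating a bracketed `A/-` as an indel rather than a SNP is an assumption (it does agree with the count of 2 new SNPs listed for contig 2313). Single-letter IUPAC ambiguity codes outside brackets are counted as SNPs predicted by SNPidentifier.

```python
import re

# IUPAC codes that stand for more than one nucleotide.
IUPAC_AMBIGUITY = set("RYSWKMBDHVN")


def count_snps(sequence):
    """Tally SNP annotations in a Table S2 style sequence string."""
    # Remove line-wrap whitespace, including any that fell inside brackets.
    seq = re.sub(r"\s+", "", sequence.upper())
    # Bracketed allele pairs such as [C/T] or [A/-].
    brackets = re.findall(r"\[([ACGT-])/([ACGT-])\]", seq)
    indels = sum(1 for a, b in brackets if "-" in (a, b))
    # Ambiguity codes are counted only outside the brackets.
    outside = re.sub(r"\[[^\]]*\]", "", seq)
    predicted = sum(1 for base in outside if base in IUPAC_AMBIGUITY)
    return {
        "unpredicted": len(brackets) - indels,
        "indels": indels,
        "predicted": predicted,
    }
```

Applied to an excerpt of contig 1659, `count_snps("CACGGTCG[C/T]AACAACG GCAAG")` reports one unpredicted SNP and no predicted SNPs, which matches the table row for that contig.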


CHAPTER 3. MINING ESTs TO DETERMINE THE USEFULNESS OF SNPs ACROSS SHRIMP SPECIES

A paper published in Animal Biotechnology. Volume 21(2):100-103.

Danielle M. Gorbach, Zhi-Liang Hu, Zhi-Qiang Du, Max F. Rothschild

ABSTRACT

Expressed sequence tag (EST) libraries from members of the Penaeidae family and brine shrimp (Artemia franciscana) are currently the primary source of sequence data for shrimp species. Penaeid shrimp are the most commonly farmed shrimp worldwide, but selection methods for improving shrimp are limited. A better understanding of shrimp genomics is needed for farmers to use genetic markers to select the best breeding animals. The ESTs from Litopenaeus vannamei have been previously mined for single nucleotide polymorphisms (SNPs). The present study took publicly available ESTs from nine shrimp species, excluding L. vannamei, clustered them with CAP3, predicted SNPs within them using SNPidentifier, and then analyzed whether the SNPs were intra- or interspecies. Major goals of the project were to predict SNPs that may distinguish shrimp species, locate SNPs that may segregate in multiple species, and determine the genetic similarities between L. vannamei and the other shrimp species based on their EST sequences. Overall, 4,597 SNPs were predicted from 4,600 contigs with 703 of them being interspecies SNPs, 735 of them possibly predicting species’ differences, and 18 of them appearing to segregate in multiple species. While sequences appear relatively well conserved, SNPs do not appear to be well conserved across shrimp species.


INTRODUCTION

The penaeid species of shrimp are the most commonly farmed, with Litopenaeus vannamei being the most popular among them. Currently very little is known about shrimp genomics. Breeders are interested in applying livestock genetic concepts such as marker-assisted selection (MAS), but increases in knowledge of the shrimp genome are needed. The first step is to locate single nucleotide polymorphisms (SNPs) for the creation of a linkage map. The current aim is to rapidly advance beyond a linkage map to designing SNP chips for discovery of quantitative trait loci (QTL) in shrimp.

Our lab previously mined a database of 25,937 L. vannamei expressed sequence tags (ESTs) for segregating SNPs.¹ Research² has shown that SNPs may be conserved across species. Thus, looking in conserved regions of the genome for individual bases which are not well conserved may be useful for locating SNPs that segregate in multiple related species. Currently there are thousands of publicly available ESTs from several shrimp species.

In this study, we tested the hypothesis that within- and between-species variations in common genes exist in shrimp species, and that these variations may be discovered using large-scale virtual sequence alignments for SNP detection. Thus, ESTs from multiple shrimp species were compared for several purposes: 1) to predict base positions which may uniquely identify shrimp species; 2) to predict SNPs that are segregating in multiple shrimp species and would be especially valuable for designing a generic shrimp SNP chip; and 3) to consider the genetic similarity of shrimp species through comparison of these results to those found with L. vannamei. A broader goal was to assess whether ESTs, which are widely available for many species, could be used to easily determine the transferability of genetic knowledge across species or populations.

MATERIALS AND METHODS

A total of 60,725 ESTs from species of shrimp excluding L. vannamei were downloaded (www.ncbi.nlm.nih.gov). Questionable ESTs were deleted using the criteria in Gorbach et al.¹ The remaining ESTs were clustered with CAP3³ to generate contigs. After excluding the ESTs which did not cluster, the contigs contained the following numbers of ESTs: 27,800 Artemia franciscana, 8,348 Fenneropenaeus chinensis, 5,087 Penaeus monodon, 668 L. setiferus, 173 L. stylirostris, 28 Marsupenaeus japonicus, 16 Palaemonetes pugio, 3 F. indicus, and 2 F. merguiensis. These contigs were mined for SNPs using a bioinformatics tool, SNPidentifier, as in Gorbach et al.¹ The SNPs were predicted where the EST sequences were completely conserved for 15 bases on each side of the predicted SNP and each allele at the predicted SNP position was observed at least twice. Predicted SNPs were examined to determine if base differences were intraspecies or only interspecies. Contigs were BLASTed against the previously described EST contigs from L. vannamei¹ for comparison using the blastn option under bl2seq. BLAST scores greater than 100 were considered significant.
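The two prediction criteria stated above (15 fully conserved bases on each side, and every allele seen at least twice) can be illustrated with a short sketch. This is not the SNPidentifier implementation, whose internals are not given here. The function names, the representation of a contig as equal-length aligned read strings with `-` for uncovered positions, and the strict reading that every observed base must meet the minimum count are all assumptions made for illustration.

```python
from collections import Counter

FLANK = 15            # conserved bases required on each side (from the text)
MIN_ALLELE_COUNT = 2  # each allele must be observed at least twice (from the text)


def column(reads, i):
    """Bases covering alignment column i; '-' marks no coverage (assumed format)."""
    return [r[i] for r in reads if i < len(r) and r[i] != "-"]


def predict_snps(reads, flank=FLANK, min_count=MIN_ALLELE_COUNT):
    """Return (0-based position, allele counts) for each column passing both rules."""
    length = max(len(r) for r in reads)
    counts = [Counter(column(reads, i)) for i in range(length)]
    # A column is conserved when all covering reads show the same base.
    conserved = [len(c) == 1 for c in counts]
    snps = []
    for i in range(flank, length - flank):
        c = counts[i]
        if len(c) < 2 or min(c.values()) < min_count:
            continue  # monomorphic, or some allele seen fewer than min_count times
        left = conserved[i - flank:i]
        right = conserved[i + 1:i + 1 + flank]
        if all(left) and all(right):
            snps.append((i, dict(c)))
    return snps
```

For example, with a shortened flank for readability, `predict_snps(["ACGTACGT", "ACGTACGT", "ACGAACGT", "ACGAACGT"], flank=3)` flags position 3, where T and A are each seen twice. Dropping one of the A-carrying reads leaves that allele observed only once, so no SNP is reported.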

RESULTS AND DISCUSSION

From 42,679 raw EST sequences, 4,600 contigs were obtained by clustering overlapping sequences with CAP3, which left 15,217 singletons unclustered. Out of the 4,600 contigs, 911 were predicted to contain SNPs and a total of 4,597 SNPs were predicted (supplementary Tab. 1). Of the 911 contigs, 672 were composed of sequences from a single species. These included all 544 contigs containing A. franciscana sequences, since A. franciscana never clustered with any of the remaining species within the contigs predicted to contain SNPs.

The lack of clustering between A. franciscana and the penaeid shrimp species is partially expected based on the more ancient divergence of A. franciscana, a branchiopod, from the penaeid species of shrimp.⁴,⁵ Considering this lack of homology between A. franciscana and the other shrimp species, we can be sure that little genomic knowledge can be transferred from brine shrimp to the other species included in this study.

Of the 911 contigs predicted to contain SNPs, 415 (46%) showed significant alignments to at least one L. vannamei contig. Most of the contigs that did not align significantly (463 out of 496) were composed of sequences from a single species, with 420 of them containing only A. franciscana sequences. In comparison, 86% of the contigs composed of sequences from two or more species aligned well to a contig from L. vannamei.

Within the predicted SNPs, there were three main categories: 1) the bases which appeared to be segregating within one or more species; 2) the bases which appeared to only vary between species; and 3) the bases which appeared to vary both within and between species. A total of 3,571 SNPs belong in category 1. In category 2, there were 703 predicted between-species base differences. Finally, 323 bases appeared to vary both within and between species (category 3). Of these base differences, all 703 from category 2 and 32 from category 3 may prove useful for distinguishing species (Fig. 1). In vivo testing is needed to confirm which of these predicted bases are actually fixed within each species, but the predictions obtained from ESTs provide the surrounding sequence needed to design primers.
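The three categories can be expressed as a small classification rule over the species label and base of each EST at a predicted SNP position. The sketch below is illustrative only: the function name is hypothetical, and the exact decision rule used in the study is not stated, so it assumes that a position "varies between species" whenever at least two species show different sets of alleles.

```python
from collections import defaultdict


def classify_snp(observations):
    """Classify one SNP position from (species, base) pairs, one per EST."""
    alleles = defaultdict(set)
    for species, base in observations:
        alleles[species].add(base)
    # Within-species variation: some species shows more than one allele.
    within = any(len(a) > 1 for a in alleles.values())
    # Between-species variation (assumed rule): allele sets differ among species.
    sets = list(alleles.values())
    between = any(a != sets[0] for a in sets[1:])
    if within and between:
        return "both"         # category 3
    if within:
        return "within"       # category 1
    if between:
        return "between"      # category 2
    return "monomorphic"
```

Under this rule, a position where P. monodon reads carry A and G while F. chinensis reads carry only A falls in category 3. The reported totals are internally consistent: 3,571 + 703 + 323 = 4,597 predicted SNPs, and 703 + 32 = 735 potentially species-distinguishing bases.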

This method had more success with predicting differences between species than predicting SNPs that appear to segregate in multiple species. Only 18 SNPs were found that appeared to be segregating in multiple species. Though further comparisons should be done, there is not much promise for use of a SNP chip across species of shrimp. However, clustering ESTs does seem to be a very cheap, easy, and quick way to assess which populations can be grouped together when designing a SNP chip or assessing the potential usefulness of an existing SNP chip across different populations or species whenever sizeable EST libraries are available. This method also quickly predicts inter- and intraspecies SNPs for further validation. Overall, our study demonstrated a useful approach for EST analysis, but the approach is limited if few ESTs exist for a given species.

ACKNOWLEDGEMENTS

Funding for this work was provided by the USDA NRSP8 National Animal Genome Program, the Iowa Agriculture and Home Economics Experiment Station, State of Iowa and Hatch funding. D.M. Gorbach is supported by a USDA-CSREES National Needs fellowship under Grant no. 2007-38420-17767.

REFERENCES

1. Gorbach DM, Hu Z-L, Du Z-Q, Rothschild MF. SNP discovery in Litopenaeus vannamei with a new computational pipeline. Anim. Genet. 2009;40(1):106-109.

2. Grapes L, Rudd S, Fernando RL, Megy K, Rocha D, Rothschild MF. Prospecting for pig single nucleotide polymorphisms in the human genome: have we struck gold? J. Anim. Breed. Genet. 2006;123(3):145-151.

3. Huang X, Madan A. CAP3: A DNA Sequence Assembly Program. Genome Res. 1999;9:868-877.

4. Hwang UW, Friedrich M, Tautz D, Park CJ, Kim W. Mitochondrial protein phylogeny joins myriapods with chelicerates. Nature. 2001;413(6852):154-157.

5. V’yugin VV, Gelfand MS, Lyubetsky VA. Tree Reconciliation: Reconstruction of Species Phylogeny by Phylogenetic Gene Trees. Mol. Biol. 2002;36(5):650-658.

[Figure 1 image: bar chart with a vertical axis running from 0 to 4,000 and three bars labelled Between Species SNPs, Within Species SNPs, and Both Within and Between Species SNPs. Legend: Species Distinguishing; All Other SNPs.]

Figure 1. Distribution of predicted SNPs. The 735 bases predicted to be useful for distinguishing species are represented in black. Most of the between species SNPs were predicted from brine shrimp, which had the most ESTs.


CHAPTER 4. USE OF SNP GENOTYPING TO DETERMINE PEDIGREE AND BREED COMPOSITION OF DAIRY CATTLE IN KENYA

A paper published in Journal of Animal Breeding and Genetics. Volume 127(5):348-351.

D.M. Gorbach, M.L. Makgahlela, J.M. Reecy, S.J. Kemp, I. Baltenweck, R. Ouma, O. Mwai, K. Marshall, B. Murdoch, S. Moore & M.F. Rothschild

Summary

High levels of inbreeding in East African dairy cattle are a potential concern because of the use of a limited range of imported germplasm coupled with strong selection, especially by disease, and sparse performance recording. To address this, genetic relationships and breed composition in an admixed population of Kenyan dairy cattle were estimated by means of a 50K SNP scan. Genomic DNA from 3 worldwide Holstein and 20 Kenyan bulls, 71 putative cow-calf pairs, 25 cows from a large ranch, and 5 other Kenyan animals was genotyped for 37,238 informative SNPs. Sires were predicted and 89% of putative dam-calf relationships were supported by genotype data. Animals were clustered with the HapMap population using Structure software to assess breed composition. Cows from a large ranch primarily clustered with Holsteins, while animals from smaller farms were generally crosses between Holstein and Guernsey. Coefficients of relatedness were estimated and showed evidence of heavy use of one AI bull. We conclude that little native germplasm exists within the genotyped populations and mostly European ancestry remains.

Introduction

Coefficients of relationship between pairs of individuals play a very important role in many areas of quantitative genetics, conservation genetics and molecular ecology. Knowledge of the genetic relationships in different populations is used for the estimation of quantitative genetic parameters (e.g. heritabilities and genetic covariances) and breeding values (Lynch & Walsh 1998; Ritland 2000), is necessary for the study of kin selection (Morin et al. 1994), and allows for the study of mating systems (Engh et al. 2002; Frankham et al. 2002). In the management of populations, availability of the pedigree structure, or of the co-ancestries among the individuals in a population, helps to avoid the loss of diversity and to control inbreeding (Ballou & Lacy 1995; Meuwissen 1997; Caballero & Toro 2000; Frankham et al. 2002).

In many developing countries, such as those of East Africa, the necessary pedigree and performance data are often not reliably recorded or are unavailable (Rege et al. 2001). Furthermore, the relatively few exotic (i.e. non-indigenous) genotypes available for import as semen are typically selected on the basis of their genetic merit under European or North American production systems. Consequently, losses because of disease and other environmental demands may be high and this presumably further narrows the range of exotic genetics in the African dairy herds (McDermott & Arimi 2002; Mattioli et al. 2000). There is reason for concern that the herds are subject to inbreeding and subsequent depression of productivity (Rege et al. 2001).

Another risk with importation of exotic germplasm is the loss of species diversity because of elimination of native stock from the African breeding population. Centuries of natural selection have resulted in native African cattle which are adapted for the harsher conditions, and these genetic resources may be lost if too many matings occur to animals of European ancestry. Studies utilizing high-density markers enable researchers to assess the current levels of genetic diversity and determine the optimal method for conservation of genetic diversity (Oliehoek et al. 2006; Windig & Engelsma 2009). Currently, genetic information about Kenyan cattle is missing, which makes determining the best method of conservation impractical.

Therefore, the objective of this study was to use large-scale SNP data to determine parentage and breed composition of each animal in an admixed population of dairy cattle in Kenya. The determination of breed composition of parents and offspring could provide information on how to improve population management as accurate pedigree records are not available for this assessment.

Materials and Methods

The following 23 bulls were sampled for this study with all but the first three living in Kenya: three worldwide Holsteins, 10 from the Central Artificial Insemination Station (CAIS) and 10 from smaller farms. All bulls were chosen based on their heavy use in Kenyan breeding programmes for genetic improvement of dairy cattle populations and status as putative sires of replacement heifers (stock). An additional 71 putative cow-calf pairs from different small herds were identified and sampled, plus 25 cows across 4 generations from a large Kenyan ranch (with deeper pedigree reported) and 5 unmatched cows and calves presumed genetically similar to the cow-calf pairs (172 cows and calves total). Blood or semen samples were collected from all 195 animals. For each animal, DNA was extracted from blood samples and genotyped for SNP markers across the genome, using the Illumina 50K bovine SNP chip (San Diego, CA, USA). All markers with Illumina QC scores <0.8 in more than 10% of the population or minor allele frequency <0.015 were removed leaving 37,238 markers.
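The marker filter described above can be sketched as a per-marker check. This is an illustration under stated assumptions, not the authors' script: the function name, the coding of genotypes as 0, 1 or 2 copies of one allele (with `None` for a missing call), and the one-QC-score-per-animal layout are hypothetical. Only the thresholds (0.8, 10% and 0.015) come from the text.

```python
def keep_marker(genotypes, qc_scores, min_qc=0.8, max_low_qc_frac=0.10, min_maf=0.015):
    """Return True if a marker passes the QC-score and minor allele frequency filters.

    genotypes: per-animal allele counts (0, 1, 2) or None when missing (assumed coding).
    qc_scores: per-animal genotype quality scores for this marker.
    """
    # Remove markers whose QC score is below min_qc in more than 10% of animals.
    low_qc = sum(1 for s in qc_scores if s < min_qc)
    if low_qc / len(qc_scores) > max_low_qc_frac:
        return False
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    # Allele frequency from allele counts; the minor allele is the rarer one.
    freq = sum(called) / (2 * len(called))
    maf = min(freq, 1 - freq)
    return maf >= min_maf
```

With 100 animals, a marker carrying 10 heterozygotes (allele frequency 0.05) is kept, whereas one carrying only 2 heterozygotes (0.01) is removed.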


Maternity and paternity were checked for each calf using Sire-Match software (E. J. Pollak, Cornell University, Ithaca, NY, USA) with 100 randomly selected markers with minor allele frequency greater than 0.45.
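How Sire-Match evaluates a candidate parent is not described here, so the sketch below does not reproduce that software. It shows a generic Mendelian exclusion check instead, a swapped-in technique for illustration: a parent and calf that are opposite homozygotes at a marker cannot be related as parent and offspring unless there is a genotyping error. The function names, the 0/1/2 genotype coding and the fixed random seed are assumptions. Only the marker count (100) and the frequency threshold (0.45) come from the text.

```python
import random


def select_markers(maf_by_marker, n=100, min_maf=0.45, seed=0):
    """Randomly pick up to n markers whose minor allele frequency exceeds min_maf."""
    eligible = sorted(m for m, maf in maf_by_marker.items() if maf > min_maf)
    return random.Random(seed).sample(eligible, min(n, len(eligible)))


def count_exclusions(parent, calf, markers):
    """Count markers where parent and calf are opposite homozygotes (0 versus 2)."""
    conflicts = 0
    for m in markers:
        p, c = parent.get(m), calf.get(m)
        if p is None or c is None:
            continue  # skip markers with a missing genotype
        if {p, c} == {0, 2}:
            conflicts += 1
    return conflicts
```

A putative parent with no or very few exclusions across the selected markers remains compatible with the calf. How many exclusions to tolerate as genotyping error is a separate choice that this sketch leaves open.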

Genomic inbreeding coefficients were estimated using the pair-wise kinship coefficients of Ritland (1996) implemented in Spatial Pattern Analysis of Genetic Diversity (SPAGeDi 1.2; Hardy & Vekemans 2002). Coefficients were computed separately for each chromosome because of the software’s limitations and then averaged across the first 20 chromosomes.

A total of 1,000 markers which were genotyped in both the HapMap study (http://bfgl.anri.barc.usda.gov/cgi-bin/hapmap/affy2/viewMarkers) and the 195 Kenyan animals were entered into Structure 2.1 (Pritchard et al. 2000) to detect breed composition of the animals. Runs were completed with k=2 through k=10 clusters using 10,000 reps during burn-in and 10,000 MCMC reps after burn-in with each run replicated twice. Each run predicts the proportion of the animals’ genetic composition that originates from each of the k clusters.

Results

A total of 8 of the 81 putative cow-calf pairs (which include 10 from the large ranch) showed incorrect maternity. All eight of these calves came from small farms, so 63 of the 71 small-farm pairs (89%) were supported by the genotype data, and the true dams of the eight calves were not among the genotyped animals. Either some incorrect parentage identification occurred on the small farms or errors may have occurred during sampling and processing.

Sire-Match software predicted 24 sire-calf pairs, with all nine of the predicted sires coming from CAIS. One bull from CAIS was the predicted sire of eight genotyped calves. This sire was recorded under two different names within the large-scale ranch pedigree, which may have resulted in him being used more than owners realized.


Based on how much of the genetic make-up of the Kenyan cows clustered with the known Holsteins and Guernseys from the HapMap population, the Structure program gave evidence that the cows from the large-scale farm are primarily of Holstein descent, while the animals from smaller farms are a mix of Holstein and Guernsey (Figure 1). The bulls used for AI varied widely in their ancestry with anywhere from 30 to 98% Holstein genetics with the next most represented breed being Guernsey. None of the animals appeared to have significant native African or Bos indicus heritage.

Of 195 animals sampled, 41 had inbreeding coefficients greater than 2.5% including 14 originating from the large-scale ranch (56% of that population) and all 3 worldwide Holstein bulls (Figure 2).

Discussion

To infer genetic relationships and breed composition, we examined the population structure based on SNP data from an admixed population of dairy cattle in Kenya. Based on how well the Structure program clustered animals within breeds from the HapMap population and recognized breed composition for known admixed breeds, this software was used instead of assignment methods which would have assumed animals were purebred. The clustering results from the Structure program reflect the widespread use of Holstein semen in the Kenyan dairy breeding populations and perhaps indicate earlier access to Holstein genetics by the larger ranch owners compared to small-scale farmers. While the large ranch cows are mostly Holstein, the animals from smaller farms average about 50% Holstein blood, consistent with one generation of using Holstein semen. Surprisingly, the Kenyan animals showed little to no evidence of Bos indicus ancestry or other native African blood. The other major contributor to most of these cows was most likely the Guernsey breed. Further work should be completed to assess the suitability of these European breeds for milk production in East Africa.

Based on the pedigrees provided by small-scale farmers, ancestry records maintained in Kenya appear to be incomplete and contain a significant number of inaccuracies. While the large ranch pedigree was far more complete, it also contained a small number of errors and showed no evidence of attempts to minimize inbreeding. Because true base population allele frequencies were not available, the values of the genomic inbreeding coefficients are only accurate for comparisons of animals within this population and say nothing about the true inbreeding levels of each animal. Furthermore, insufficient pedigree structure and trio data existed to make other inbreeding estimation techniques viable (data not shown).

Consequently, the main conclusion to be drawn is that large-scale ranch cows appear to exhibit more inbreeding than cattle from smaller farms, although the worldwide Holstein bulls have higher average inbreeding than either group of cows (Figure 2).

Overall, the Structure program clustering results and parentage information obtained in this study indicated that, genetically, cows from large-scale farms are more similar to Holstein dairy cattle and may be more inbred than local Kenyan animals. Bulls used for AI varied substantially in their genetic background. We also found that data on animal relationships were not always accurately recorded.

Acknowledgements

Funding was provided by the Ensminger Endowment, State of Iowa and Hatch funding. Support for D. M. Gorbach was provided by a USDA-CSREES National Needs fellowship under Grant no. 2007-38420-17767. We thank the two reviewers for their constructive feedback on this article.


References

Ballou J., Lacy R.C. (1995) Identifying genetically important individuals for management of

genetic diversity in pedigreed populations. In: J. Ballou, M. Gilpin, and T. Foose

(eds), Population management for survival and recovery: analytical methods and

strategies in small population conservation. Columbia Univ. Press, New York, pp. 76-

111.

Caballero A., Toro M.A. (2000) Interrelations between effective population size and other

pedigree tools for the management of conserved populations. Genet. Res., 75, 331-

343.

Engh A.L., Funk S.M., Van Horn R.C., Scribner K.T., Bruford M.W., Libants S., Szykman

M., Smale L., Holekamp K.E. (2002) Reproductive skew among males in a female-

dominated mammalian society. Behav. Ecol., 13, 193-200.

Frankham R., Ballou J.D., Briscoe D.A. (2002) Introduction to Conservation Genetics.

Cambridge University Press, Cambridge.

Hardy O.J., Vekemans X. (2002) SPAGeDi: a versatile computer program to analyse spatial

genetic structure at the individual or population levels. Mol. Ecol. Notes, 2, 618–620.

Lynch M., Walsh J.B. (1998) Genetics and Analysis of Quantitative Traits. Sinauer

Associates, Sunderland, MA.

Mattioli R.C., Pandey V.S., Murray M., Fitzpatrick J.L. (2000) Immunogenetic influences on

tick resistance in African cattle with particular reference to trypanotolerant N’Dama

(Bos Taurus) and trypanosusceptible Gobra zebu (Bos indicus) cattle. Acta Trop., 75,

263-277.

91

McDermott J.J., Arimi S.M. (2002) Brucellosis in sub-Saharan Africa: epidemiology, control

and impact. Vet. Microbiol., 90, 111-134.

Meuwissen T.H. (1997) Maximizing the response of selection with a predefined rate of

inbreeding. J. Anim. Sci., 75, 934-940.

Morin P.A., Moore J.J., Chakraborty R., Jin L., Goodall J., Woodruff D.S. (1994) Kin

selection, social structure, gene flow, and the evolution of chimpanzees. Science, 265,

1193-1201.

Oliehoek P.A., Windig J., Arendonk J.A.M., Bijma P. (2006) Estimating relatedness between

individuals in general populations with a focus on their use in conservation programs.

Genetics, 173, 483-496.

Pritchard J.K., Stephens M., Donnelly P.J. (2000) Inference of population structure using

multilocus genotype data. Genetics, 155, 945–959.

Rege J.E.O., Kahi A., Okomo-Adhiambo M., Mwacharo J., Hanotte O. (2001) Zebu cattle of

Kenya: Uses, performance, farmer preferences and measures of genetic diversity.

ILRI (International Livestock Research Institute), Nairobi, Kenya.

Ritland K. (1996) Estimators for pairwise relatedness and inbreeding coefficients. Genet.

Res., 67, 175–186.

Ritland K. (2000). Marker-inferred relatedness as a tool for detecting heritability in nature.

Mol. Ecol., 9, 1195–1204.

Windig J.J., Engelsma K.A. (2009) Perspectives of genomics for genetic conservation of

livestock. Conserv. Genet., Published Online: 5 November 2009. DOI: 10.1007/

s10592-009-0007-x.


Figure 1. Results from the Structure software for clustering the Kenyan animals with the HapMap cattle. A cluster size of k = 6 is depicted. The cows from the large-scale ranch cluster more closely with Holsteins than do those from smaller farms. Common relatives with Guernsey also appear to have contributed significantly to the Kenyan populations (further resolved with larger cluster size).


Figure 2. A comparison of the distribution of inbreeding coefficients in the different groups of cattle based on Ritland’s pair-wise kinship coefficient. The worldwide bulls show the most inbreeding, but the large-ranch cows show a higher average than other Kenyan groups.


CHAPTER 5. POLYDACTYL INHERITANCE IN THE PIG

A paper published in Journal of Heredity. Volume 101(4):469-475.

Danielle Gorbach*, Benny Mote*, Liviu Totir, Rohan Fernando, Max Rothschild

ABSTRACT

Two pigs were identified having “extra feet” (preaxial polydactyly) within a purebred

population of Yorkshire pigs. Polydactyly is an inherited disorder in many species that may

be controlled by either recessive or dominant genes. Experimental matings were conducted

using pigs that had produced affected offspring with the result of 12 polydactyl offspring out

of 95 piglets. One polydactyl-producing boar was also mated to 4 Duroc sows and 8

distantly related Yorkshire sows to produce 129 unaffected offspring. Together, these results

suggest a recessive mode of inheritance, possibly with incomplete penetrance. Candidate

genes, LMBR1, EN2, HOXA10-13, GLI3, WNT2, WNT16 and SHH, were identified based on

association with similar phenotypes in other species. Homologues for these genes are all

found on SSC18. Sequencing and linkage studies showed no evidence for association with

HOXA10-13, WNT2 and WNT16. Results for the regions including GLI3, LMBR1, and SHH,

however, were inconclusive. A whole genome scan was conducted on DNA samples from

10 affected pigs and 12 close relatives using the Illumina PorcineSNP60 BeadChip and

compared with 69 more distantly related animals in the same population. No evidence was

found for a major gene causing polydactyly. However, a 25-Mb stretch of homozygosity on

SSC8 was identified as fairly unique to the family segregating for this trait. Therefore, this

chromosome segment may play a role in development of polydactyly in concert with other

genes.

*These authors contributed equally to the work.


INTRODUCTION

The first report of a polydactyl pig was in 1931 (Curson 1931), with additional reports in 1938 (Hughes 1938), 1959 (Gaedtke 1959), and 1963 (Ptak 1963). Additionally, the National Swine Registry (the governing body of the Yorkshire breed of pigs) states in its registration requirements that a Yorkshire pig with an extra dewclaw may not be registered (National Swine Registry 2007).

Other vertebrates are also known to express a variety of polydactyl phenotypes, observed either in isolation or as one feature of a syndrome. Genes found responsible for polydactylism in other species include: Hedgehog and WNT gene families (Yang et al. 1998; Sheth et al. 2007; Zeller and Zuniga 2007), Sonic Hedgehog (SHH) (Hill 2007), a cis-acting regulator of SHH within LMBR1 (Sagai et al. 2004; Huang et al. 2006; Wang et al. 2007), engrailed-2 (EN2) (Lawrence et al. 1999), TWIST1 (Firulli et al. 2007), GLI family zinc finger 3 (GLI3) (Fujioka et al. 2005), and the HOX genes (Tarchini et al. 2006; Sheth et al. 2007; Zeller and Zuniga 2007). Homologues for all of these candidate genes except TWIST1 reside on porcine chromosome 18.

After initial identification of 2 purebred Yorkshire pigs expressing a polydactyl phenotype, additional matings were conducted to determine whether the phenotype was in fact genetic in nature and not due solely to environmental influences. Additional pigs with a range of polydactyl phenotypes were produced, and genetic markers were used to investigate the mode of inheritance and the genes causing the observed phenotype in our population of pigs.


MATERIALS AND METHODS

Phenotypic Analysis

The polydactyl phenotypes observed ranged from extra dewclaws where only extra

phalanges were present to extra feet that included extra carpal bones, metacarpals, and

phalanges (Figure 1). In one case, the extra feet replaced the medial dewclaws (Figure 2). In

cases of polydactyly in other species, a range of phenotypes has resulted from mutations in

a single gene (e.g., Radhakrishna et al. 1999). Because pigs exhibiting all of these

phenotypes occurred in the same family, pigs with any of the polydactyl traits were scored as affected, regardless of severity.

Population

This study originated in a closed breeding population of Yorkshire pigs located at the

Iowa State University (ISU) swine breeding farm (Madrid, IA). From this population, a total

of 3 boars (2 producing affected offspring), 9 dams (5 producing affected offspring), and 127

offspring (12 affected) in 13 litters were included in this study. Two additional Yorkshire

boars from Swine Genetics International (SGI) (Cambridge, IA) that had reports of

producing pigs with extra dewclaws in other herds were used, producing 15 offspring (2

affected) in 2 litters. Additionally, 1 other boar and 12 other dams were mated with some of

the above animals to assess the recessive or dominant nature of this trait, resulting in 143

offspring. All animals were raised under approved animal care regulations.

Two polydactyl pigs were initially identified. Pig 137-06 was a male pig (Figure

1A,B) having an “extra foot” on the medial side of both of his front feet. He was one of 8

piglets in a litter resulting from a mating between the 18-1 (sire) and 52-11 (dam) animals.

Pig 156-08 was a female pig (Figure 1C) possessing an “extra foot” on the medial side of


only one front foot (left) and was from a litter of 9 piglets that resulted from the mating of the

95-04 (sire) and 102-11 (dam) animals. All future matings involved at least one animal

derived from the above 4 parents. Pictures of live pigs were taken at the farm, and X-rays were taken of 137-06 after he was euthanized. Pedigrees are provided in Figures 3 and 4.

Matings

The 95-04 boar was subsequently mated to the 52-11 and 102-11 dams, his cousin

(137-09, a full sibling to the 137-06 barrow), and several of his daughters (156-05 twice,

156-06 twice, 156-07, 192-05, 192-07, and 207-03). The boar 156-03 was mated with his full

sister, 192-07.

Additional matings were carried out to further test the inheritance of this phenotype

(Figure 4). The sire 95-04 was mated to 4 Duroc (unrelated) females and 8 additional

Yorkshire (unrelated) females. The 156-05 female was mated to a Duroc boar that had sired

multiple litters but no affected piglets.

SGI possessed frozen semen from 2 boars, subsequently referred to as SGI_1 and SGI_2, that had previous reports of offspring with extra dewclaws. SGI_1 was mated to 137-09, and SGI_2 was mated to 156-06.

DNA

For adult or mature animals, blood samples were collected and used for DNA

isolation. For planned matings, tail tissue samples were obtained at birth from each pig,

placed into a labeled 1.7 ml tube, and stored at -80 °C. For each animal, a 25-mg tail sample

was used for DNA isolation using the DNeasy kit from Qiagen (Valencia, CA) following the

manufacturer’s protocol.


Candidate Gene Approach to SNP Genotyping

Single nucleotide polymorphisms (SNPs) were identified and genotyped in candidate genes as well as other genes located throughout SSC18. The candidate genes analyzed on SSC18 included LMBR1, SHH, EN2, HOXA10-13, GLI3, WNT2, and WNT16; for each, SNPs were genotyped or, if no SNPs had otherwise been identified, sequencing was used. In addition, SNPs from the following genes were used for linkage analysis of the chromosome: Leptin, GPR37, SPAM1, HYAL4, WASL, LMOD2, ASB15, SLC13A1, AASS, FAM3C, WNT16, ING3, WNT2, PACAPR (Kollers et al. 2006), MPP6, and IGFBP1 (Mote and Rothschild 2006), spanning 83 of 91 cM of SSC18.

Linkage Analysis

An extension of the Elston-Stewart algorithm was used in a model-based linkage analysis to map the genomic location most likely to contain the locus causing the polydactyl phenotype. The implementation of the Elston-Stewart algorithm used here is described in Elston and Stewart (1971) and Fernandez et al. (2001, 2002), although the use of linkage mapping has been described previously (Ott 1974). For each marker interval, two likelihoods were calculated under complete penetrance: L1, assuming that the polydactyl mutation is at the center of the interval, and L2, assuming that the polydactyl mutation is at another chromosomal location. The base-10 logarithm of the likelihood ratio (L1/L2) gave the logarithm of odds (LOD) score, and a LOD score greater than 3 was classified as significant.

Likelihood (L) can be expressed as

$$L \propto \Pr(y) = \sum_{G} \Pr(y \mid G)\,\Pr(G),$$


where y is a vector of polydactyl phenotypes and G is a vector of genotypes at the markers

flanking the interval in question. LOD scores equal to or below -3.0 were considered evidence

against linkage. Scores above 3.0 were considered strong evidence for linkage.
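As a rough illustration of the two quantities defined above, the following Python sketch sums Pr(y | G) Pr(G) over the genotypes of a single animal and converts a pair of likelihoods into a LOD score using the thresholds stated here. It is a toy single-locus example and not the Elston-Stewart pedigree implementation used in this study; the function names, the genotype labels, and the carrier-by-carrier genotype prior are assumptions made for illustration.

```python
import math

# Hypothetical single-locus illustration; NOT the Elston-Stewart
# pedigree algorithm used in the study.

def phenotype_likelihood(affected, genotype_prior, penetrance):
    """Return Pr(y) = sum over G of Pr(y | G) * Pr(G) for one animal."""
    total = 0.0
    for genotype, prior in genotype_prior.items():
        p_affected = penetrance[genotype]
        p_y_given_g = p_affected if affected else 1.0 - p_affected
        total += p_y_given_g * prior
    return total

def lod_score(l1, l2):
    """Base-10 logarithm of the likelihood ratio L1 / L2."""
    return math.log10(l1 / l2)

def classify_lod(lod, threshold=3.0):
    """Apply the thresholds from the text: > 3 for linkage, <= -3 against."""
    if lod > threshold:
        return "evidence for linkage"
    if lod <= -threshold:
        return "evidence against linkage"
    return "inconclusive"

# Assumed example: offspring of two carriers, recessive allele "a",
# full penetrance.
carrier_cross = {"AA": 0.25, "Aa": 0.50, "aa": 0.25}
recessive_full = {"AA": 0.0, "Aa": 0.0, "aa": 1.0}
p_affected = phenotype_likelihood(True, carrier_cross, recessive_full)

print(p_affected)            # probability that such an offspring is affected
print(classify_lod(-3.08))   # a score at or below -3.0
print(classify_lod(0.15))    # a score between the two thresholds
```

In a real pedigree the sum runs over joint genotypes of all animals, which is what makes an efficient algorithm such as Elston-Stewart necessary.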

Large-Scale SNP Genotyping

DNA samples from 10 affected animals (137-06, 156-08, X156-01, 177-07, 184-01, 191-13, 207-07, 251-01, 251-08, and 251-15) and 12 close relatives (sires: 18-1, 95-04 and his dam 18-5 [sister of 18-1]; dams: 52-11, 102-11; both dams and full siblings: 156-05, 156-06; full siblings: 137-07, X156-02, 251-03, 251-13, 251-16) were genotyped for 64,232 SNPs using the Illumina (San Diego, CA) 60K porcine SNP chip (Ramos et al. 2009). The same SNP chip was used to genotype 730 other animals from the same closed breeding population. GeneSeek, Inc. (Lincoln, NE) completed the genotyping.

Statistical Analysis of SNP Chip Data

Data from the 730 control animals combined with the 10 affected animals were used to remove from further analysis all SNPs that were not segregating in the population.

Analyses were conducted comparing the 10 affected piglets with the 12 related pigs, then by comparing the 10 affected pigs with 69 of the most genetically similar of the 730 pigs from the general population.
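The removal of non-segregating SNPs described above can be sketched in Python as follows. This is a minimal illustration, assuming genotypes are coded as counts of one allele (0, 1, or 2) with None for a missing call; the function names and the coding are hypothetical and do not describe the software actually used.

```python
def minor_allele_frequency(genotypes):
    """Minor allele frequency from allele counts (0, 1, 2); None = missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq_b = sum(called) / (2 * len(called))
    return min(freq_b, 1.0 - freq_b)

def segregating_snps(genotype_matrix):
    """Return indices of SNPs that are not fixed (MAF > 0) in the sample.

    genotype_matrix[j] holds the genotypes of all animals at SNP j.
    """
    return [j for j, snp in enumerate(genotype_matrix)
            if minor_allele_frequency(snp) > 0.0]

# Assumed toy data: three SNPs scored on a handful of animals.
toy_snps = [
    [0, 0, 0, 0],        # fixed for one allele
    [0, 1, 2, 2],        # segregating
    [2, 2, None, 2],     # fixed for the other allele, one missing call
]
print(segregating_snps(toy_snps))
print(minor_allele_frequency(toy_snps[1]))
```

The same frequency calculation could also serve to rank SNPs within a region by minor allele frequency.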

First, regions that were inherited identical-by-descent (IBD) were predicted in the affected animals by searching for the longest stretches of homozygosity based on build 9 of the porcine genome after excluding all fixed markers. Next, the polydactyl pigs were compared to the 12 relatives using the software programs PLINK v1.05 (Purcell et al. 2007) with the DFAM analysis option for family-based association analyses and multifactor


dimensionality reduction v1.1.0 (MDR; Moore et al. 2006) to look for single SNP effects and epistatic interactions between SNPs, respectively.
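The search for long stretches of homozygosity shared by the affected animals can be illustrated with a short Python sketch. It assumes that SNPs are ordered along a single chromosome, that genotypes are coded as allele counts (0, 1, or 2) with no missing calls, and that fixed markers have already been removed; the function name and the data layout are hypothetical.

```python
def longest_shared_homozygous_run(genotypes, positions):
    """Find the longest run of consecutive SNPs at which every animal is
    homozygous for the same allele.

    genotypes[i][j] is the allele count (0, 1, 2) of animal i at SNP j;
    positions[j] is the base-pair position of SNP j (sorted ascending).
    Returns (number_of_snps, start_bp, end_bp); start and end are None
    when no shared homozygous SNP exists.
    """
    best = (0, None, None)
    run_start = None
    for j in range(len(positions)):
        calls = {animal[j] for animal in genotypes}
        shared = len(calls) == 1 and calls <= {0, 2}
        if shared:
            if run_start is None:
                run_start = j
            length = j - run_start + 1
            if length > best[0]:
                best = (length, positions[run_start], positions[j])
        else:
            run_start = None
    return best

# Assumed toy data: three animals, six SNPs.
toy_genotypes = [
    [0, 2, 2, 0, 1, 2],
    [0, 2, 2, 0, 2, 2],
    [1, 2, 2, 0, 2, 2],
]
toy_positions = [100, 200, 300, 400, 500, 600]
print(longest_shared_homozygous_run(toy_genotypes, toy_positions))
```

A production version would also need to handle missing genotypes and restart the scan at each chromosome boundary.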

For all homozygous stretches that were thought to be significant based on the above analyses, the significance was assessed by comparing the affected animals with the 69 animals from the general population using PLINK v1.05 (Purcell et al. 2007). In each stretch, 1 SNP from each end and 1 in the middle were selected based on higher minor allele frequencies in the whole population to capture maximal variation. A haplotype association test was used to compare each set of 3 SNPs between the affected animals and unrelated control animals. All regions with any significance were examined for potential causative genes.

RESULTS

Matings and Possible Mode of Inheritance

A complete analysis of all matings (within the polydactyl family, matings of boar 95-04 to distantly related or unrelated sows, and matings to commercial boars) and the resulting

total number of litters farrowed, total pigs born, and the number of affected animals born can

be seen in Table 1. Due to many complications, including fertility problems, neither of the 2

matings of affected pigs to affected pigs produced any offspring. No further attempts were

made. Six matings producing 9 separate litters (3 repeat matings) occurred between pigs

within our population that had at some time produced polydactyl offspring (Table 1). This

included 2 boars and 5 sows. Together, these matings produced 65 live offspring and 28

stillborn offspring that could be evaluated for phenotype. Of these 93 offspring, 1 had 2

extra feet, 1 had an extra foot but was missing a dewclaw, 4 had a single extra foot, 2 had 2

extra dewclaws each, and 4 had a single extra dewclaw for a total of 12 polydactyl offspring.


These matings also appear to have produced a significantly higher proportion of dead

offspring (at least 30 offspring/9 litters) and mummies (at least 4 mummies/9 litters) than

expected in this herd; the remainder of this population of Yorkshire pigs showed an average

of 5% of piglets born dead and 2% mummified.

Additional matings outside of our affected population were used to assess whether the

trait could be dominant. The sire 95-04 was mated to 4 Duroc (unrelated) females producing

50 piglets and 8 additional Yorkshire (unrelated) females that produced 79 piglets with none being affected (Table 1). The mating of an unrelated Duroc boar to 156-05 (Yorkshire putative carrier) produced 14 unaffected piglets, even though she had previously produced 5 affected piglets.

Pedigree analysis of the initial 2 animals exhibiting the polydactyl phenotype showed that a common ancestor was found on both sides of each animal’s pedigree within 8-9 generations (data not shown). To compare the genetic basis of polydactyly in the ISU population with commercial swine, matings were made between boars from SGI (Cambridge, IA) and our dams. Pedigree analysis of both SGI boars showed that they also had the same common relative in their pedigree as the ISU animals. SGI_1 was mated to the female 137-09 who had already produced an affected pig when mated to the 95-04 boar. The resulting

litter produced 9 live piglets at birth with 2 individuals that possessed an extra dewclaw. One

of the affected piglets had other birth defects such that he was unable to stand and had to be

euthanized. The boar SGI_2 was mated to the female 156-06; the litter included 6 piglets

born with no affected piglets.


SNP Segregation Analysis

After examining the WNT16 alleles the affected offspring inherited from each parent,

it became clear that polydactyl offspring could inherit either allele from 95-04 (Table 2).

Linkage Analyses

Table 3 shows the genes analyzed for linkage analyses. No marker interval showed a

significant association with the polydactyl phenotype in question with LOD scores between

0.15 and -69.85 (Table 3). Based on the linkage analysis results, the causative mutation can

be rejected as being located between Leptin and WNT2 (LOD scores range from -3.08 to -69.85 throughout this stretch). Other regions of SSC18, such as the stretch from MPP6 to IGFBP1 (LOD score = 0.15), did not show strong evidence of association with the polydactyl phenotype but should not be excluded from further analysis. The region around SHH likewise could not be excluded (LOD score = -1.01).

60K SNP Chip Analysis

Comparisons of SNPs between affected and unaffected pigs from the family did not identify SNPs that achieved significance using the max(T) corrected empirical p-value (EMP2) option of PLINK but showed several SNPs on chromosome 7 with low pointwise empirical p-values (EMP1). MDR did not identify any significant epistatic interactions.

Based on the search for long IBD stretches in all affected animals, one especially long stretch (251 SNPs, 25 Mb) was identified on chromosome 8 (positions 36,104,826 - 61,205,897) that was homozygous among all 10 affected animals. However, this region was also homozygous among 11 of the 12 unaffected pigs. Only 9% of the general population (N = 730 animals genotyped) shared the genotype of the affected animals throughout this


stretch. The next longest stretch of homozygosity in the affected animals was 34 SNPs long

on chromosome 15 (1.0 Mb from 24,078,571 to 25,041,363 bp).

When PLINK was used to compare the affected animals with the group of 69 pigs

outside this family within regions shown by other methods to be potentially significant,

haplotype analyses gave the following P values for the most significant haplotype in each

region: SSC7 119,765,662 - 119,940,243 bp = 5.46 × 10^-14; SSC8 36,104,826 - 61,205,897 bp = 1.17 × 10^-8; and SSC18 553,289 - 2,383,572 bp = 5.78 × 10^-9.

DISCUSSION

Polydactyl pigs were rare in the general swine population but were commonly detected when animals were mated specifically to produce polydactyl offspring. In total, there were 14 affected pigs (half were born dead) of 110 pigs that were born in this project

using pigs (carriers) originating from the founder animals of this population (Figure 3).

However, when a boar that had produced polydactyl offspring was mated to distantly related

and unrelated sows, no affected offspring were seen among 129 piglets of 12 sows.

Together, these data suggest a recessive mode of inheritance for the trait.

At the same time, the number of affected offspring was less than expected for a

simple recessive mode of inheritance. Based on Mendelian expectations, we should have

seen approximately 28 polydactyl piglets (chi-squared = 6.6, P = 0.01). Therefore, this is

unlikely to be a simple recessive mode of inheritance. Furthermore, an interesting note was

that of the 155 pigs born in this project from parents (carriers or offspring of carriers), there

were at least 40 pigs (26%) born dead and at least 13 mummified fetuses (8%), much more

than normally expected (5% stillborn, 2% mummy) in this population. These data suggest


some type of lethal expression as well, although the observed proportions were not statistically different from those expected under a recessive mode of inheritance for lethality.
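The Mendelian expectation for affected piglets discussed above can be checked with a few lines of arithmetic. The Python sketch below assumes 110 scored pigs, 14 of them affected, and a 1:3 ratio of affected to unaffected offspring from carrier-by-carrier matings; the function names are illustrative only. Under these assumptions the expected number of affected pigs is 27.5, and the affected-class term alone comes to about 6.6, whereas summing over both classes gives a larger statistic, so the value obtained depends on how the test is set up.

```python
import math

def chi_square_terms(observed, expected):
    """Per-class contributions (O - E)^2 / E to a chi-square statistic."""
    return [(o - e) ** 2 / e for o, e in zip(observed, expected)]

def chi_square_p_value_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

# Assumed counts taken from the discussion: 14 affected of 110 pigs.
total_pigs = 110
observed = [14, total_pigs - 14]                    # affected, unaffected
expected = [0.25 * total_pigs, 0.75 * total_pigs]   # simple recessive, 1:3

terms = chi_square_terms(observed, expected)
affected_term = terms[0]
full_statistic = sum(terms)

print(expected[0])                                   # expected affected pigs
print(round(affected_term, 2), chi_square_p_value_1df(affected_term))
print(round(full_statistic, 2), chi_square_p_value_1df(full_statistic))
```

Either way, the observed count falls well short of the expectation for a fully penetrant simple recessive trait.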

Personal reports from other breeders that had sows with extra dewclaws, as well as the report from Hughes (1938), also suggest that this trait is not fully penetrant, as affected-by-affected matings produced both affected and unaffected pigs.

Another possible explanation of the inheritance pattern is that the phenotype is controlled by 2 recessive loci, which could explain the homozygous stretch on chromosome 8 being present in all affected animals and their family members. This region was less common among the general population (9% homozygotes). If this locus is interacting with another significant locus, it could have produced the observed ratio of phenotypes. Additionally, a causative mutation that was not genotyped may exist within this region and could be heterozygous in some of the unaffected relatives while being homozygous in the affected animals.

The production of affected offspring by mating ISU sows to SGI boars clearly suggested that this phenotype, in some form, also existed outside the breeding population at ISU with a similar genetic basis. This result indicates that the significant regions found in this population could be further studied in commercial populations to potentially identify a causative mutation.

Combined, these results suggest that the polydactyl phenotype in this population is not due to a single dominant gene and is instead recessive in nature, but without full penetrance. Under the assumption of recessive inheritance, the candidate genes WNT16 and WNT2 can be rejected as containing causative mutations on the basis of the inheritance patterns observed. The lack of mutations in EN2, HOXA10-13, and GLI3


also suggests that they are not directly involved in causing porcine polydactyly in this

population. The interval linkage analysis also rules out a causative mutation between Leptin

and WNT2 and between PACAPR and MPP6 on chromosome 18.

FUNDING

Iowa Agriculture and Home Economics Experiment Station; State of Iowa and Hatch

funding; US Department of Agriculture-Cooperative State Research, Education, and

Extension Service National Needs Fellowship to B.M. and (2007-38420-17767 to D.G.).

ACKNOWLEDGEMENTS

Data collection and assistance provided by individuals from the farm crew at the

Lauren Christian Swine Research Center, PIC USA, Dr. James Koltes, Mr. Dusty Loy, Ms.

Penny Fang, Mr. Dan Mouw, and Dr. Suneel Onteru and other members of the Rothschild

Laboratory are greatly appreciated. Contribution of semen from SGI, Cambridge, IA is appreciated. We would like to thank the 2 reviewers and especially editorial board member,

Dr. E. Bailey, for their constructive feedback on this article.

REFERENCES

Curson H, 1931. Polydactyly in a Springbok and a pig. Anatomical Studies. 19:853.

Elston R, Stewart J, 1971. A general model for the genetic analysis of pedigree data. Hum

Hered. 21(6):523-542.

Fernandez S, Fernando R, Guldbrandtsen B, Totir L, Carriquiry A, 2001. Sampling

genotypes in large pedigrees with loops. Genet Sel Evol. 33(4):337-367.

Fernandez S, Fernando R, Guldbrandtsen B, Stricker C, Schelling M, Carriquiry A, 2002.

Irreducibility and efficiency of ESIP to sample marker genotypes in large pedigrees

with loops. Genet Sel Evol. 34(5):537-555.


Firulli B, Redick B, Conway S, Firulli A, 2007. Mutations within helix I of Twist1 result in

distinct limb defects and variation of DNA binding affinities. J Biol Chem.

282(37):27536-27546.

Fujioka H, Ariga T, Horiuchi K, Otsu M, Igawa H, Kawashima K, Yamamoto Y, Sugihara T, Sakiyama Y, 2005. Molecular analysis of non-syndromic preaxial polydactyly: preaxial polydactyly type-IV and preaxial polydactyly type-I. Clin Genet. 67(5):429–433.

Gaedtke H, 1959. Hereditäre Polydaktylie einer Schweinefamilie. Monatsh Veterinärmed.

14:57-58.

Hill R, 2007. How to make a zone of polarizing activity: Insights into limb development via

the abnormality of preaxial polydactyly. Dev Growth Differ. 49:439-448.

Huang Y, Deng X, Du Z, Qiu X, Du X, Chen W, Morisson M, Leroux S, Ponce De Leon F,

Da Y, et al., 2006. Single nucleotide polymorphisms in the chicken Lmbr1 gene are

associated with chicken polydactyly. Gene. 374:10-18.

Hughes E, 1938. Polydactyly in swine. J Hered. 26:415-418.

Kollers K, Mote B, Rothschild M, Plastow G, Rocha D, 2006. Single nucleotide

polymorphism identification, linkage and radiation hybrid mapping of the porcine

pituitary adenylate cyclase-activating polypeptide type I receptor gene to

chromosome 18. J Anim Breed Genet. 123(6):414-418.

Lawrence P, Casal J, Struhl G, 1999. hedgehog and engrailed: pattern formation and polarity

in the Drosophila abdomen. Development. 126:2431-2439.

Moore JH, Gilbert JC, Tsai C-T, Chiang F-T, Holden T, Barney N, White BC, 2006. A

flexible computational framework for detecting, characterizing, and interpreting


statistical patterns of epistasis in genetic studies of human disease susceptibility. J

Theor Biol. 241:252-261.

Mote B, Rothschild M, 2006. SNP detection and linkage mapping of pig genes involved in

growth. Anim Genet. 37(3):295-296.

National Swine Registry. 2007. Yorkshire Breed Markings and Registration Requirements.

[cited 2007 December 4]. Available from:

http://www.nationalswine.com/industryreference/indrefswinebrYork.html.

Ott J, 1974. Estimation of the recombination fraction in human pedigrees: efficient

computation of the likelihood for human linkage studies. Am J Hum Genet.

26(5):588-597.

Ptak W, 1963. Polydactyly in wild boar. Acta Theriol. 7:312-314.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, et al., 2007. PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Hum Genet. 81(3):559-575.

Radhakrishna U, Bornholdt D, Scott HS, Patel UC, Rossier C, Engel H, Bottani A, Chandal

D, Blouin JL, Solanki JV, et al., 1999. The phenotypic spectrum of GLI3

morphopathies includes autosomal dominant preaxial polydactyly type-IV and

postaxial polydactyly type-A/B; No phenotype prediction from the position of GLI3

mutations. Am J Hum Genet. 65(3):645-655.

Ramos AM, Crooijmans RPMA, Affara NA, Amaral AJ, Archibald AL, Beever JE,

Bendixen C, Churcher C, Clark R, Dehais P, et al., 2009. Design of a High Density

SNP Genotyping Assay in the Pig Using SNPs Identified and Characterized by Next


Generation Sequencing Technology. PLoS ONE 4(8): e6524.

Sagai T, Masuya H, Tamura M, Shimizu K, Yada Y, Wakana S, Gondo Y, Noda T, Shiroishi

T, 2004. Phylogenetic conservation of a limb-specific, cis-acting regulator of Sonic

hedgehog (SHH). Mamm Genome. 15:23-34.

Sheth R, Bastida MF, Ros M, 2007. Hoxd and Gli3 interactions modulate digit number in the

amniote limb. Dev Biol. 310:430–441.

Tarchini B, Duboule D, Kmita M, 2006. Regulatory constraints in the evolution of the

tetrapod limb anterior–posterior polarity. Nature. 443:985-988.

Wang Z, Tian S, Shi Y, Zhou P, Wang Z, Shu R, Hu L, Kong X, 2007. A single C to T transition in intron 5 of LMBR1 gene is associated with triphalangeal thumb-polysyndactyly syndrome in a Chinese family. Biochem Bioph Res Co. 355(2):312-317.

Yang Y, Guillot P, Boyd Y, Lyon M, McMahon A, 1998. Evidence that preaxial polydactyly

in the Doublefoot mutant is due to ectopic Indian Hedgehog signaling. Development.

125:3123-3132.

Zeller R, Zuniga A, 2007. Shh and Gremlin1 chromosomal landscapes in development and

disease. Curr Opin Genet Dev. 17:428–434.



Figure 1. Original pigs affected with preaxial polydactyly in a breeding population of Yorkshire swine. (A) Yorkshire male (ID 137-06) expressing a preaxial polydactyl phenotype on both front feet. (B) Radiograph showing both front feet of the Yorkshire male (137-06) with preaxial polydactyly. (C) Yorkshire female (ID 156-08) expressing a preaxial polydactyl phenotype on one (left) front foot.


Figure 2. Variation of the preaxial polydactyl phenotype seen in Yorkshire pigs. Both legs had what appeared to be an “extra foot” but one (left) is missing a dewclaw.


[Pedigree diagram not reproduced. Animals labeled in the diagram: 102-11, 95-04, 52-11, 18-01, 156-03, 156-05, 192-07, 156-06, 207-03, and 137-09.]

Figure 3. Pedigree of Yorkshire animals where polydactyl animals existed. Circles represent females. Squares represent males. Shaded figures represent polydactyl animals. Large circles and squares are parents, whereas smaller circles and squares represent the number of offspring from the mating. Mummies are not included in counts, though pigs born dead are.


[Pedigree diagram not reproduced. Animals labeled in the diagram: 102-11, 95-04, 52-11, 18-01, Duroc, 156-05, SGI_2, 156-06, SGI_1, and 137-09.]

Figure 4. Pedigree structure of Yorkshire population where polydactyl pigs appeared when females were mated to males outside the breeding population to test inheritance. Circles represent females. Squares represent males. Shaded figures represent polydactyl animals. Large circles and squares are parents, whereas smaller circles and squares represent the number of offspring from the mating.


Table 1. Phenotypic results of all matings to produce polydactyl offspring

| Sire (a) | Sire ever produced affected offspring? | Dam | Dam ever produced affected offspring? | Litter | No. born alive | No. born dead | No. affected (b) | No. mummies |
|---|---|---|---|---|---|---|---|---|
| 95-04 | Yes | 102-11 | Yes | 156 | 9 | Unknown | 1 | Unknown |
| 18-01 | Yes | 52-11 | Yes | 137 | 12 | Unknown | 1 | Unknown |
| 95-04 | Yes | 102-11 | Yes | 192 | 10 | 2 (c) | 0 | 2 |
| 95-04 | Yes | 52-11 | Yes | 207 | 3 | 8 | 2 | Unknown |
| 95-04 | Yes | 156-05 | Yes | 191 | 6 | 7 | 2 | Unknown |
| 95-04 | Yes | 156-06 | Yes | x156 | 0 | 2 | 1 | 0 |
| 95-04 | Yes | 156-06 | Yes | 177 | 5 | 3 | 1 | 2 |
| 95-04 | Yes | 137-09 | Yes | 184 | 12 | 0 | 1 | 0 |
| 95-04 | Yes | 156-05 | Yes | 251 | 8 | 8 | 3 | 0 |
| 95-04 | Yes | 207-03 | No | 275 | 5 | 5 (c) | 0 | 2 |
| 156-03 | No | 192-07 | No | 277 | 6 | 5 | 0 | 0 |
| 95-04 | Yes | 156-07 | No | 291 | 7 | 0 | 0 | 7 |
| 95-04 | Yes | 192-05 | No | 294 | 4 | 0 | 0 | 0 |
| SGI_1 | Yes | 137-09 | Yes | | 9 | Unknown | 2 | Unknown |
| SGI_2 | Yes | 156-06 | Yes | | 6 | Unknown | 0 | Unknown |
| 95-04 | Yes | 4 Durocs | No | | 50 | Unknown | 0 | Unknown |
| 95-04 | Yes | 8 Yorkshires | No | | 79 | Unknown | 0 | Unknown |
| Duroc | No | 156-05 | Yes | | 14 | Unknown | 0 | Unknown |
| Total | | | | | 245 | 40 | 14 | 13 |

(a) No carrier by affected matings produced a litter.
(b) Affected animals have extra dewclaws or extra feet.
(c) Indicates these dead individuals were never examined for phenotype.


Table 2. Evidence against WNT16 from segregation data from matings of pigs producing affected offspring

| Gene | Sire | Sire's SNP genotype | Dam | Dam's SNP genotype | Affected offsprings' genotypes | Unaffected offsprings' genotypes |
|---|---|---|---|---|---|---|
| WNT16 | 95-04 | A/B | 156-05 | A/A | A/A (1), A/B (3) | A/A (10), A/B (9) |
| WNT16 | 95-04 | A/B | 156-06 | B/B | A/B (1) | A/B (2), B/B (6) |
| WNT16-SNP2 | 95-04 | A/B | 156-05 | B/B | A/B (3), B/B (1) | A/B (10), B/B (7) |
| WNT16-SNP2 | 95-04 | A/B | 156-06 | A/A | A/B (1) | A/A (6), A/B (2) |


Table 3. LOD scores for intervals on chromosome 18

| Starting gene | Ending gene | LOD score for interval (a) | Physical location of start gene (Mb) (b) | Physical location of end gene (Mb) (b) |
|---|---|---|---|---|
| LMBR1 | LEPTIN | -1.01 | 0.96 | 18.35 (c) |
| LEPTIN | GPR37 | -3.08 | 18.35 | 21.14 |
| GPR37 | SPAM1 | -44.79 | 21.14 | 21.75 |
| SPAM1 | HYAL4 | -29.86 | 21.75 | 21.78 |
| HYAL4 | WASL | -31.93 | 21.78 | 21.91 |
| WASL | LMOD2 | -42.98 | 21.91 | 21.96 |
| LMOD2 | ASB15 | -53.31 | 21.96 | 21.99 |
| ASB15 | SLC13A1 | -46.56 | 21.99 | 22.34 |
| SLC13A1 | AASS | -11.97 | 22.34 | 23.17 |
| AASS | FAM3C | -51.14 | 23.17 | 23.75 |
| FAM3C | WNT16 | -69.85 | 23.75 | 23.79 |
| WNT16 | ING3 | -36.79 | 23.79 | 24.12 |
| ING3 | WNT2 | -6.21 | 24.12 | 27.50 |
| WNT2 | PACAPR | -0.87 | 27.50 | 40.41 |
| PACAPR | MPP6 | -4.65 | 40.41 | 46.58 (d) |
| MPP6 | IGFBP1 | 0.15 | 46.58 | 48.65 (e) |

(a) Based on recessive trait with full penetrance.
(b) Midpoint of each gene in megabases taken from ENSEMBL Pig Build 9.
(c) SHH is located in this interval at 1.74 Mb and EN2 at 2.01 Mb.
(d) HOXA10-13 is located in this interval at 44.18 Mb.
(e) GLI3 is located after this interval at 51.45 Mb.


CHAPTER 6. WHOLE-GENOME ASSOCIATION ANALYSES FOR RESIDUAL FEED INTAKE AND RELATED TRAITS IN SWINE

A paper prepared for submission to PLoS ONE.

D.M. Gorbach, W. Cai, J.M. Young, J.C.M. Dekkers, D.J. Garrick, R.L. Fernando, and M.F. Rothschild

ABSTRACT

As grain prices have increased, pig producers have become more concerned with their

animals’ feed efficiency. Residual feed intake (RFI) is a measure of feed efficiency that

quantifies how much feed an animal consumes compared to how much would be expected

based on maintenance and growth requirements. Iowa State University (ISU) has developed

two lines of Yorkshire swine. The first line was selected for decreased RFI while the second

line was randomly selected for five generations and then selected for increased RFI in the

sixth generation. Growing animals were measured for average daily feed intake (ADFI),

backfat (BF), average daily gain (ADG), and loin muscle area (LMA). A genome-wide association study was completed using 730 pigs genotyped with the PorcineSNP60

BeadChip. Genetic effects for each trait were fitted using the GenSel software with a

Bayesian model averaging approach (Bayes-C). Many sets of neighboring five-SNP windows were fitted more often than expected by chance, and haplotype analyses using ASReml software confirmed many of the significant window effects. Based on build 10 of the porcine genome, several genes involved in gene expression regulation, including MKNK1 and

MKNK2, were localized to significant regions for RFI. Multiple other genes such as HTR2A and DOLPP1 were also within significant RFI regions. For the other traits, genes previously known to be involved, such as MC4R, were confirmed to be significant, and novel genes such


as NUAK1 for ADFI and PHLPP1 for ADG were implicated. Overall, these genes could be

important for improving livestock selection or developing drug targets for enhanced feed

efficiency, although further research is necessary.

INTRODUCTION

Feed efficiency (FE) has long been a concern of livestock producers, but it has come to the forefront of discussions as feed prices have increased in recent years, driving down profits. Over 60% of the variation in feed intake in growing pigs is explained by differences in growth and maintenance [1], but significant differences in efficiency of feed utilization

may not be optimally captured by the ratio of growth to feed intake alone. To address this

issue, residual feed intake (RFI) was first described as a linear measure of FE in 1963 [2].

RFI is defined as the difference between the amount of feed an animal consumes and the

amount the animal is expected to need for its growth and maintenance based on average

requirements. Hence, a lower or more negative RFI value indicates greater FE.

RFI has been studied in many species and shown to have many favorable properties.

It is moderately heritable (h² estimates in pigs range from 0.14 to 0.40; [1,3-8]) and, as a

residual after adjusting feed intake for expected average daily gain (ADG) and backfat (BF),

has been designed to be phenotypically uncorrelated with ADG and BF in pigs. As a result,

selection for RFI has been shown to significantly improve FE without having large

undesirable impacts on growth rate and BF [7].

Over the last decade, Iowa State University (ISU) has developed two selection lines

for RFI [1,9]. Beginning with a mixed population of purebred Yorkshire pigs from Iowa

breeders, two lines were derived by randomly dividing up full siblings in generation zero. For

the first five generations, one line (select) was selected for decreased RFI and the other line

(control) was randomly selected. In subsequent generations, the control line was converted to

a divergent selection line. After six generations of selection, the following selection

responses were observed in the select line: RFI decreased by 1.8 genetic standard deviations

(sd), average daily feed intake (ADFI) decreased by 1.5 genetic sd, ADG decreased by 0.6

genetic sd, and BF decreased by 0.7 genetic sd [Cai, unpublished results]. These responses indicate that RFI is genetically correlated with ADG and BF, even though it is phenotypically uncorrelated with them by construction.

Four physiological mechanisms were reported to each account for more than 10% of

RFI in beef cattle: differences in digestion; activity level; protein turnover, tissue metabolism and stress; and variation in unmeasured processes such as ion transport [10]. A whole-

genome association study (WGAS) for RFI in cattle implicated several genes with functions

attributable to the above categories [11]. For example, UBE2I, which affects protein

turnover, and ATP1A1, which transports ions, were shown to be associated with RFI [11].

With the availability of whole-genome sequence and development of panels capable

of genotyping thousands of SNPs at a time, WGAS have become a relatively common

approach to assess the genetic bases of various phenotypes. Many researchers utilize

frequentist statistical approaches to analyze SNP-trait associations one SNP at a time and

adjust for multiple comparisons to determine the most significant results. More recently,

Bayesian approaches that fit all SNPs simultaneously have been developed for genomic

selection [12]. These methods account for linkage disequilibrium (LD) between SNPs when

determining the effects of individual SNPs. This latter approach has been used successfully

for WGAS in livestock [13,14].

In the present study, the PorcineSNP60 BeadChip [15] was used to perform WGAS

for RFI, ADFI, ADG, BF, and LMA using data from the ISU RFI lines. Bayesian statistical models were used initially, and results were compared to those discovered with haplotype approaches. Differences in allele frequencies between RFI lines were also evaluated.

MATERIALS AND METHODS

Study Population and Phenotypes

A total of 730 pigs from a population maintained at ISU’s Lauren Christian Swine

Research Farm in Madrid, IA were used in this study. Animal care guidelines were followed according to the Institutional Animal Care and Use Committee (IACUC) at ISU. Pigs were from the following generations of ISU’s selection experiment for residual feed intake [1]:

generations zero (69 boars from the select line that were full-siblings to the control line

animals), four (89 select line gilts, 74 control line gilts), five (62 boars and 81 gilts from the

select line, 85 boars and 90 gilts from the control line), and six (95 select line boars, 85

control line boars).

Each animal was put on-test at approximately 90 days of age with daily feed intake

collected using electronic FIRE Feeders (Osborne, KS, USA). Pens of 16 animals, each containing both select and control line pigs, were rotated between two weeks of feed intake collection and two weeks of ad libitum feeding on conventional feeders until they reached approximately 210 days of age. Weight was measured every two weeks during the test period. Upon reaching approximately 115 kg, 10th-rib BF and LMA were evaluated with an

Aloka ultrasound machine (Corometrics Medical Systems, Inc., Wallingford, CT, USA).

Refer to [1] and [9] for additional details. Of the 730 animals genotyped, 716, 723, 724, 718,
and 718 pigs had phenotypic data for RFI, ADFI, ADG, BF, and LMA, respectively.

ADG was calculated as the difference between off-test and on-test weights divided by

the number of days on-test. Calculations for ADFI took into account the data cleaning

methods developed by [16] to correct and fill in missing data. Phenotype for ADFI for each

pig was then estimated using a quadratic random regression model fitted to observed daily

feed intake data from on-test day to off-test day [17]. Finally, phenotypes for RFI were

computed similarly to [1] by using a single-trait animal model to adjust ADFI for ADG and BF.

The following equation was used to derive the RFI for an individual pig based on its

phenotype for ADFI:

$$\mathrm{RFI} = \mathrm{ADFI} - (b_{1i}\,\mathrm{onwtdev} + b_{2i}\,\mathrm{offwtdev} + b_{3i}\,\mathrm{metamidwt} + b_{4i}\,\mathrm{adga} + b_{5i}\,\mathrm{offbfa})$$

where i represents each combination of generation and line, onwtdev and offwtdev are the
differences between the pig’s weight at on-test and 40 kg and between its weight at off-test and 115 kg, respectively, metamidwt is the average of the pig’s weight^0.75 between on-test age and off-test age,
adga is the pig’s ADG adjusted to testing from 90-210 days of age, and offbfa is the

pig’s off-test BF adjusted to 115 kg of body weight. Regression coefficients (b) were

estimated using an animal model that fitted data from all generations and parities and

included the random effects of animal, dam, and pen within on-test group and the fixed

effects of line, on-test group, sex, and interactions of generation by line with each of the

following: onwtdev, offwtdev, metamidwt, adga, offbfa, and on-test age minus 90 (as

described in [17]).
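The residual idea behind the RFI equation can be illustrated with a much simpler model than the one used here. The sketch below computes RFI as the residual from an ordinary least-squares regression of ADFI on ADG and BF. It is only an illustration under stated assumptions: the study estimated the regression coefficients within an animal model with random effects and separate coefficients for each generation-by-line combination, whereas the function name `residual_feed_intake`, the single set of coefficients, and the simulated values below are all hypothetical.

```python
import numpy as np

def residual_feed_intake(adfi, covariates):
    """Residuals of an OLS regression of ADFI on the covariates (plus intercept).

    Simplified stand-in for the animal-model adjustment described in the text.
    """
    adfi = np.asarray(adfi, dtype=float)
    X = np.column_stack([np.ones(len(adfi)), np.asarray(covariates, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, adfi, rcond=None)
    return adfi - X @ coef  # observed intake minus expected intake

# Simulated (hypothetical) phenotypes for 50 pigs
rng = np.random.default_rng(0)
n = 50
adg = rng.normal(0.68, 0.05, n)   # kg/day
bf = rng.normal(16.7, 2.0, n)     # mm
adfi = 0.5 + 1.5 * adg + 0.02 * bf + rng.normal(0.0, 0.1, n)

rfi = residual_feed_intake(adfi, np.column_stack([adg, bf]))
```

Because the residuals of a least-squares fit are orthogonal to the fitted covariates, the resulting RFI values have zero phenotypic correlation with ADG and BF in the sample, which is the property referred to in the Introduction.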

Genotyping

DNA was extracted from tail tissue from each animal using the Qiagen (Valencia,

CA, USA) 96 DNeasy blood & tissue kit. DNA concentration was quantified with a

NanoDrop 1000 (Thermo Fisher Scientific Inc., Wilmington, DE, USA). All DNA samples

selected for genotyping had an A260/280 ratio between 1.7 and 2.0 and a DNA concentration

greater than 50 ng/µl. Genotyping for 64,232 SNPs was completed with the Illumina (San

Diego, CA, USA) PorcineSNP60 BeadChip by GeneSeek, Inc. (Lincoln, NE, USA). Only

genotypes on SNPs that met both of the following criteria were analyzed: 1) minor allele

frequency (MAF) greater than 0.0001 and 2) more than 80% of the population genotyped

successfully. A total of 55,533 SNPs satisfied these criteria.
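The two filtering criteria can be sketched as follows. The example assumes genotypes are stored as an animals-by-SNPs array of allele counts (0, 1, or 2) with `np.nan` marking failed calls; this coding, the function name `filter_snps`, and the toy matrix are assumptions made for illustration and do not reflect the file formats used in the study.

```python
import numpy as np

def filter_snps(genotypes, min_maf=0.0001, min_call_rate=0.80):
    """Boolean mask of SNPs with MAF > min_maf and call rate > min_call_rate."""
    g = np.asarray(genotypes, dtype=float)
    n_called = (~np.isnan(g)).sum(axis=0)
    call_rate = n_called / g.shape[0]
    allele_sum = np.nansum(g, axis=0)
    # Frequency of the counted allele among successfully called genotypes
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = np.where(n_called > 0, allele_sum / (2 * n_called), np.nan)
    maf = np.minimum(freq, 1 - freq)
    return (maf > min_maf) & (call_rate > min_call_rate)

# Toy data: 5 animals x 4 SNPs
toy = np.array([
    [0, 0, 2, np.nan],
    [1, 0, 2, np.nan],
    [2, 0, 2, 1],
    [1, 0, 2, np.nan],
    [0, 0, 2, np.nan],
])
keep = filter_snps(toy)  # only the first SNP is polymorphic and well genotyped
```

In this toy example the second and third SNPs are monomorphic and the fourth is called in only 20% of animals, so only the first SNP is retained.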

Single SNP Analyses

GenSel software (http://bigs.ansci.iastate.edu; [18]) was used to estimate the effect of

each SNP using the Bayes C model averaging approach, similar to the method described in

[19]. The following regression equation was implemented in the analyses:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e}$$

where y is a vector of phenotypes, X is an incidence matrix for fixed effects, β is a vector of fixed effects, Z is a matrix of SNP genotypes, u is a vector of random SNP allele substitution effects, and e is a vector of random residual effects. Fixed effects for RFI, ADFI, and ADG were line, sex, and pen within group, and on-test age nested within generation, parity, and line was fitted as a covariate. For BF and LMA, on-test age was replaced with off-test weight. Elements of vector u were estimated assuming the SNP effects originated from a mixture distribution with 99.5% of the SNPs having zero effect and 0.5% of the SNP effects coming from a $N(0, \sigma_u^2)$ distribution. SNP effects were sampled repeatedly during 50,000 iterations of a Markov chain fitting all informative SNPs simultaneously. Predicted SNP effects were based on posterior distributions from the last 40,000 iterations.
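For reference, the mixture prior on the SNP effects can be restated in symbols. In the expression below, $\pi$ denotes the prior probability that a SNP has zero effect and $u_j$ the effect of SNP $j$; this notation is introduced here for illustration and is not taken from the GenSel documentation.

```latex
u_j \mid \pi, \sigma_u^2 \sim
\begin{cases}
0 & \text{with probability } \pi = 0.995, \\
N(0, \sigma_u^2) & \text{with probability } 1 - \pi = 0.005.
\end{cases}
```

With 55,533 SNPs, this prior implies that roughly 0.005 × 55,533 ≈ 278 SNPs are expected to have nonzero effects a priori, and discarding the first 10,000 of the 50,000 iterations leaves 40,000 samples for posterior inference.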

Five-SNP Window Analyses

Causal mutations are unlikely to be present on the SNP chip, so linkage disequilibrium between neighboring SNPs was used to capture the effects of causal mutations. SNPs were ordered based on genomic position in build 10 of the porcine genome.

Contributions of windows of five consecutive SNPs to genomic estimated breeding values

(wGEBVs) were computed for each window across the genome based on the variance of the wGEBV among animals divided by the variance in genomic estimated breeding values

(GEBVs) based on all SNPs. The estimates of wGEBV were computed as the sum across the five SNPs of the product of the individual’s genotype and the estimate of the SNP effect. R software (http://www.r-project.org) was used to plot the proportion of variance that window explained against window position. Bootstrap analyses were carried out on the top 150 SNP windows for each trait as in [14]. Only SNP regions with p < 0.01 were considered for further gene and pathway analyses. No adjustments were made for multiple testing.
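The window calculation described above can be sketched as follows. The example assumes that windows slide along the genome one SNP at a time and that genotypes are coded as allele counts in map order; the text does not state whether windows overlapped, so the step size, the function name `window_variance_proportions`, and the toy data are assumptions. Chromosome boundaries are ignored for brevity.

```python
import numpy as np

def window_variance_proportions(genotypes, snp_effects, window_size=5):
    """Var(wGEBV) / Var(GEBV) for each window of consecutive SNPs.

    genotypes: animals x SNPs allele counts, columns ordered by map position.
    snp_effects: estimated allele substitution effect for each SNP.
    """
    g = np.asarray(genotypes, dtype=float)
    u = np.asarray(snp_effects, dtype=float)
    contrib = g * u                               # genotype x effect, per animal and SNP
    gebv_var = contrib.sum(axis=1).var(ddof=1)    # variance of the whole-genome GEBV
    n_windows = g.shape[1] - window_size + 1
    props = np.empty(n_windows)
    for start in range(n_windows):
        wgebv = contrib[:, start:start + window_size].sum(axis=1)
        props[start] = wgebv.var(ddof=1) / gebv_var
    return props

# Toy data: 4 animals x 6 SNPs, with only the last SNP having a nonzero effect
toy_genotypes = np.array([
    [0, 1, 2, 0, 1, 0],
    [1, 1, 0, 2, 1, 1],
    [2, 0, 1, 1, 0, 2],
    [0, 2, 1, 1, 2, 1],
])
toy_effects = [0, 0, 0, 0, 0, 1.0]
props = window_variance_proportions(toy_genotypes, toy_effects)
```

In the toy example the first window (SNPs 1 to 5) explains none of the GEBV variance and the second window (SNPs 2 to 6) explains all of it, because only the sixth SNP has a nonzero effect.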

Haplotype Analyses

SNPs from the most significant 150 SNP windows for each trait were clustered into haplotypes using the Gabriel blocks formation method [20] implemented in Haploview software version 4.1 [21]. SNP genotypes in the 150 SNP windows were phased for each animal using the FastPHASE version 1.2.3 software [22]. ASReml 3 software [23] was used to estimate haplotype effects using the same statistical models as the single SNP analyses, except a random animal effect based on pedigree information was also fitted. The number of copies of an individual haplotype was fitted as a fixed covariate instead of the simultaneous fitting of random SNP effects.
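The haplotype covariate used in these models is simply the number of copies (0, 1, or 2) of a given haplotype that each animal carries. A minimal sketch of that count is given below, assuming the phased alleles for each animal's two chromosomes are available as two animals-by-SNPs arrays; this in-memory representation, the function name `haplotype_copies`, and the toy alleles are hypothetical and do not correspond to the fastPHASE output format.

```python
import numpy as np

def haplotype_copies(phase_a, phase_b, target):
    """Number of copies (0, 1, or 2) of the target haplotype carried by each animal."""
    a = np.asarray(phase_a)
    b = np.asarray(phase_b)
    t = np.asarray(target)
    on_a = (a == t).all(axis=1)   # first chromosome matches at every SNP in the block
    on_b = (b == t).all(axis=1)   # second chromosome matches at every SNP in the block
    return on_a.astype(int) + on_b.astype(int)

# Toy block of three SNPs for three animals
phase_a = [[0, 1, 1], [0, 1, 1], [1, 0, 0]]
phase_b = [[0, 1, 1], [1, 0, 0], [1, 1, 0]]
copies = haplotype_copies(phase_a, phase_b, target=[0, 1, 1])
```

The resulting vector of counts would then enter the mixed model as a fixed covariate, one haplotype at a time.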

Allele Frequency Differences Between Lines

Base population allele frequencies were estimated based on the 69 animals from

generation zero of the select line. Allele frequencies were assumed to be the same in

generation zero of the control line due to full siblings being used to found the two lines. To

account for allele frequency differences resulting from drift or sampling errors within a

generation and line, the allele frequencies for each SNP calculated from generations four and

five of the control line were averaged across generations and compared to the base

population allele frequencies. Differences between allele frequencies in generations four to

six of the select line and base population allele frequencies were then compared to the

expectations from drift and sampling errors alone, both on a SNP-by-SNP basis and by

computing a test statistic. The test statistic was the allele frequency difference between the

select animals across generations four through six and generation zero for a SNP divided by

the standard deviation of the difference between the control animals and generation zero

across all SNPs. This test statistic follows approximately a standard normal distribution. R

software (www.r-project.org) was used to determine the p-value cut-off for a 20% false discovery rate (FDR) based on the method of Benjamini & Hochberg [24]. Based on this computation, any values with p < 0.00029 were considered significant.
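The test statistic and the false discovery rate cut-off can be sketched in a few lines. The analysis itself was run in R; the Python version below is only an illustration, and it assumes a two-sided test against the standard normal distribution, which the text does not state explicitly. The function names and toy values are hypothetical.

```python
import math
import numpy as np

def drift_z_scores(select_freq, control_freq, base_freq):
    """(select - base) per SNP, scaled by the SD of (control - base) across all SNPs."""
    select_diff = np.asarray(select_freq, dtype=float) - np.asarray(base_freq, dtype=float)
    control_diff = np.asarray(control_freq, dtype=float) - np.asarray(base_freq, dtype=float)
    return select_diff / control_diff.std(ddof=1)

def two_sided_p(z):
    """Two-sided p-values under a standard normal null distribution."""
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in np.atleast_1d(z)])

def bh_cutoff(p_values, fdr=0.20):
    """Largest p-value rejected by the Benjamini-Hochberg procedure, or None."""
    p = np.sort(np.asarray(p_values, dtype=float))
    m = len(p)
    passed = p <= fdr * np.arange(1, m + 1) / m
    return float(p[passed].max()) if passed.any() else None

# Toy example with two SNPs
z = drift_z_scores(select_freq=[0.9, 0.5], control_freq=[0.5, 0.6], base_freq=[0.5, 0.5])
p = two_sided_p(z)
cutoff = bh_cutoff([0.001, 0.008, 0.039, 0.041, 0.9], fdr=0.05)
```

In the Benjamini-Hochberg step, the sorted p-values are compared with the thresholds (k/m) × FDR, and every p-value at or below the largest one that passes is declared significant; applied to the real data at a 20% FDR, this is the kind of calculation that yields a cut-off such as the 0.00029 reported above.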

Genes and Gene Pathways

Within the regions with significant SNP window p-values, significant haplotypes,

and/or significant allele frequency differences between lines, genes were identified based on

NCBI blastn searches of the human genome and transcripts with query sequences from build

10 of the Sus scrofa genome. Functional annotation of genes was obtained based on literature

searches. DAVID software (http://david.abcc.ncifcrf.gov) was used to cluster genes within

significant regions by function.

RESULTS

Data Evaluation

Phenotypic data were close to normally distributed for all traits. Phenotypic means

(±SE) were as follows: 1.889±0.008 kg/day for ADFI, 0.678±0.004 kg/day for ADG,

16.67±0.14 mm for BF, 0.18±0.03 kg/day for RFI, and 42.14±0.18 cm² for LMA.

Of the 64,232 SNPs, 4,232 SNPs with MAF less than 0.0001 were removed from
analyses, and 4,467 SNPs did not meet the criterion of being genotyped successfully in greater than 80% of the animals, leaving 55,533 SNPs for analysis. For the SNPs included in analyses, genotyping call rates for individual animals ranged from 97.7% to 99.9%.

Single SNP Analyses

The unmapped SNP, ASGA0096364, had the largest estimated effects on both BF

and LMA. The allele that produced an increase in BF was estimated to reduce LMA. The

SNP window results depicted in Figure 1 exclude this SNP for BF and LMA.

Haplotype Analyses

Over 80% of the haplotype blocks created from the most significant SNP windows

for each trait (Table 1) were found to contain at least one haplotype that was significant at the

p < 0.05 level (Table 2). Some overlap existed between the positions of the most significant

haplotypes (p < 0.001) for different traits. RFI and ADFI shared significant regions near 68-

70 Mb (build 10 genomic position) on SSC3 by the LRRTM4 gene, at 120 Mb on SSC3 by the KLHL29 gene, at 125 Mb on SSC3 by the TTC32 gene, and at 60 Mb on SSC14 by the

ERO1LB gene. The same region on SSC14 was also significant for ADG. ADFI and ADG

also shared significant haplotype regions near MC4R at 166-168 Mb on SSC1 and at 117-120

Mb on SSC7 near an uncharacterized gene, F1SDY6_PIG. No overlap (defined as regions within 5 Mb of each other) existed between the most significant haplotype regions for BF or LMA and those for any of the other traits.

Allele Frequency Differences Between Lines

Using a p-value cut-off of 0.00029 to establish significance, 58 SNPs were considered

to have allele frequency changes beyond those expected due to drift and sampling errors

alone between generations four to six of the select line and the founding population. Of these

SNPs, 19 were positioned within a 3.5 Mb span on SSC9 [75.0Mb-78.5Mb genomic position

in build 10], which included the STEAP4, YBX1, ADAM22, CDK14, and HSPD1 genes.

Another eight SNPs were located on SSC1 [171.3Mb-175.8Mb] surrounding the FEM1B,

ANP32A, MAP2K1, MAP2K5, SMAD3, and SMAD6 genes, and seven SNPs were located on

SSC12 [0.07Mb-0.5Mb] around the FOXK2, FN3K, and RAB40B genes. DAVID software

analysis showed an enrichment of genes in these regions which regulate endopeptidase

activity. Within these regions, allele frequency changes ranged from 0.44 to 0.52 while

changes due to drift in the control line ranged from 0.0 to 0.15 for the same SNPs.

DISCUSSION

Long-range LD resulting from selection and inbreeding could create spurious

associations between SNPs and traits when using statistical approaches that only analyze one

SNP at a time, but should be reduced by the use of Bayesian models that account for LD. The

fitting of haplotypes in ASReml software in this study confirmed many of the top SNP

windows were statistically significant using a more traditional approach and helped to

validate the use of a Bayesian approach. This study was meant as a preliminary analysis to

identify targets for further study, and as such, multiple testing corrections were not applied to

the results to avoid a high rate of false negative errors.

Of the significant genes for RFI based on the WGAS, PLIN2, PLIN4, PLIN5, and

DOLPP1 are all involved in fat production and maintenance. Protein turnover was found to be important for RFI in beef cattle when RFI was defined as feed intake adjusted for weight gain [10]. It makes sense that the swine definition of RFI, which includes an adjustment for

BF, would also pick up genes for fat maintenance. Selection for lowered RFI reduced BF deposition [25], which supports the idea that genes for fat production and maintenance may be significant for RFI.

DAVID analyses showed post-transcriptional gene expression regulation to be a major contributor to RFI, indicating that downstream changes in gene expression may cause bigger differences in RFI than mutations in those downstream genes. Genes such as PTEN,

SOX17, MKNK1, and MKNK2 were categorized as expression regulators. MKNK1 and

MKNK2 are both thought to play a role in environmental stress response [26], which was another process that [10] implicated in creating RFI differences in cattle.

Other processes that may be involved in altering feed efficiency based on these gene results include metabolism (CRAT), monocarboxylate transport (SLC5A12), mitochondrial

function (ATPAF1), and thermoregulation and appetite control (HTR2A). Obviously, many

pathways are involved in altering animals’ feed efficiency, so follow-up studies are needed to confirm the relative importance of these various processes and confirm which genes in each pathway are the most important.

None of the most significant SNPs or SNP windows from the WGAS for RFI

overlapped with the regions with the largest changes in allele frequencies between lines. Two

SNPs with significantly changed allele frequencies were located 0.6 Mb away from the

important region for RFI on SSC4 at 81 Mb. The region on SSC1 that had significant

changes in frequencies was 3.0 Mb away from the MC4R gene and was close to reaching the significance threshold for the WGAS for RFI. This region, however, did overlap with one of the most significant regions for ADFI using both the WGAS and haplotype approaches and was 3.6 Mb away from one of the most significant regions for ADG. LD patterns in this region showed evidence that marker order was inaccurate for this population, making it difficult to draw conclusions based on physical distances (data not shown).

Because line was fitted as a fixed effect in the WGAS model to account for the random drift expected between the lines, within-line variation contributes significantly to the WGAS results and causes them to differ from the allele frequency change results.

Removing the main effect of line from the model did not significantly alter which genomic regions were the most significant for RFI (data not shown), likely due to the generation*parity*line*on-test age fixed effect picking up most of the line effects.

Unfortunately, RFI was not calculated in such a manner that line could be completely eliminated from the fixed effects in this model to discover the genes responsible for differences between lines.

For the traits other than RFI, some of the genes had previously been shown to be involved, such as MC4R for ADFI and ADG [27], while others were novel results. The

NUAK1 gene, for instance, was located in a significant region for ADFI and is involved in glucose starvation tolerance. Meanwhile, the PHLPP1 gene, which was in a significant region for ADG, encodes a phosphatase that can terminate Akt signaling [28], which in turn can regulate cell survival, energy metabolism, and glucose uptake [29].

These gene results hold promise for being able to use marker-assisted selection to

select for animals that utilize feed more efficiently. Selection for RFI has been shown to be

effective [1,7], but feed intake data are expensive to collect. Use of gene markers for RFI in

selection could be a more effective strategy for reducing feed costs without significantly

impacting gain.

Another potential use of these results is for developing feed additives or other drug treatments to improve feed efficiency. Multiple companies have been developing drugs to target the 5-HT2C receptor, for example, to reduce obesity [30]. This receptor is closely

related to the 5-HT2A serotonin receptor encoded by the HTR2A gene that was found to be

important in RFI in this study. Pharmaceuticals targeting 5-HT2A or other proteins that are found to be important in making pigs utilize feed more efficiently could be a cost-effective method to reduce the amount of feed the world needs to grow for the purpose of livestock production.

In conclusion, many genes were found to play a role in controlling feed efficiency in growing pigs. Targeting genes involved either more broadly in gene expression regulation or more specifically in processes such as appetite control for either marker-assisted selection or drug development could be cost-effective methods for improving feed efficiency. Further studies with additional animals and in different populations are needed to confirm and fine-map the associations reported in this study.

ACKNOWLEDGEMENTS

Funding was provided by the National Pork Board, Iowa Agricultural Home

Economics Experiment Stations, and State of Iowa and Hatch funding. FIRE Feeders were donated by PIC USA. Support for D. M. Gorbach and J. M. Young was provided by USDA

National Needs Graduate Fellowship Competitive Grant No. 2007-38420-17767 from the

National Institute of Food and Agriculture.

REFERENCES

1. Cai W, Casey DS, Dekkers JCM (2008) Selection response and genetic parameters

for residual feed intake in Yorkshire swine. J Anim Sci 86: 287-298.

2. Koch RM, Swiger LA, Chambers D, Gregory KE (1963) Efficiency of feed use in

beef cattle. J Anim Sci 22(2): 486-494.

3. Foster WH, Kilpatrick DJ, Heaney IH (1983) Genetic variation in the efficiency of

energy utilization by the fattening pig. Anim Prod 37: 387-393.

4. Mrode RA, Kennedy BW (1993) Genetic variation in measures of food efficiency in

pigs and their genetic relationships with growth rate and backfat. Anim Prod 56: 225-

232.

5. Von Felde A, Roehe R, Looft H, Kalm E (1996) Genetic association between feed

intake and feed intake behavior at different stages of growth of group-housed boars.

Livest Prod Sci 47: 11-12.

6. Johnson ZB, Chewning JJ, Nugent RA (1999) Genetic parameters for production

traits and measures of residual feed intake in large white swine. J Anim Sci 77: 1679-

1685.

7. Gilbert H, Bidanel JP, Gruand J, Caritez JC, Billon Y, et al. (2007) Genetic

parameters for residual feed intake in growing pigs, with emphasis on genetic

relationships with carcass and meat quality traits. J Anim Sci 85: 3182-3188.

8. Hoque MA, Kadowaki H, Shibata T, Oikawa T, Suzuki K (2009) Genetic parameters

for measures of residual feed intake and growth traits in seven generations of Duroc

pigs. Livestock Sci 121: 45-49.

9. Young JM, Cai W, Dekkers JCM (2011) Effect of selection for residual feed intake

on feeding behavior and daily feed intake patterns in Yorkshire swine. J Anim Sci 89:

639-647.

10. Richardson EC, Herd RM (2004) Biological basis for variation in residual feed intake

in beef cattle. 2. Synthesis of results following divergent selection. Aust J Exp Agr

44: 431-440.

11. Barendse W, Reverter A, Bunch RJ, Harrison BE, Barris W, et al. (2007) A validated

whole-genome association study of efficient food conversion in cattle. Genetics 176:

1893-1905.

12. Meuwissen TH, Hayes BJ, Goddard ME (2001) Prediction of total genetic value

using genome-wide dense marker maps. Genetics 157(4): 1819-1829.

13. Fan B, Onteru SK, Du Z-Q, Garrick DJ, Stalder KJ, et al. (2011) Genome-wide

association study identifies loci for body composition and structural soundness traits

in pigs. PLoS ONE 6(2): e14726.

14. Onteru SK, Fan B, Nikkilä MT, Garrick DJ, Stalder KJ, et al. (2011) Whole-genome

association analyses for lifetime reproductive traits in the pig. J Anim Sci 89(4): 988-

995.

15. Ramos AM, Crooijmans RPMA, Affara NA, Amaral AJ, Archibald AL, et al. (2009)

Design of a high density SNP genotyping assay in the pig using SNPs identified and

characterized by next generation sequencing technology. PLoS ONE 4: e6524.

16. Casey DS, Stern HS, Dekkers JCM (2005) Identifying errors and factors associated

with errors in data from electronic swine feeders. J Anim Sci 83: 969-982.

17. Cai W, Kaiser MS, Dekkers JCM (2011) Genetic analysis of longitudinal

measurements of performance traits in selection lines for residual feed intake in

Yorkshire swine. J Anim Sci 89: 1270-1280.

18. Fernando RL, Garrick DJ (2008) GenSel – User manual for a portfolio of genomic

selection related analyses. Animal Breeding and Genetics, Iowa State University,

Ames.

19. Kizilkaya K, Fernando RL, Garrick DJ (2010) Genomic prediction of simulated

multibreed and purebred performance using observed fifty thousand single nucleotide

polymorphism genotypes. J Anim Sci 88(2): 544-551.

20. Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, et al. (2002) The structure of

haplotype blocks in the human genome. Science 296: 2225-2229.

21. Barrett JC, Fry C, Maller J, Daly MJ (2005) Haploview: analysis and visualization of

LD and haplotype maps. Bioinformatics 21(2): 263-265.

22. Scheet P, Stephens M (2006) A fast and flexible statistical model for large-scale

population genotype data: applications to inferring missing genotypes and haplotypic

phase. Am J Hum Genet 78: 629-644.

23. Gilmour AR, Cullis BR, Thompson R (2008) ASReml Reference Manual. NSW

Department of Primary Industries, Orange, Australia. http://www.genome.iastate.edu/

bioinfo/resources/manuals/ASReml/asreml3.pdf

24. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and

powerful approach to multiple testing. J R Stat Soc B 57: 289-300.

25. Boddicker N, Gabler NK, Spurlock ME, Nettleton D, Dekkers JCM (2011) Effects of

ad libitum and restricted feed intake on growth performance and body composition of

Yorkshire pigs selected for reduced residual feed intake. J Anim Sci 89(1): 40-51.

26. Kumar S, Boehm J, Lee JC (2003) p38 MAP kinases: key signaling molecules as

therapeutic targets for inflammatory diseases. Nat Rev Drug Discov 2: 717-726.

27. Kim KS, Larsen N, Short T, Plastow G, Rothschild MF (2000) A missense variant of

the porcine melanocortin-4 receptor (MC4R) gene is associated with fatness, growth,

and feed intake traits. Mamm Genome 11(2): 131-135.

28. Gao T, Furnari F, Newton AC (2005) PHLPP: a phosphatase that directly

dephosphorylates Akt, promotes apoptosis, and suppresses tumor growth. Mol Cell

18(1): 13-24.

29. Aviv Y, Kirshenbaum LA (2010) Novel phosphatase PHLPP-1 regulates

mitochondrial Akt activity and cardiac cell survival. Circ Res 107: 448-450.

30. Dutton AC, Barnes NM (2006) Anti-obesity pharmacotherapy: future perspectives

utilizing 5-HT2C receptor agonists. Drug Discov Today: Therapeutic Strategies 3(4):

577-583.

Table 1. Significant (p < 0.01) SNP window results for feed intake and growth traits based on the Bayesian analysis

Trait | Number of significant regions | Chromosomes | Genes of potential interest
RFI | 24 | 1, 2, 3, 4, 6, 8, 10, 11, 13, 14, 15, unknown | PTEN, PLIN2, PLIN4, PLIN5, HTR2A, ANO3, SLC5A12, ERO1LB, MKNK1, MKNK2, SOX17, INPP4B, RGS18, ARID4B, DOLPP1, CRAT, ATPAF1
ADFI | 22 | 1, 3, 4, 5, 7, 9, 10, 13, 14, 16 | NUAK, PIAS1, MC4R
ADG | 16 | 1, 5, 6, 7, 8, 12, 14, 15, 17 | PHLPP1, DOCK10, MC4R
BF | 19 | 5, 6, 8, 9, 11, 12, 13, 15, 17, X, unknown | WWOX, ATF3, PPP2R5A, PIK3R5, BRWD1, TM2D2, ADAM9, NDP
LMA | 18 | 1, 2, 4, 5, 6, 14, 15, 16, 18, unknown | MED13L, CDC25A, HESX1

Table 2. Significant haplotype blocks created from the top 150 SNP windows based on the Bayesian analyses for each trait

Trait | Number of haplotype blocks | Number significant at p<0.05* | Number significant at p<0.001* | Number of SNPs per haplotype | Most significant chromosomes (p<0.001)
RFI | 53 | 47 | 28 | 2 to 12 | 1, 2, 3, 4, 6, 10, 11, 14, unknown
ADFI | 44 | 37 | 16 | 2 to 27 | 1, 3, 6, 7, 9, 11, 13, 14, 16
ADG | 34 | 34 | 15 | 2 to 39 | 1, 6, 7, 12, 14, 15
BF | 42 | 25 | 9 | 2 to 24 | 11, 12, 13, 15, X
LMA | 48 | 38 | 16 | 2 to 11 | 1, 2, 13, 14, 15, 16, unknown

*indicates at least one haplotype in the block is significant at the stated p-value threshold

Figure 1: Proportion of variance in total GEBV explained by each window of five consecutive markers in the genome for RFI (a), ADFI (b), ADG (c), BF (d), and LMA (e). In order to show results for mapped SNPs in more detail, figures d and e exclude the five windows of largest effect that contain unmapped SNPs. Markers are positioned based on order in the genome with monochromatic blocks representing chromosomes. SSC1 is on the left, SSC18 in gray on the right, followed by SSCX in black and unmapped markers in gray. Note: y-axes differ in scale.

CHAPTER 7. PREDICTING PLEIOTROPY FOR FEED INTAKE AND GROWTH TRAITS IN SWINE USING RELATIONSHIPS BETWEEN SNP WINDOWS’ GENOMIC ESTIMATED BREEDING VALUES

A paper prepared for submission to Genetics.

D.M. Gorbach, W. Cai, J.C.M. Dekkers, D.J. Garrick, J.M. Young, R.L. Fernando, and M.F. Rothschild

ABSTRACT

Genetic correlations between traits have been studied for many years, but genome-wide scans that identify cryptic genes and genomic regions that differ from or break the overall correlation between traits are lacking. This study focused on identifying genes or genomic regions underlying such cryptic correlations (i.e., pleiotropic effects that go against the overall genetic correlation) between residual feed intake (RFI), average daily feed intake (ADFI), average daily gain (ADG), and backfat in Sus scrofa. Following estimation of single nucleotide polymorphism (SNP) effects for over 55,000 SNPs using a Bayesian approach, window genomic estimated breeding values (wGEBVs) were calculated for five-SNP windows across the genome for each of 716 pigs. For each SNP window, correlations and covariances were computed between wGEBVs for each pair of traits. Most covariances were slightly greater than zero for all trait combinations. The NTN1 gene, an apoptosis regulator, was in a region with strong negative covariances between backfat and RFI and between backfat and ADFI. Another gene involved in apoptosis, DAPK3, was within significant SNP windows for the covariance between ADFI and ADG. Many of the other genes implicated as having cryptic pleiotropic effects had varied functions that could be related to the traits under study. Overall, these results provide information about cryptic genes with beneficial

pleiotropic effects and are potentially useful for improving livestock selection practices by

understanding the genes that simultaneously control multiple traits of interest.

INTRODUCTION

Genetic correlations between traits result in sometimes undesirable changes in

phenotype for one trait due to selection for another trait. Genetic correlations between traits

are primarily due to either pleiotropism or linkage disequilibrium (LD) between genes

(DOBZHANSKY 1970). Many examples of pleiotropic genes have been studied over the years, such as white, yellow and vermilion genes in Drosophila melanogaster

(DOBZHANSKY & HOLZ 1943) and the melanocortin-4 receptor gene in Sus scrofa (KIM et al. 2000). As described by WRIGHT (1956), selection on pleiotropic alleles will likely be determined by the total selective value of the allele across all traits under selection, which could result in negative characteristics being selected along with more favorable ones.

Although estimates of genetic correlations between traits vary widely depending on the population, trait measurement method, and environmental influences, correlation estimates between strongly related traits tend to have a consistent direction. Estimates for genetic correlations between residual feed intake (RFI), average daily feed intake (ADFI), average daily gain (ADG), and backfat (BF) for pigs with ad libitum or semi-ad libitum access to feed are as follows: 0.16 to 0.77 for RFI vs. ADFI, -0.35 to 0.24 for RFI vs. ADG,

-0.51 to 0.20 for RFI vs. BF (GILBERT et al. 2007; CAI et al. 2008; BERGSMA et al. 2010;

BUNTER et al. 2010), 0.32 to 0.89 for ADFI vs. ADG, 0.08 to 0.64 for ADFI vs. BF, and

-0.26 to 0.55 for ADG vs. BF (CLUTTER 2011). These genetic correlation estimates suggest that selection for one of these four traits should impact the other three, and this impact has indeed been observed in selection experiments such as those described by CAI et al. (2008)


and GILBERT et al. (2007). Although RFI is phenotypically designed to be uncorrelated with ADG and BF, selection for decreased RFI (more feed efficient animals) has resulted in some reductions in both ADG and BF in the Iowa State University selection line for low RFI due to genetic correlations between the traits (CAI et al. 2008). Although the overall genetic correlations between these traits may be slightly to moderately positive, it would be useful to identify hidden (cryptic) genomic regions where a negative correlation exists between traits such as RFI and ADG in an attempt to identify pleiotropic genes that may have favorable effects on both traits. Consequently, the objective of this study was to identify genomic regions which differ from the overall genetic correlations between RFI, ADFI, ADG, and BF.

MATERIALS AND METHODS

Two selection lines for residual feed intake (RFI) that were derived from a single purebred Yorkshire pig population, as described by CAI et al. (2008), were used for this study. One line was selected for decreased RFI for six generations; the other line was randomly selected for five generations followed by one generation of selection for increased RFI. From generations zero, four, five, and six, a total of 730 animals with phenotypes for the traits of interest were genotyped with the Porcine SNP60 BeadChip (Illumina, San Diego, CA, USA; RAMOS et al. 2009) by GeneSeek, Inc. (Lincoln, NE, USA). Single nucleotide polymorphisms (SNPs) that were not segregating in the population (minor allele frequency less than 0.0001) or that genotyped unsuccessfully in greater than 20% of animals were removed from analyses, as described in GORBACH et al. (2011).
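The two SNP filters described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the procedure of GORBACH et al. (2011): the function name `passes_qc`, the 0/1/2 allele-count coding, and the use of `None` for a failed genotype call are all hypothetical choices made for the example.

```python
def passes_qc(genotypes, min_maf=0.0001, max_missing=0.20):
    """Return True if a SNP survives both filters.

    genotypes: one entry per animal, coded as 0, 1, or 2 copies of one
    allele, or None where genotyping failed (hypothetical coding).
    """
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    # Filter 1: drop SNPs that failed in more than 20% of animals.
    missing_rate = (len(genotypes) - len(called)) / len(genotypes)
    if missing_rate > max_missing or not called:
        return False
    # Filter 2: drop SNPs that are not segregating (MAF below threshold).
    allele_freq = sum(called) / (2 * len(called))
    maf = min(allele_freq, 1 - allele_freq)
    return maf >= min_maf


# Toy data: three SNPs genotyped on ten animals (invented values).
snps = {
    "monomorphic": [0] * 10,
    "poor_call_rate": [0, 1, None, None, None, 2, 1, 0, 1, 2],
    "retained": [0, 1, 2, 1, 0, None, 1, 2, 0, 1],
}
kept = [name for name, g in snps.items() if passes_qc(g)]
```

In this toy data set only the third SNP is kept: the first has a minor allele frequency of zero and the second failed in 30% of animals.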

Phenotypes analyzed in this study included RFI, ADFI, ADG, and tenth-rib BF. Feed intake was measured from approximately 90 to 210 days of age using electronic FIRE feeders (Osborne, KS, USA). ADG was measured over the same time period. BF was evaluated with an Aloka ultrasound machine (Corometrics Medical Systems, Inc., Wallingford, CT, USA) at approximately 115 kg. RFI was computed as in GORBACH et al. (2011), where ADFI was adjusted for on-test, metabolic mid-test, and off-test weights, and for ADG and BF.

SNP associations with the described phenotypes were estimated as described in GORBACH et al. (2011). In brief, the Bayes C model averaging approach from the GenSel software (http://bigs.ansci.iastate.edu; FERNANDO & GARRICK 2008) was used to estimate additive SNP effects with a prior of 0.5% of SNPs assumed to have a non-zero effect on phenotype during each iteration of the Markov chain. SNP effects were based on the posterior distributions of effects from the last 40,000 iterations of the 50,000 iteration chain.
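To make the burn-in step concrete, the sketch below summarizes a chain of sampled effects for a single SNP after discarding the early iterations. It is not GenSel code. The function name `summarize_chain` and the convention that a sampled effect of 0.0 marks an iteration in which the SNP was left out of the model are assumptions made for illustration.

```python
def summarize_chain(samples, burn_in):
    """Posterior mean effect and inclusion frequency for one SNP.

    samples: sampled effect of the SNP at each chain iteration, with 0.0
    in iterations where the SNP was excluded from the model (assumed).
    burn_in: number of initial iterations to discard.
    """
    kept = samples[burn_in:]
    posterior_mean = sum(kept) / len(kept)
    inclusion = sum(1 for s in kept if s != 0.0) / len(kept)
    return posterior_mean, inclusion


# Toy chain of 10 iterations with the first 2 discarded. The study used a
# 50,000-iteration chain and kept the last 40,000 iterations.
chain = [0.9, -0.4, 0.0, 0.0, 0.2, 0.0, 0.0, 0.2, 0.0, 0.0]
mean_effect, inclusion = summarize_chain(chain, burn_in=2)
```

Averaging over all retained iterations, including those where the effect is zero, is what makes the estimate a model average: SNPs that are rarely included end up with effects shrunk toward zero.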

For each window of five consecutive SNPs in the pig genome, based on build 10 (ftp.tgac.bbsrc.ac.uk), each animal’s genomic estimated breeding value for the five-SNP window (wGEBV) was computed by multiplying the number of copies of an allele for each SNP in the window by the estimated effect of that allele and summing these effects across the five SNPs. Once a wGEBV was computed for each animal for each five-SNP window for each trait (RFI, ADFI, ADG, and BF), correlations and covariances were computed between the wGEBVs of each pair of traits for each five-SNP window. Distributions of correlations and covariances were examined and results were plotted against position in the genome using R software (http://www.r-project.org).
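The window calculation can be illustrated with a short Python sketch. The original analysis was done in R. The function names, the toy genotypes and effects, and the use of the sample (n - 1) covariance are assumptions for this example. The text does not state how far consecutive windows were shifted, so the window start is left as a parameter.

```python
from math import sqrt


def window_gebvs(genotypes, effects, start, size=5):
    """wGEBV per animal: sum over the window of allele count x effect."""
    return [
        sum(animal[j] * effects[j] for j in range(start, start + size))
        for animal in genotypes
    ]


def covariance(x, y):
    """Sample covariance (n - 1 denominator; an assumption of this sketch)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Pearson correlation; NaN if either wGEBV vector has no variance."""
    denom = sqrt(covariance(x, x) * covariance(y, y))
    return covariance(x, y) / denom if denom > 0 else float("nan")


# Toy data: 4 animals x 6 SNPs, allele counts coded 0/1/2 (invented).
genotypes = [
    [0, 1, 2, 0, 1, 2],
    [1, 1, 0, 2, 0, 1],
    [2, 0, 1, 1, 2, 0],
    [0, 2, 1, 0, 1, 1],
]
effects_adg = [0.1, 0.0, 0.2, 0.0, -0.1, 0.3]  # made-up SNP effects
effects_bf = [-0.2, 0.0, 0.1, 0.0, 0.3, 0.0]

w_adg = window_gebvs(genotypes, effects_adg, start=0)
w_bf = window_gebvs(genotypes, effects_bf, start=0)
cov_window = covariance(w_adg, w_bf)
corr_window = correlation(w_adg, w_bf)
```

Repeating the last four lines for every window start and every pair of traits would give one covariance and one correlation per window per trait pair, which is the quantity summarized in the tables and figure of this chapter.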

For each pair of traits, SNP windows whose correlations had a sign opposite from expectations based on previously published genetic correlations between traits were examined further. Genes in positions that overlapped with the five-SNP windows with the strongest covariances were examined based on NCBI blastn searches with sequence from build 10 of the porcine genome. BLAST results with an e-value less than 1e-75 were considered. Gene functions were determined by literature searches.
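The selection steps in this paragraph could look roughly like the following sketch. Because a covariance and its correlation always share a sign, windows are screened here on the sign of the covariance. The function names, the tuple layouts, and all window and gene identifiers are hypothetical, and no real BLAST output format is parsed.

```python
def antagonistic_windows(window_covs, expected_sign, top_n=10):
    """Windows whose covariance sign opposes the expected correlation sign,
    ordered from strongest to weakest covariance.

    window_covs: list of (window_id, covariance) pairs.
    expected_sign: +1 or -1, the sign of the published genetic correlation.
    """
    opposite = [(w, c) for w, c in window_covs if c * expected_sign < 0]
    opposite.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return opposite[:top_n]


def confident_hits(blast_hits, max_evalue=1e-75):
    """Keep (gene, e_value) hits with an e-value below the cutoff."""
    return [(gene, e) for gene, e in blast_hits if e < max_evalue]


# Toy inputs; window names, gene names, and values are invented.
covs = [("w1", 3.0e-8), ("w2", -5.0e-9), ("w3", -7.5e-9), ("w4", -1.0e-10)]
top = antagonistic_windows(covs, expected_sign=+1, top_n=2)
hits = confident_hits([("geneA", 1e-120), ("geneB", 1e-20)])
```

With a positive expected correlation, the two strongest opposing windows in the toy list are `w3` and `w2`, and only `geneA` passes the e-value cutoff.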

RESULTS

The mean and median genetic correlations for the wGEBVs between the four traits generally had similar relative magnitudes to previously published estimates of overall genetic correlations between these traits in this population (Table 1). Median values were higher than mean values for all distributions of correlations based on the skewness of results toward the extremes, with more correlations being close to one in all comparisons (Figure 1). The distributions of covariances between wGEBVs were mostly skewed right, with most values close to zero and a tail of larger covariance values (Tables 2-8).

For RFI and ADFI, the wGEBV covariances were generally positive and ranged from -7.5 x 10^-9 to 3.0 x 10^-6 (Table 3). Three genes were located within the genomic range of the SNP windows with the 10 most negative covariances: GRM5 on Sus scrofa chromosome (SSC) 9, LOC100293817 on SSC15, and NCOA3 on SSC17 (Table 9). The regions containing the first and last genes mentioned had wGEBV correlations between RFI and ADFI of less than -0.99. The region containing LOC100293817 had a wGEBV correlation of -0.92.

For RFI and ADG, the maximum covariance (2.2 x 10^-7) between wGEBVs occurred on SSC3 near 70 Mb. The magnitude of the most negative covariances was smaller, at -3.7 x 10^-8 (Table 4). Based on BLAST searches, genes that likely fall within the regions having the most negative covariances are ASB1 (SSC15 at 147 Mb), OTC (SSCX at 36 Mb), and STAT4 (SSCX at 38 Mb).


For RFI and BF, the range of wGEBV covariances went from -1.4 x 10^-5 to 7.3 x 10^-6 (Table 5). The strongest positive covariances were attributable to SNP windows that contained an unmapped SNP (ASGA0096364) that was previously shown to have a very strong impact on BF in this population (GORBACH et al. 2011), an unmapped region that was predicted to contain the PLIN2 gene, and two regions on SSC11: at 1 Mb and at 83 Mb. Only one gene, CENPJ, was found in the SSC11 regions. The strongest negative covariances occurred in SNP windows on SSC2 and SSC12 in regions containing the SLC5A12 and NTN1 genes, respectively.

For ADG and ADFI, covariances between wGEBVs ranged from -5.7 x 10^-9 to 2.1 x 10^-6 (Table 6). A total of 23 genes were located in the 10 genomic regions with the most negative covariances. DAPK3 on SSC2 (covariance = -5.3 x 10^-9; correlation = -0.64) and SLC36A1 on SSC16 (covariance = -5.5 x 10^-9; correlation = -0.74) were the most interesting candidate genes based on function.

For BF and ADFI, wGEBV covariances ranged from -5.9 x 10^-6 to 1.1 x 10^-4 (Table 7). The strongest negative covariances between wGEBVs occurred in regions containing SOX17 on SSC4 (covariance = -4.1 x 10^-6; correlation = -0.91), NTN1 on SSC12 (covariance = -5.9 x 10^-6; correlation = -0.84), and MYO10 on SSC16 (covariance = -2.5 x 10^-6; correlation = -0.99).

For ADG and BF, the largest covariances (1.9 x 10^-5) between wGEBVs occurred on SSC9 at 141.8 Mb and on SSC13 at 32.2 Mb (Table 8). The most negative covariances (-3.2 x 10^-5) were the result of an unmapped SNP that was previously shown to have a very strong impact on BF in this population (GORBACH et al. 2011). Excluding the windows that contained that unmapped SNP, eight genes were located within the 10 SNP windows with the most negative covariances between ADG and BF. The most functionally interesting genes in these regions, SSTR2 and COG1, are both located on SSC12 near 8.5 Mb. The other gene within the most significant regions with known location and function is CHRDL1 (near 96 Mb on SSCX).

DISCUSSION

Genetic correlations between traits are the result of pleiotropy, a shared genetic regulatory element, or LD between the underlying genetic factors. The five-SNP windows examined in this study had an average length of 200 kb, indicating only the impact of LD on genetic correlations that occurred over a short range was evaluated in this study.

Based on results from WILKINSON et al. (1990), changes in genetic correlations and covariances were observed between traits in Drosophila melanogaster as a result of artificial selection. These results were expected due to both the Bulmer effect (BULMER 1971) and selection resulting in changes in allele frequencies of genes that affect one or both traits.

Changes in allele frequencies will affect the genetic variance of any trait on which that locus has an effect. Consequently, the genetic correlation between traits, but not the covariance, will be altered by changing frequencies at loci that only affect one of the traits. Allele frequency changes in pleiotropic genes will impact both genetic correlations and covariances.
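This argument can be checked numerically with a small sketch. It assumes purely additive loci in Hardy-Weinberg and linkage equilibrium, so that a locus with allele frequency p and effects a and b on two traits contributes 2p(1 - p)a^2 and 2p(1 - p)b^2 to the two genetic variances and 2p(1 - p)ab to their covariance. These simplifying assumptions, the function name, and the numbers are illustrative and are not taken from the study population.

```python
from math import sqrt


def genetic_moments(loci):
    """Additive variances, covariance, and correlation for two traits.

    loci: list of (p, a, b) with allele frequency p and additive effects
    a and b on traits A and B (hypothetical values).
    """
    var_a = sum(2 * p * (1 - p) * a * a for p, a, b in loci)
    var_b = sum(2 * p * (1 - p) * b * b for p, a, b in loci)
    cov_ab = sum(2 * p * (1 - p) * a * b for p, a, b in loci)
    return var_a, var_b, cov_ab, cov_ab / sqrt(var_a * var_b)


# Locus 1 is pleiotropic; locus 2 affects trait A only (b = 0).
before = [(0.5, 1.0, 1.0), (0.5, 1.0, 0.0)]
# Selection moves the frequency at locus 2 from 0.5 to 0.9.
after = [(0.5, 1.0, 1.0), (0.9, 1.0, 0.0)]

var_a0, var_b0, cov0, corr0 = genetic_moments(before)
var_a1, var_b1, cov1, corr1 = genetic_moments(after)
```

The frequency change at the single-trait locus lowers the variance of trait A, so the correlation rises while the covariance stays exactly the same. Changing the frequency at the pleiotropic locus instead would move both quantities.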

Based on this knowledge, selection for traits of economic importance in swine such as ADFI, ADG, and BF has likely altered the genetic correlations between these traits. While pleiotropy can understandably create genetic correlations between closely related traits such as feed intake and growth rate, artificial selection for production traits has likely increased the frequency of alleles that result in more desirable genetic relationships between economic traits, such as mutations which increase ADG while decreasing ADFI. Discovering these cryptic genes or genetic regions will be useful for understanding the biology that sets these traits apart from one another.

Genetic correlations between traits for SNP windows are highly influenced by the LD within the SNP window. If the loci in the window are in complete LD, then the correlations will be either +1 or -1 because every SNP was assigned only an additive (and non-zero) effect on each trait, so the wGEBVs for both traits are linear functions of the same genotypes. The frequency of strong LD is relatively high in this population due to the inbreeding that has resulted from a small starting population size followed by strong selection. The prevalence of correlations close to +1 and -1 seen in Figure 1 is due in part to this LD structure. The presence of strong LD is not indicative of pleiotropy, however, and thus window correlations by themselves are not evidence of pleiotropy. Window covariances do not suffer from this flaw for detecting pleiotropy, although they are strongly influenced by the allele frequencies in the study population. Consequently, window covariances provide more information on pleiotropy than correlations do, but follow-up studies are needed to determine whether pleiotropy is truly responsible for strong covariances.
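The effect of complete LD on window correlations can be demonstrated with toy data. In the sketch below every SNP in the first window carries the same allele count within an animal, which is one simple way to represent complete LD. The genotypes, the effects, and the helper names `wgebv` and `pearson` are invented for illustration.

```python
from math import sqrt


def wgebv(genotypes, effects):
    """Window GEBV per animal: sum of allele count x effect over the SNPs."""
    return [sum(g * e for g, e in zip(animal, effects)) for animal in genotypes]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


effects_adfi = [0.03, 0.00, -0.01, 0.02, 0.00]  # made-up effects, sum > 0
effects_bf = [-0.10, 0.00, 0.20, -0.30, 0.00]   # made-up effects, sum < 0

# Complete LD: all five SNPs share one allele count per animal.
complete_ld = [[g] * 5 for g in (0, 1, 2, 1, 0, 2)]
r_ld = pearson(wgebv(complete_ld, effects_adfi), wgebv(complete_ld, effects_bf))

# Incomplete LD: allele counts differ among the SNPs within an animal.
mixed = [
    [0, 1, 2, 0, 1],
    [1, 0, 0, 2, 2],
    [2, 2, 1, 1, 0],
    [1, 0, 2, 2, 1],
    [0, 1, 0, 1, 2],
    [2, 1, 1, 0, 0],
]
r_mixed = pearson(wgebv(mixed, effects_adfi), wgebv(mixed, effects_bf))
```

Under complete LD both wGEBVs are the shared allele count multiplied by the summed effects, so the correlation is -1 here simply because the two sums have opposite signs. With the mixed genotypes the correlation is still negative but clearly weaker than -1.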

Genomic regions that have a large impact on multiple traits simultaneously should have been detected in the genome-wide association studies conducted earlier for the individual traits (GORBACH et al. 2011). While some overlap with these previous findings exists, particularly for wGEBV covariances involving BF, much more overlap would be expected if positive covariance results were also considered in this study. The positive covariances had larger magnitudes for most of the trait pairs, and thus, are more likely to fall in regions that had larger impacts on the individual traits. The current analysis identified several regions that overlapped with results for one of the traits in a trait pair, but not for both traits. For example, all of the covariance regions for ADFI and BF overlapped with either


ADFI (SSC4 and SSC16) or BF (SSC12) regions, but did not reach the significance threshold in the other trait.

For the genes putatively underlying a negative (antagonistic) genetic correlation between RFI and ADFI, GRM5 is involved in neurotransmission, and NCOA3 serves as a coactivator for nuclear receptors such as those for steroids, thyroid hormone, and prostanoids. Within the most significant genomic regions for RFI versus ADG, ASB1 mediates protein ubiquitination and degradation, OTC encodes a mitochondrial matrix enzyme, and STAT4 is a transcription factor for immune genes. For BF and RFI, PLIN2, which is involved in adipocyte differentiation, is in a region with a strong positive correlation between the traits. For the same two traits, DAVID software analysis (http://david.abcc.ncifcrf.gov) classified all of the genes in the negative covariance regions as glycoproteins. The most interesting genes for ADG and ADFI appear to be involved in apoptosis (DAPK3) and amino acid and proton transport (SLC36A1). The SOX17, NTN1, and MYO10 genes within the most significant regions for BF versus ADFI encode a transcriptional regulator, an axon guide and apoptosis regulator, and an intracellular motor molecule, respectively. For ADG and BF, COG1 is required for normal glycoconjugate processing by the Golgi complex, and SSTR2 can inhibit adenylyl cyclase.

Many of the genes implicated in this study make functional sense for creating the observed trait correlations, but additional studies in other populations and species are needed to determine if these correlations and covariances between wGEBVs from SNP windows are shared across different populations. Certainly, selection for differences in RFI in this population may make the results less similar to other populations, but this needs to be investigated. The method employed in this research holds considerable promise for identifying regions that will be useful for purposes ranging from understanding basic biology to predicting the effects of multiple trait selection.

ACKNOWLEDGEMENTS

Funding was provided by the National Pork Board, Iowa Agricultural Home Economics Experiment Stations, and State of Iowa and Hatch funding. FIRE Feeders were donated by PIC USA. Support for D. M. Gorbach and J. M. Young was provided by USDA National Needs Graduate Fellowship Competitive Grant No. 2007-38420-17767 from the National Institute of Food and Agriculture.

REFERENCES

Bergsma, R., E. Kanis, M. W. A. Verstegen and E. F. Knol, 2010 Genetic correlations between lactation performance and growing-finishing traits in pigs. Proc. 9th World Congr. Genet. Appl. Livest. Prod., Leipzig, Germany, Aug. 1-6.

Bulmer, M. G., 1971 The effect of selection on genetic variability. Am. Nat. 105: 201-211.

Bunter, K. L., W. Cai, D. J. Johnston and J. C. M. Dekkers, 2010 Selection to reduce residual feed intake in pigs produces a correlated response in juvenile insulin-like growth factor-I concentration. J. Anim. Sci. 88: 1973-1981.

Cai, W., D. S. Casey and J. C. Dekkers, 2008 Selection response and genetic parameters for residual feed intake in Yorkshire swine. J. Anim. Sci. 86: 287-298.

Clutter, A. C., 2011 Genetics of performance traits, in The Genetics of the Pig, edited by M. F. Rothschild and A. Ruvinsky. CABI Press, Oxon, UK.

Dobzhansky, T., 1970 Genetics of the evolutionary process. Columbia University Press, New York and London.

Dobzhansky, T., and A. M. Holz, 1943 A re-examination of the problem of manifold effects of genes in Drosophila melanogaster. Genetics 28: 295-303.

Fernando, R. L., and D. J. Garrick, 2008 GenSel – User manual for a portfolio of genomic selection related analyses. Animal Breeding and Genetics, Iowa State University, Ames.

Gilbert, H., J. P. Bidanel, J. Gruand, J. C. Caritez, Y. Billon et al., 2007 Genetic parameters for residual feed intake in growing pigs, with emphasis on genetic relationships with carcass and meat quality traits. J. Anim. Sci. 85: 3182-3188.

Gorbach, D. M., W. Cai, J. M. Young, J. C. M. Dekkers, D. J. Garrick et al., 2011 Whole-genome association analyses for residual feed intake and related traits in swine. (in preparation).

Kim, K. S., N. Larsen, T. Short, G. Plastow and M. F. Rothschild, 2000 A missense variant of the porcine melanocortin-4 receptor (MC4R) gene is associated with fatness, growth, and feed intake traits. Mamm. Genome 11: 131-135.

Ramos, A. M., R. P. M. A. Crooijmans, N. A. Affara, A. J. Amaral, A. L. Archibald et al., 2009 Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLoS One 4: e6524.

Sun, X., D. Habier, R. L. Fernando, D. J. Garrick and J. C. M. Dekkers, 2011 Genomic breeding value prediction and QTL mapping of QTLMAS2010 data using Bayesian methods. BMC Proc. 5(Suppl 3): S13.

Wilkinson, G. S., K. Fowler and L. Partridge, 1990 Resistance of genetic correlation structure to directional selection in Drosophila melanogaster. Evolution 44: 1990-2003.

Wright, S., 1956 Modes of selection. Am. Nat. 90: 5-24.


Table 1. Estimates of genetic correlations between traits, overall GEBV correlations, and mean and median correlations of 5-SNP window wGEBVs between traits

Traits          Genetic correlation    Correlations of       Mean correlations       Median correlations
                estimates ± SE^a       whole-genome GEBVs    of SNP window wGEBVs    of SNP window wGEBVs
RFI vs. ADFI     0.52 ± 0.12           0.82                  0.58                    0.86
RFI vs. ADG      0.17 ± 0.18           0.40                  0.13                    0.26
RFI vs. BF      -0.14 ± 0.16           0.33                  0.09                    0.18
ADFI vs. ADG     0.88 ± 0.05           0.76                  0.57                    0.85
ADFI vs. BF      0.57 ± 0.10           0.42                  0.25                    0.47
ADG vs. BF       0.45 ± 0.13           0.20                  0.22                    0.42

^a Based on estimates from Cai et al. (2008) for the first 4 generations of these selection lines.

Table 2. Mean and median covariances of 5-SNP window wGEBVs and of whole-genome GEBVs between traits

Traits          Covariances of         Mean covariances        Median covariances
                whole-genome GEBVs     of SNP window wGEBVs    of SNP window wGEBVs
RFI vs. ADFI    4.0 x 10^-3            4.1 x 10^-9             5.0 x 10^-10
RFI vs. ADG     8.3 x 10^-4            3.9 x 10^-10            3.5 x 10^-11
RFI vs. BF      3.6 x 10^-2            1.3 x 10^-8             1.1 x 10^-9
ADFI vs. ADG    3.3 x 10^-3            4.5 x 10^-9             4.7 x 10^-10
ADFI vs. BF     9.3 x 10^-2            1.1 x 10^-7             8.2 x 10^-9
ADG vs. BF      1.9 x 10^-2            3.7 x 10^-8             3.1 x 10^-9

Table 3. Frequency table of covariances of 5-SNP window wGEBVs between RFI and ADFI on a log scale

Minimum         Maximum         Count
 1 x 10^-6       1 x 10^-5           5
 1 x 10^-7       1 x 10^-6         311
 1 x 10^-8       1 x 10^-7       3,366
 1 x 10^-9       1 x 10^-8      16,624
 1 x 10^-10      1 x 10^-9      20,384
 0               1 x 10^-10      5,955
-1 x 10^-10      0               3,644
-1 x 10^-9      -1 x 10^-10      4,683
-1 x 10^-8      -1 x 10^-9         735
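Frequency tables of this kind can be produced by assigning each window covariance to a signed power-of-ten bin. The sketch below shows one way to do that. The function name `decade_bin`, the label format, and the choice to place an exact zero in the non-negative bin are assumptions, and values lying exactly on a bin boundary may fall on either side because of floating-point rounding.

```python
from collections import Counter
from math import floor, log10


def decade_bin(value, floor_exp=-10):
    """Return (minimum, maximum) labels of the signed decade bin for value.

    Magnitudes below 10**floor_exp share the bin next to zero. Placing an
    exact zero in the non-negative bin is an assumption of this sketch.
    """
    sign = "-" if value < 0 else ""
    magnitude = abs(value)
    if magnitude < 10.0 ** floor_exp:
        inner, outer = "0", f"{sign}1x10^{floor_exp}"
    else:
        exponent = floor(log10(magnitude))
        inner = f"{sign}1x10^{exponent}"
        outer = f"{sign}1x10^{exponent + 1}"
    # For negative values the bound farther from zero is the minimum.
    return (outer, inner) if value < 0 else (inner, outer)


covariances = [3.0e-6, 4.1e-9, 5.0e-11, -7.5e-9, -2.0e-11]  # invented values
counts = Counter(decade_bin(c) for c in covariances)
```

The `floor_exp` parameter sets the width of the bins adjacent to zero. A value of -10 matches the layout of the RFI versus ADFI table, while -8 would match the tables whose innermost bins run from 0 to 1 x 10^-8.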


Table 4. Frequency table of covariances of 5-SNP window wGEBVs between RFI and ADG on a log scale

Minimum         Maximum         Count
 1 x 10^-7       1 x 10^-6          10
 1 x 10^-8       1 x 10^-7         488
 1 x 10^-9       1 x 10^-8       5,868
 1 x 10^-10      1 x 10^-9      16,253
 0               1 x 10^-10      9,562
-1 x 10^-10      0               8,336
-1 x 10^-9      -1 x 10^-10     12,128
-1 x 10^-8      -1 x 10^-9       2,993
-1 x 10^-7      -1 x 10^-8          69

Table 5. Frequency table of covariances of 5-SNP window wGEBVs between RFI and BF on a log scale

Minimum         Maximum         Count
 1 x 10^-6       1 x 10^-5         208
 1 x 10^-7       1 x 10^-6       3,068
 1 x 10^-8       1 x 10^-7      13,470
 0               1 x 10^-8      14,062
-1 x 10^-8       0              12,503
-1 x 10^-7      -1 x 10^-8      10,439
-1 x 10^-6      -1 x 10^-7       1,873
-1 x 10^-5      -1 x 10^-6          81
-1 x 10^-4      -1 x 10^-5           3

Table 6. Frequency table of covariances of 5-SNP window wGEBVs between ADFI and ADG on a log scale

Minimum         Maximum         Count
 1 x 10^-6       1 x 10^-5          10
 1 x 10^-7       1 x 10^-6         365
 1 x 10^-8       1 x 10^-7       3,351
 1 x 10^-9       1 x 10^-8      16,162
 1 x 10^-10      1 x 10^-9      20,137
 0               1 x 10^-10      6,296
-1 x 10^-10      0               4,091
-1 x 10^-9      -1 x 10^-10      4,740
-1 x 10^-8      -1 x 10^-9         555


Table 7. Frequency table of covariances of 5-SNP window wGEBVs between ADFI and BF on a log scale

Minimum         Maximum         Count
 1 x 10^-4       1 x 10^-3           2
 1 x 10^-5       1 x 10^-4          49
 1 x 10^-6       1 x 10^-5       1,035
 1 x 10^-7       1 x 10^-6       8,315
 1 x 10^-8       1 x 10^-7      17,290
 0               1 x 10^-8       9,553
-1 x 10^-8       0               7,940
-1 x 10^-7      -1 x 10^-8       9,329
-1 x 10^-6      -1 x 10^-7       2,134
-1 x 10^-5      -1 x 10^-6          60

Table 8. Frequency table of covariances of 5-SNP window wGEBVs between ADG and BF on a log scale

Minimum         Maximum         Count
 1 x 10^-5       1 x 10^-4          10
 1 x 10^-6       1 x 10^-5         350
 1 x 10^-7       1 x 10^-6       4,293
 1 x 10^-8       1 x 10^-7      15,457
 0               1 x 10^-8      15,078
-1 x 10^-8       0              12,021
-1 x 10^-7      -1 x 10^-8       7,557
-1 x 10^-6      -1 x 10^-7         913
-1 x 10^-5      -1 x 10^-6          23
-1 x 10^-4      -1 x 10^-5           5


Table 9. Genes and genomic regions implicated in negative covariances between traits

Trait combination   Chromosome   Genes in regions                             Covariance      Correlation
RFI and ADFI        2            None found                                   -7.5 x 10^-9    -0.54
                    4            None found                                   -6.6 x 10^-9    -0.62
                    9            GRM5                                         -4.9 x 10^-9    -1.00
                    15           LOC100293817                                 -5.7 x 10^-9    -0.92
                    17           NCOA3                                        -4.9 x 10^-9    -0.99
RFI and ADG         7            None found                                   -2.7 x 10^-8    -0.33
                    12           None found                                   -2.2 x 10^-8    -0.56
                    15           ASB1                                         -2.6 x 10^-8    -0.85
                    X            OTC, STAT4                                   -3.7 x 10^-8    -1.00
RFI and BF          2            ANO3, SLC5A12, MUC15                         -6.5 x 10^-6    -0.92
                    12           NTN1                                         -1.4 x 10^-5    -0.98
ADFI and ADG        1            NCS1                                         -5.0 x 10^-9    -0.98
                    2            CELF5, DAPK3, NFIC, NFIX, PIAS4, SIRT6^a     -5.3 x 10^-9    -0.64
                    4            LOC401282                                    -4.6 x 10^-9    -0.82
                    5            NELL2                                        -5.7 x 10^-9    -0.75
                    12           TBCD, B3GNTL1                                -5.2 x 10^-9    -1.00
                    13           NRIP1                                        -5.6 x 10^-9    -0.99
                    16           FAT2, SLC36A1                                -5.5 x 10^-9    -0.74
ADFI and BF         4            RP1, SOX17                                   -4.1 x 10^-6    -0.99
                    12           NTN1                                         -5.9 x 10^-6    -0.84
                    16           MYO10                                        -3.5 x 10^-6    -0.99
ADG and BF          12           SSTR2, FAM104A, COG1                         -1.2 x 10^-6    -0.89
                    13           None found                                   -1.3 x 10^-6    -0.54
                    16           BASP1, LOC401177                             -1.2 x 10^-6    -0.89
                    X            CHRDL1                                       -1.1 x 10^-6    -0.93
                    unknown      EXOC4, MBD5                                  -1.5 x 10^-6    -0.67
                    unmapped     No sequence available                        -3.2 x 10^-5    -1.00

^a Selection of potentially important genes from the list of all genes identified in the region.


[Figure 1 image not reproduced: histogram with Frequency (0 to 16,000) on the vertical axis and Correlations, in bins of width 0.1 from -1 to 1, on the horizontal axis.]

Figure 1. Distribution of correlations between 5-SNP window wGEBVs between ADFI and BF. This graph is representative of overall distributions of correlations between all 6 pairs of traits based on frequencies being greatest at the extremes of the distribution.


CHAPTER 8. GENERAL CONCLUSIONS

Current topics of research point to the present being the “era of SNPs”, but with ongoing advances in genomic technologies, it is easy to see that this era will not last long. Genotyping of other types of mutations alongside SNPs and/or sequencing animals will soon replace the current practice of focusing almost exclusively on SNPs. The questions then become: how can emerging technologies be utilized to gain more information on genetic markers, how can the information we have gained from SNPs thus far best be utilized, and how can what we have learned from analyzing SNPs be translated into analysis methods for even larger-scale data analysis problems? This thesis has encompassed many methods for discovery and utilization of SNPs based on current technologies, and the next few pages describe the author’s views on how these results and others may impact the future of genetics and genomics in animal production.

MARKER DISCOVERY

Newer and cheaper sequencing technologies are creating a plethora of sequence data from a wide variety of species. How effectively this sequence data can be mined for genetic marker information will depend substantially on the population sub-structure and relatedness of the sequenced animals compared to the animals in which the genetic markers are intended for use (target population). If the available sequence information comes from animals that are related to the target population, then data mining software such as the SNPidentifier program described earlier should provide accurate predictions of genetic markers and require minimal validation. If, however, many mutations have occurred that differ between the sequenced and target populations, it will soon be more cost-effective to sequence members of the target


population instead of mining the existing sequence databases and incurring the costs of marker validation in the target population.

USEFULNESS OF CURRENT SNP RESULTS

The usefulness of SNPs in animal breeding today is primarily three-fold: understanding population history and structure, obtaining better estimates of breeding values, and understanding trait biology. When considering the newly adopted methods for genomic selection, not much overlap currently exists between the latter two uses, and the author does not foresee large changes in this division in the very near future. Admittedly, genomic selection uses whole-genome biological information when determining which phenotypes to measure and how to measure them, as well as for estimating genetic and phenotypic variances for traits. On an individual SNP basis, however, genomic selection models do not account for known functional information. Based on current prices and technologies, the statistical ‘black-box’ of genomic selection looks very promising for increasing rates of response to selection. Meanwhile, the biological understanding of traits based on individual

SNP markers’ effects should be useful for informing management decisions and assessing phenotypes, but will be less directly useful for selection decisions in populations where genomic selection is a financially viable option.

From the standpoint of conservation of genetic diversity and minimization of inbreeding, it is useful to know the true population structure. SNP markers allow breeders to accurately trace relationships between animals and can thus be useful for deciding which animals to preserve or mate when attempting to maintain maximal genetic diversity (Oliehoek et al. 2006). In the case of the Kenyan dairy animals, encouraging breeding of animals from smaller scale farms to cattle from larger operations should help to reduce inbreeding levels in the larger operations. Focus should also be placed on identifying animals of native African descent, which appeared to be lacking in the study population, for conservation purposes and to evaluate their relative performance to determine if breed introgression would be beneficial.

In terms of computing breeding values, the genetic architecture of the trait of interest must be considered. For phenotypes that are controlled by a very limited number of mutations, as polydactyly probably is in pigs, SNP markers could be very useful for informing selection and/or mating decisions. In order to eliminate the occurrence of single gene defects, SNP markers or other similar genetic markers could easily and cheaply be used to avoid matings that have a chance of producing affected offspring once a causative mutation is identified. Although statistical methods alone can help to narrow the region of the genome in which the causative mutation could be located, gene function information and additional sequencing are usually necessary to identify the precise causative mutation. In the case of the polydactyl phenotype, statistical evidence provided in chapter five indicated that a mutation exists somewhere in the 25 Mb stretch described on chromosome 8, but more information about gene function is needed to make an educated guess about which gene or genes are involved. The incomplete penetrance that was observed when this trait was modeled as a single-gene recessive defect suggests that another mutation may be acting with the recessive mutation or a strong environmental influence is involved.

For quantitative traits, genomic selection is showing significant potential to improve the success of selection (Meuwissen et al. 2001; VanRaden et al. 2009). The cost of high-density SNP panels currently makes this practice prohibitive in species or breeding operations where animals are not individually valuable enough to justify the genotyping costs. To overcome this problem, Habier et al. (2009) suggested genotyping selection candidates with low-density panels of evenly-spaced SNPs to impute high-density SNP genotypes. They showed through simulation that for single trait selection, it is advantageous to focus on SNPs which have the largest contribution to the trait of interest. As they discuss, the advantage of utilizing small panels of SNPs that are particular to a single trait goes away when multiple trait selection is used. Considering that animal breeders in industry almost always use multiple trait selection, focusing on SNPs based on their biological significance to a trait appears to be less pertinent than simply using evenly spaced SNPs across the genome with imputation to the high-density SNP panel genotypes and creating one low-density SNP chip to select for all traits (Habier et al. 2009). The roles of biological information in genomic selection currently are in determination of phenotypes and use of genetic and phenotypic variance estimates as prior information for the Bayesian models or to combine GEBVs into a selection index. These methods do not currently utilize knowledge about the function of individual SNPs or genes, though this information could be useful for setting prior distributions in the future.

In the short term, breeders who are not ready to adopt genomic selection may find some benefit from using individual SNPs in marker-assisted selection (MAS), but as genotyping and sequencing costs decrease over the longer term, this is not going to be the best solution. As discussed in chapter one, MAS has the drawback of focusing on markers with large effects, which tend to be rare within the genome and thus contribute only a small fraction of the total genetic variance (Dekkers and Hospital 2002). Genomic selection accounts for more of the genetic variance in traits than traditional MAS. Thus, the accuracy of the GEBV is increased when compared to using a few markers in MAS, especially at a young age, which increases selection response, decreases generation intervals, and reduces inbreeding levels (Dekkers 2007). Consequently, as costs for implementing genomic selection decrease, the advantages of MAS will disappear. However, for species where little genomic information is available and animals are not very valuable individually, such as shrimp, MAS may be the best short-term option.

Studies are underway that utilize the Pacific white shrimp linkage map to map disease traits in this species (Z-Q Du, personal communication), which should prove useful in MAS while more advanced methods are being developed.

Biological information can be useful for improving management practices, such as deciding when to cull sows based on old age or what feed additives could act on genes known to improve feed efficiency. Many of the studies of production animal genetics today focus on the application of results to improving selection practices, but these results in many cases could also be useful for creating environments adapted to the animals’ genetic tendencies.

APPLICABILITY OF CURRENT METHODS TO FUTURE TECHNOLOGIES

As marker panels become denser and/or genome sequence information becomes cheaper to obtain, more causative mutations may be discovered which could prove useful for both selection and improving biological understanding. However, most of the mutations with very large effects are likely either already known or very rare in the populations that are currently under study. Deep sequence data would allow researchers to detect rare mutations, epigenetic alterations, and mutations that occur between parents and offspring, giving a more thorough understanding of the biology underlying inheritance.


As more is understood about the genetic bases of phenotypes, this information could be used to make biology more of a predictive science. Understanding more about gene functions and transcription and translation regulation will allow researchers to predict the outcome of novel changes to one or more base pairs of DNA. Such understanding will enhance genetic modification processes by making the testing of thousands of novel mutations or gene insertions obsolete. This additional knowledge will also aid disease treatment and prevention in humans and eventually livestock. Knowing which individuals are more susceptible to various diseases means that more preventative actions can be taken in advance. Some day, similar approaches will probably be used in livestock to treat animals with vaccines and other medications that are expected to work in concert with each animal’s genetics. Finally, being able to better predict phenotypes from genotypes may play a role in how animals are selected in the future. The genomic selection methods of today are the beginning of an era where more selection may be based on individual genes as they contribute to genotype, and adding biological information about genes and their regulation will likely only increase the uses of genotypes for selection.

Overall, SNP data is useful today and will continue to be used in the animal production industry for many years. As other types of genetic information, such as copy number variants, epigenetic markers and sequence data, come of age, SNPs will become just one member of a large family of genetic mutations we are trying to interpret. Their bi-allelic nature makes them an easy type of marker for starting to learn how to analyze the vast amounts of genomic data that are coming. To fully use this information, though, an iterative process needs to be developed whereby animal merit predictions from genetic information inform our biological understanding of traits, and that biological information is used to train the next set of breeding value predictions.

REFERENCES

Dekkers JCM, 2007. Prediction of response to marker-assisted and genomic selection using selection index theory. J Anim Breed Genet. 124(6):331-341.

Habier D, Fernando RL, Dekkers JCM, 2009. Genomic selection using low-density marker panels. Genetics. 182:343-353.

Meuwissen THE, Hayes BJ, Goddard ME, 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 157:1819-1829.

Oliehoek P, Windig J, van Arendonk J, Bijma P, 2006. Estimating relatedness between individuals in general populations with a focus on their use in conservation programs. Genetics. 173:483-496.

VanRaden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, Schenkel FS, 2009. Reliability of genomic predictions for North American Holstein bulls. J Dairy Sci. 92(1):16-24.


APPENDIX A. A GENE-BASED SNP LINKAGE MAP FOR PACIFIC WHITE SHRIMP, LITOPENAEUS VANNAMEI

A paper published in Animal Genetics. Volume 41(3):286-294.

Z.-Q. Du, D. C. Ciobanu, S. K. Onteru, D. Gorbach, A. J. Mileham, G. Jaramillo, M. F. Rothschild

SUMMARY

Pacific white shrimp (Litopenaeus vannamei) are of particular economic importance to the global shrimp aquaculture industry. However, limited genomics information is available for the penaeid species. We utilized the limited public information available, mainly single nucleotide polymorphisms (SNPs) and expressed sequence tags, to discover markers for the construction of the first SNP genetic map for Pacific white shrimp. In total, 1344 putative SNPs were discovered, and out of 825 SNPs genotyped, 418 SNP markers from 347 contigs were mapped onto 45 sex-averaged linkage groups, with approximate coverage of 2071 and 2130 cM for the female and male maps, respectively. The average squared correlation coefficient (r²), a measure of linkage disequilibrium, for markers located more than 50 cM apart on the same linkage group was 0.15. Levels of r² increased with decreasing inter-marker distance from ~80 cM, and increased more rapidly below ~30 cM. A QTL for shrimp gender was mapped on linkage group 13. Comparative mapping to model organisms, Daphnia pulex and Drosophila melanogaster, revealed extensive rearrangement of genome architecture for L. vannamei, and that L. vannamei was more closely related to Daphnia pulex. This SNP genetic map lays the foundation for future shrimp genomics studies, especially the identification of genetic markers or regions for economically important traits.


INTRODUCTION

The shrimp/prawn industry is expanding with annual product volume tripling and economic values increasing 1.7 times in the past decade (1996-2005, FAO Fisheries and Aquaculture Statistics). However, production is unstable because of environmental problems and disease challenges, which make the selective breeding of specific-pathogen-free and fast-growing shrimp strains a high priority.

The Pacific white shrimp (Litopenaeus vannamei) represents a large proportion of the world shrimp aquaculture industry. Traditional genetic evaluations for growth rate and pond survival using only phenotypic records have already been practiced in several specific genetic improvement programmes (Cuzon et al. 2004; De Donato et al. 2005; Gitterle et al. 2005a,b; Castillo-Juárez et al. 2007; Ibarra & Famula 2008). However, while this type of evaluation can address performance traits with moderate to high heritabilities, it is unsuitable for low-heritability traits, such as disease resistance. The complex interplay of negative genetic correlations between disease resistance and growth and performance traits could make genetic improvement difficult (Gitterle et al. 2006a,b; Cock et al. 2009). Advances in livestock genomics and breeding have opened up new avenues and opportunities, and the advent of genomic selection heralds a bright future for the improvement of agriculturally important traits (Meuwissen et al. 2001; Goddard & Hayes 2009).

The Pacific white shrimp has 44 pairs of chromosomes (2n = 88) and a genome of ~2.5 × 10⁹ base pairs (Chow et al. 1990). Unfortunately, very limited genome sequence information is available for penaeid shrimp species (Tassanakajon et al. 2006; Alcivar-Warren et al. 2007b; Robalino et al. 2007). Several preliminary genetic maps for Pacific white shrimp have already been published, mostly composed of amplified fragment length polymorphisms and a small number of microsatellite markers (Meehan et al. 2003; Pérez et al. 2005; Alcivar-Warren et al. 2006, 2007a; Dhar et al. 2007; Garcia & Alcivar-Warren 2007; Zhang et al. 2007), and genetic maps for other penaeid shrimp species, e.g. Penaeus monodon (Wilson et al. 2002; Klinbunga et al. 2006; Maneeruttanarungroj et al. 2006; Preechaphol et al. 2007; Staelens et al. 2008), Penaeus chinensis (Li et al. 2006b) and Penaeus japonicus (Moore et al. 1999; Li et al. 2003) have also been published. These genetic maps show that penaeid shrimp could have a ZW/ZZ sex determination system (Zhang et al. 2007; Staelens et al. 2008), and a PCR assay for sex determination has been developed for P. monodon (Staelens et al. 2008). Quantitative trait loci (QTL) for growth have also been mapped for P. japonicus, and one candidate gene located in a major QTL region was also characterized (Li et al. 2006a; Lyons et al. 2007).

Recently, single nucleotide polymorphism discovery and association analyses found several SNP markers associated with test weekly gain, grow-out survival and resistance to Taura Syndrome Virus (Ciobanu et al. 2009). Construction of a SNP genetic map can be useful in the further dissection of these and other QTL (Lander & Botstein 1989). It can also bring new genomic insights to poorly characterized species, such as the linkage disequilibrium (LD) landscape (Shifman et al. 2006; Rockman & Kruglyak 2009), comparative synteny, localization of genes and genomic sequences (Beldade et al. 2009; Kucuktas et al. 2009; Vingborg et al. 2009), and genomic sequence features associated with recombination rate (Groenen et al. 2009). The development of genomics technologies has significantly reduced the cost and increased the throughput of sequencing, SNP discovery and genotyping (Glenn et al. 2005; Snelling et al. 2005; Frazer et al. 2007; The International Hapmap Consortium 2007). Whole genome association analyses using large numbers of newly discovered SNPs are beginning to be widely applied to search for the molecular mechanisms underlying important phenotypes, e.g. human diseases and agricultural economic traits (McCarthy et al. 2008). This could in turn help to disentangle the complex interplay between genetic factors and their interaction with environmental effects, to design novel diagnostic tools, and to incorporate marker information in breeding programs of important agricultural species.

Discovery of large numbers of new genetic markers and the construction of a dense genetic map of the shrimp genome using both available and newly discovered markers would be of great value for future breeding programmes. In the process of our large-scale discovery of SNPs and construction of a linkage map, we combined bioinformatics and molecular technologies and tools to mine and annotate all the available public nucleotide sequence data relevant to Pacific white shrimp.

MATERIALS AND METHODS

L. vannamei resource families

The resource family was produced by Shrimp Improvement Systems using a standard F2-design. The F1 animals were generated by crossing males from one shrimp line (red) with females from another line (yellow). Crosses between six F1 animals produced three F2 families. A total of 144 animals were used in the construction of the map: 43 (22♀/21♂), 44 (21♀/23♂) and 43 (20♀/23♂) shrimp from the three F2 families, eight grandparents and six F1 parents. Individual tissue samples were collected from all animals and stored at -80°C. Genomic DNA was extracted using the DNeasy Midi kit (Qiagen), according to the manufacturer’s protocol with slight modifications.


SNPs

Previously identified SNPs (Ciobanu et al. 2009) were used along with novel SNPs identified as follows. A new computational pipeline (SNPidentifier) was designed to predict SNPs using public DNA resources, mainly NCBI and the marine genomics database (http://www.marinegenomics.org) (Gorbach et al. 2009). SNPidentifier was designed for the analysis of EST sequences without accompanying chromatogram sequence quality information. For a base to be predicted as a SNP, the minor allele frequency must be greater than 0.1, the minor allele must be observed at least four times and the 15 bases on either side of the SNP must exactly match the consensus sequence.
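The three prediction criteria can be illustrated with a small Python sketch. This is not the SNPidentifier code itself; it assumes the reads of a contig are already aligned into equal-length, gap-padded strings, and it interprets the flanking rule as a per-read requirement. The function names `consensus` and `predict_snps` are hypothetical.

```python
from collections import Counter

MIN_MAF = 0.1          # minor allele frequency must exceed this value
MIN_MINOR_COUNT = 4    # minor allele must be observed at least this often
FLANK = 15             # flanking bases on each side that must match the consensus


def consensus(reads):
    """Majority base per column of equal-length, gap-padded aligned reads."""
    cons = []
    for column in zip(*reads):
        bases = [b for b in column if b != "-"]
        cons.append(Counter(bases).most_common(1)[0][0] if bases else "-")
    return "".join(cons)


def predict_snps(reads):
    """Return (position, major, minor) for columns passing all three filters."""
    cons = consensus(reads)
    snps = []
    for pos in range(FLANK, len(cons) - FLANK):
        left, right = cons[pos - FLANK:pos], cons[pos + 1:pos + FLANK + 1]
        # Assumption: a read only votes if both of its flanks match the consensus
        # exactly and it has a real base (not a gap) at the candidate position.
        clean = [r for r in reads
                 if r[pos] != "-"
                 and r[pos - FLANK:pos] == left
                 and r[pos + 1:pos + FLANK + 1] == right]
        counts = Counter(r[pos] for r in clean).most_common(2)
        if len(counts) < 2:
            continue  # monomorphic among the clean reads
        (major, _), (minor, n_minor) = counts
        if n_minor >= MIN_MINOR_COUNT and n_minor / len(clean) > MIN_MAF:
            snps.append((pos, major, minor))
    return snps
```

With ten reads of which four carry the alternative base, the column passes; with only three alternative reads, or with four alternative reads among 44, it is rejected by the count and frequency filters, respectively.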

At the beginning of this project, we could only access a moderate number of expressed sequence tags (ESTs) (25 937) from L. vannamei. These sequences were analysed with SNPidentifier to predict shrimp SNPs. All sequences potentially containing SNP markers were compared against the public database to check redundancy. Public sequences corresponding to microsatellites and candidate genes (from L. vannamei and P. monodon) were also retrieved and analysed for SNP discovery.

PCR primers were designed using Primer3 (http://frodo.wi.mit.edu/primer3/) and amplicons were limited to 200 bp. This strategy was chosen to increase the success rate of amplification, taking into account the uncertainty of the exon and intron junctions in the EST sequences. A pool of DNA from 16 other F1 animals, derived from the grandparents of the mapping families, was used for amplification and direct sequencing. Sequencing traces were clustered and checked manually by SEQUENCHER 4.8 (Gene Codes Corporation). Some predicted SNPs were immediately validated by PCR-restriction fragment length polymorphism with appropriate restriction enzymes.


SNP genotyping

The Sequenom MassArray iPLEX platform was used for SNP genotyping for all 144 animals (Du et al. 2009). In addition to the newly discovered SNPs, 232 SNPs reported by Ciobanu et al. (2009) were confirmed by sequencing in our family and successfully genotyped. Multiplex primers were designed using ASSAYDESIGN 3.1 (Table S1). One blank, two positive and two negative controls were used in each of the genotyping chips for quality control. Two rounds of SNP selection were performed to maximize the coverage of unique contigs. If, in the first round, SNPs failed to be genotyped, were not informative, showed inheritance errors or had similar problems, then other SNPs from the same contigs were chosen for the second round to increase the chance of including the contig in the map. SNP genotypes were analysed using TYPER 3.4 software, and the minor allele frequency (MAF) and deviation from Hardy-Weinberg equilibrium were estimated for each SNP.

Annotation and characterization of SNPs

Contigs built during the process of SNP prediction using SNPidentifier, and containing SNPs, were compared against the NCBI protein database using BLASTX, and annotated with the top hit having an E-value <10⁻¹⁰. Other matches to hypothetical proteins and genomic sequences were filtered out (Table S3).
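The annotation rule can be sketched in a few lines of Python. The input layout (contig identifier, hit description, E-value), the function name `annotate` and the keyword list used to recognise unwanted hits are all assumptions made for illustration; the text does not say whether uninformative hits were removed before or after the top hit was chosen, and this sketch removes them first.

```python
E_VALUE_CUTOFF = 1e-10
# Hypothetical keywords for hits treated as uninformative.
EXCLUDED_TERMS = ("hypothetical protein", "genomic")


def annotate(hits):
    """Map each contig to the description of its best acceptable BLASTX hit.

    hits: iterable of (contig_id, description, e_value) tuples.
    """
    best = {}
    for contig, description, e_value in hits:
        if e_value >= E_VALUE_CUTOFF:
            continue  # not significant enough
        if any(term in description.lower() for term in EXCLUDED_TERMS):
            continue  # uninformative match
        if contig not in best or e_value < best[contig][1]:
            best[contig] = (description, e_value)
    return {contig: desc for contig, (desc, _) in best.items()}
```

Contigs whose hits all fail the cutoff or the keyword filter simply do not appear in the returned dictionary, which corresponds to leaving them unannotated.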

SNPs were classified into the same five categories used by Moen et al. (2008) based on manual inspection of Sequenom output: (i) normal: polymorphic, reliably scored and non-duplicated; (ii) multiple sequence variant: sequences carrying the SNPs were likely duplicated with polymorphism at one or both loci (characterized by heterozygote excess, more than one cluster of heterozygotes, and presence of homozygotes); (iii) paralogous sequence variant: duplicated sequences carrying the SNPs where the paralogs are fixed for the opposite allele (no homozygotes); (iv) all homozygous: all animals were homozygous; and (v) failed: poor clustering of genotypes and/or unreliably scored genotypes.

Genetic map construction

After performing quality control, the genotyping data were formatted and exported into CRIMAP (Green et al. 1990) for the construction of a genetic linkage map. Marker grouping was done at a minimum LOD score of 3.0 using the twopoint option in CRIMAP. Markers were added into the linkage group using the CRIMAP all option. Marker order was refined using the CRIMAP flips function, with a moving window of seven markers (flips7). Finally, the Kosambi mapping function was used to build the final maps, using the fixed function of CRIMAP. Sex-averaged maps were constructed from sex-specific homologous linkage groups. The map was drawn in MAPCHART (Voorrips 2002). The chrompic function of CRIMAP was used to count the number of recombination events per meiosis. To estimate double recombinant events for each linkage group, which may indicate genotyping errors, the number and location of recombination events were determined for non-overlapping windows of five markers using the chrompic function in CRIMAP. SNPs located within the same contig were initially not combined into haplotypes, so that genotyping and mapping quality could be checked. After the order of markers in the same contig was determined, we combined them into one haplotype marker. Linkage groups were numbered according to their lengths. In the cases where two unlinked female segments corresponded to a single male homologue X, the segments were designated Xa and Xb to indicate that they were likely part of the same linkage group (and vice versa for unlinked male segments corresponding to a single female homologue).
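Two quantities named in this procedure, the two-point LOD score and the Kosambi map distance, have simple closed forms that can be sketched in Python. This is only an illustration for the idealised case of phase-known, fully informative meioses; CRIMAP's multipoint likelihood calculations are more general, and the function names here are hypothetical.

```python
import math


def two_point_lod(recombinants, meioses):
    """LOD score at the maximum-likelihood recombination fraction (capped at 0.5)."""
    theta = min(recombinants / meioses, 0.5)
    non_recombinants = meioses - recombinants
    log_likelihood = 0.0
    if recombinants and theta > 0:
        log_likelihood += recombinants * math.log10(theta)
    if non_recombinants:
        log_likelihood += non_recombinants * math.log10(1.0 - theta)
    # Null hypothesis: free recombination (theta = 0.5).
    return log_likelihood - meioses * math.log10(0.5)


def kosambi_cm(theta):
    """Kosambi map distance in cM for a recombination fraction 0 <= theta < 0.5."""
    return 25.0 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))


def kosambi_theta(cm):
    """Inverse of kosambi_cm: recombination fraction for a distance in cM."""
    return 0.5 * math.tanh(2.0 * cm / 100.0)
```

For example, ten informative meioses without a single recombinant give a LOD score just above the 3.0 grouping threshold, and a recombination fraction of 0.1 corresponds to roughly 10.1 cM under the Kosambi function.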


Linkage disequilibrium status

The MAF and Hardy-Weinberg equilibrium values were calculated using the package “Genetics” in R (http://www.r-project.org). Segregation distortion for each SNP was estimated based on a chi-squared test for the genotype frequency distribution (1:1 or 1:2:1). The LD status in the F2 generation was calculated, and SNPs with MAF <0.10 and deviating from Hardy-Weinberg equilibrium (P < 0.01) were culled.
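The filters described here can be sketched from genotype counts in plain Python. This is a stand-in for illustration, not the routines of the R package: it uses a Pearson chi-squared test for Hardy-Weinberg equilibrium (one degree of freedom) and for the 1:1 or 1:2:1 segregation ratios, and the function names are hypothetical.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail chi-squared probability, using closed forms for df = 1 or 2."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only supports df = 1 or 2")


def maf_and_hwe(n_aa, n_ab, n_bb):
    """Return (MAF, chi-squared statistic, P-value) for one bi-allelic SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return min(p, q), stat, chi2_sf(stat, 1)


def segregation_distortion(observed, ratio):
    """Chi-squared test of observed genotype counts against e.g. (1, 1) or (1, 2, 1)."""
    total, ratio_sum = sum(observed), sum(ratio)
    expected = [total * r / ratio_sum for r in ratio]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, chi2_sf(stat, len(ratio) - 1)
```

A SNP would then be flagged when `maf_and_hwe` returns a MAF below 0.10 or a P-value below 0.01, depending on how the two culling criteria are combined.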

QTL mapping for gender

QTL mapping for shrimp gender treated as a binary trait was performed using the F2-design module in QTLexpress (Seaton et al. 2002). QTL were searched on a 1-cM grid, fitting only the additive effect of the QTL into the regression model. The permutation procedure was performed using 2000 iterations to determine the chromosome-wide significance level.

Comparative gene mapping

EST sequences from SNP discovery were used as queries to blast against the Drosophila melanogaster (fruitfly) and Daphnia pulex (water flea) sequence databases (BLASTX, E-value <10⁻⁵), because no genome sequence of a closely related decapod was available. Daphnia pulex is the first sequenced crustacean, and the Daphnia genome database (http://wfleabase.org/) and JGI Genome Portal (http://genome.jgi-psf.org/Dappu1/Dappu1.home.html) were used. Positive hits were recorded and listed in Table S4.

RESULTS AND DISCUSSION

SNP discovery and genotyping

We discovered 1344 putative SNPs by assembling 357 unique genomic contigs, mostly derived from EST sequences deposited in the public databases (74.79%). This included 244 SNPs identified using candidate gene cDNA sequences as well as simple repeat sequences (25.21%). SNPs discovered in microsatellite DNA sequences can also be used to integrate microsatellite-based genetic maps into a common framework in the future.

In total, 825 SNPs (including 232 from Ciobanu et al. 2009) were selected for genotyping in the shrimp resource family on the Sequenom platform (Table S2). The genome of Pacific white shrimp appears to have high rates of duplication, which was indicated by a significant number of paralogous and multiple sequence variations, 7.14% and 3.44% respectively. Also, 11.53% of the markers failed genotyping and 17.91% were homozygous for all animals. In all, 453 SNPs were successfully genotyped and were informative in at least one F2 family, and these were used for the construction of a genetic linkage map.

Linkage map

Of all 453 SNPs, 418 were incorporated into a linkage map (only linkage groups containing ≥3 markers are listed in Table 1, Figure 1). These SNPs belonged to 347 contigs, 67 of them having two SNPs genotyped, two having three SNPs genotyped and the remaining 278 having one SNP. Furthermore, 172 SNPs (49.57%) were annotated by BLASTX (Table S3). During the construction of the genetic map, SNPs from the same contig were not combined into haplotypes, in order to check the genotyping and mapping quality. All these SNPs were located in 69 contigs mapped next to each other. The sex-averaged map has 418 markers grouped onto 45 linkage groups, with group lengths ranging from 0 to 171.3 cM and a total coverage of 2262.3 cM (Table 1). The grandparental genotyping data for the resource families help in the determination of the marker linkage phase and the precise ordering of genetic markers.


Differences in recombination between sexes

The male and female SNP maps have 48 and 45 linkage groups (≥3 markers) respectively (Table 1), which is close to the 44 chromosome pairs expected from the karyotype of 2n = 88 (Chow et al. 1990). The male map was 2130.2 cM with 413 markers, whereas the female map was slightly shorter at 2071.1 cM with 418 markers. The lengths of linkage groups ranged from 0 to 130.9 cM for the male map, and from 0 to 196.6 cM for the female map (Table 1). For the most part, the male and female maps had the same linkage groups of markers and were in the same orientation. However, markers on female linkage groups 1, 3 and 6 were broken apart into two male linkage groups.

The coverage of the male and female maps was about 48.27% and 37.25% of the estimated shrimp genome length respectively (Zhang et al. 2007), indicating that considerable future improvement is required. The female map is slightly shorter than the male map (nearly the same number of SNP markers), which supports the proposed ZW sex determination system for penaeid shrimp (Staelens et al. 2008).

Linkage disequilibrium

We evaluated the status of LD in the F2 families, using the squared correlation coefficient (r²) between pairs of SNPs with MAF >0.10. The average r² for markers located more than 50 cM apart on the same linkage group was 0.15. Levels of r² increased with decreasing inter-marker distance from ~80 cM and increased more rapidly below 30 cM (Fig. 2). With relatively high LD covering a certain genomic region, it should be easier to identify genetic loci or regions associated with quantitative traits.


Quantitative trait loci for shrimp gender

We treated sex phenotype as a binary trait and performed the QTL mapping. A significant QTL was mapped onto linkage group 13 at 59 cM (P < 0.05), and a suggestive QTL was mapped at 31 cM on linkage group 8. A genetic marker that differentiates sex has been identified in P. monodon (Staelens et al. 2008). However, the same marker was not polymorphic in our resource L. vannamei population (Du Z. Q. and Rothschild M. F., unpublished data). Therefore, we could not investigate whether the same linkage group (potentially the sex chromosome) in P. monodon corresponds to that in L. vannamei.

Genetic maps are very useful to identify QTL for important traits, such as shrimp growth rate and disease resistance. Information collected from other studies could also be combined in the QTL mapping process. On linkage groups 25, 26 and 35, three SNPs (accession numbers: 142459553, 142459445 and 142459495), corresponding to SYG442, SYG329 and SYG379 (Ciobanu et al. 2009), were found to be significantly associated with test weekly gain, grow-out survival and survival following a challenge with Taura Syndrome Virus. Future studies could be directed towards these regions for further dissection of the underlying genetic causal variants.

Comparative gene mapping

Comparative genomics can provide valuable information about the architecture and functional organization of the genome, especially in species without genome sequences. Our analysis shows that some L. vannamei genes are more closely related to their orthologs in the crustacean Daphnia pulex, whereas others are more closely related to orthologs in Drosophila melanogaster. Half of the unique ESTs/contigs (172) seem specific to L. vannamei, as no hits were found in the two model organisms (Table S3).


Selenoprotein M on linkage group 16, chloride channel calcium activated 6 on linkage group 24, and a gene encoding a novel protein similar to vertebrate sec1 family domain containing 2 (SCFD2) on linkage group 31, were found in L. vannamei and Daphnia pulex, but not in Drosophila melanogaster (Table S4). This is not unexpected, as L. vannamei and Daphnia pulex are both crustaceans. In Drosophila melanogaster, other members of these gene families could instead compensate for or perform the necessary functions (Driscoll & Copeland 2003; Demuth & Hahn 2009). No large blocks of conserved synteny were observed among L. vannamei, Daphnia pulex and Drosophila melanogaster (Table S4). This could be explained either by the limited density and coverage of the genetic map of L. vannamei, or evolutionarily by the fact that L. vannamei is relatively distant from Daphnia pulex. The ongoing sequencing project for more closely related amphipod crustaceans (Jassa slatteryi and Parhyale hawaiensis) should produce more useful genome information (Stillman et al. 2008).

Although the coverage and density of the first SNP genetic map of Pacific white shrimp need further improvement, it should be useful as a tool for shrimp biology and genomic studies. With the advancement of next-generation sequencing and genotyping technologies, the exploration of the shrimp genome should provide more valuable information and improve the production and sustainability of the shrimp aquaculture industry.

ACKNOWLEDGEMENTS

We appreciate the funding from CP Indonesia, Shrimp Improvement Systems, Gold Dragon Research and State of Iowa and Hatch Funding. Technical help from Kim Glenn and Heidi Bruns in Dr. Rothschild’s lab group and Dr. Lu Gao in Dr. Patrick Schnable’s group is also appreciated.


REFERENCES

Alcivar-Warren A., Meehan-Meola D., Wang Y. et al. (2006) Isolation and mapping of

telomeric pentanucleotide (TAACC)n repeats of the Pacific whiteleg shrimp, Penaeus

vannamei, using fluorescence in situ hybridization. Marine Biotechnology (NY) 8,

467-80.

Alcivar-Warren A., Meehan-Meola D., Park S.W., Xu Z., Delaney M. & Zuniga G. (2007a)

SHRIMPMAP: a low-density, microsatellite-based linkage map of the pacific

whiteleg shrimp, Litopenaeus vannamei: identification of sex-linked markers in

linkage group 4. Journal of Shellfish Research 26, 1259–77.

Alcivar-Warren A., Song L., Meehan-Meola D., Xu Z., Xiang J.-H. & Warren W. (2007b)

Characterization and mapping of expressed sequence tags isolated from a subtracted

cDNA library of Litopenaeus vannamei injected with white spot syndrome virus.

Journal of Shellfish Research 26, 1247–58.

Beldade P., Saenko S.V., Pul N. & Long A.D. (2009) A gene-based linkage map for Bicyclus

anynana butterflies allows for a comprehensive analysis of synteny with the

lepidopteran reference genome. PLoS Genetics 5, e1000366.

Castillo-Juárez H., Casares J.C.Q., Campos-Montes G., Villela C.C., Ortega A.M. &

Montaldo H.H. (2007) Heritability for body weight at harvest size in the Pacific white

shrimp, Penaeus (Litopenaeus) vannamei, from a multi-environment experiment

using univariate and multivariate animal models. Aquaculture 273, 42-9.

Chow S., Dougherty W.J. & Sandifer P.A. (1990) Meiotic chromosome complements and

nuclear DNA contents of four species of shrimps of the Genus Penaeus. Journal of

Crustacean Biology 10, 29-36.


Ciobanu D.C., Bastiaansen J.W.M., Magrin J. et al. (2009) A major SNP resource for

dissection of phenotypic and genetic variation in Pacific white shrimp (Litopenaeus

vannamei). Animal Genetics, doi:10.1111/j.1365-2052.2009.01961.x.

Cock J., Gitterle T., Salazar M. & Rye M. (2009) Breeding for disease resistance of Penaeid

shrimps. Aquaculture 286, 1–11.

Cuzon G., Lawrence A., Gaxiola G., Rosas C. & Guillaume J. (2004) Nutrition of

Litopenaeus vannamei reared in tanks or in ponds. Aquaculture 235, 513–51.

De Donato M., Manrique R., Ramirez R., Mayer L. & Howell C. (2005) Mass selection and

inbreeding effects on a cultivated strain of Penaeus (Litopenaeus) vannamei in

Venezuela. Aquaculture 247, 159–67.

Demuth J.P. & Hahn M.W. (2009) The life and death of gene families. BioEssays 31, 29–39.

Dhar A.K., Meehan-Meola D., Breland V., Carr W.H., Lotz J.M. & Alcivar-Warren A.

(2007) Linkage mapping of developmentally expressed cDNAs containing AT-rich

elements in shrimp (Litopenaeus vannamei) – differential expression after challenge

with Taura Syndrome virus (TSV). Journal of Shellfish Research 26, 1217–24.

Driscoll D.M. & Copeland P.R. (2003) Mechanism and regulation of selenoprotein synthesis.

Annual Review of Nutrition 23, 17–40.

Du Z.Q., Zhao X., Vukasinovic N., Rodriguez F., Clutter A.C. & Rothschild M.F. (2009)

Association and haplotype analyses of positional candidate genes in five genomic

regions linked to scrotal hernia in commercial pig lines. PLoS ONE 4, e4837.

Frazer K.A., Eskin E., Kang H.M. et al. (2007) A sequence-based variation map of 8.27

million SNPs in inbred mouse strains. Nature 448, 1050–3.

Garcia D.K. & Alcivar-Warren A. (2007) Characterization of 35 new microsatellite genetic markers for the pacific whiteleg shrimp, Litopenaeus vannamei: their usefulness for studying genetic diversity of wild and cultured stocks, tracing pedigree in breeding programs, and linkage mapping. Journal of Shellfish Research 26, 1203–16.

Gitterle T., Rye M., Salte R., Cock J., Johansen H., Lozano C., Suarez J.A. & Gjerde B.

(2005a) Genetic (co)variation in harvest body weight and survival in Penaeus

(Litopenaeus) vannamei under standard commercial conditions. Aquaculture 243, 83–

92.

Gitterle T., Salte R., Gjerde B., Cock J., Johansen H., Salazar M., Lozano C. & Rye M.

(2005b) Genetic (co)variation in resistance to White Spot Syndrome Virus (WSSV)

and harvest weight in Penaeus (Litopenaeus) vannamei. Aquaculture 246, 139–49.

Gitterle T., Gjerde B., Cock J., Salazar M., Rye M., Vidal O., Lozano C., Erazo C. & Salte R.

(2006a) Optimization of experimental infection protocols for the estimation of genetic

parameters of resistance to White Spot Syndrome Virus (WSSV) in Penaeus

(Litopenaeus) vannamei. Aquaculture 261, 501–9.

Gitterle T., Ødegard J., Gjerde B., Rye M. & Salte R. (2006b) Genetic parameters and

accuracy of selection for resistance to White Spot Syndrome Virus (WSSV) in

Penaeus (Litopenaeus) vannamei using different statistical models. Aquaculture 251,

210–8.

Glenn K.L., Grapes L., Suwanasopee T., Harris D.L., Li Y., Wilson K. & Rothschild M.F.

(2005) SNP analysis of AMY2 and CTSL genes in Litopenaeus vannamei and

Penaeus monodon shrimp. Animal Genetics 36, 235–6.

Goddard M.E. & Hayes B.J. (2009) Mapping genes for complex traits in domestic animals

and their use in breeding programmes. Nature Review Genetics 10, 381–91.


Gorbach D.M., Hu Z.L., Du Z.Q. & Rothschild M.F. (2009) SNP discovery in Litopenaeus

vannamei with a new computational pipeline. Animal Genetics 40, 106–9.

Green P., Falls K. & Crooks S. (1990) Documentation for CRIMAP, Version 2.4.

Washington University School of Medicine, St Louis, MO.

Groenen M.A., Wahlberg P., Foglio M. et al. (2009) A high-density SNP-based linkage map

of the chicken genome reveals sequence features correlated with recombination rate.

Genome Research 19, 510–9.

Ibarra A.M. & Famula R. (2008) Genotype by environment interaction for adult body

weights of shrimp Penaeus vannamei when grown at low and high densities. Genetics

Selection Evolution 40, 541–51.

Klinbunga S., Preechaphol R., Thumrungtanakit S., Leelatanawit R., Aoki T., Jarayabhand P.

& Menasveta P. (2006) Genetic diversity of the giant tiger shrimp (Penaeus

monodon) in Thailand revealed by PCR-SSCP of polymorphic EST-derived markers.

Biochemical Genetics 44, 222–36.

Kucuktas H., Wang S., Li P. et al. (2009) Construction of genetic linkage maps and

comparative genome analysis of catfish using gene-associated markers. Genetics 181,

1649–60.

Lander E.S. & Botstein D. (1989) Mapping mendelian factors underlying quantitative traits

using RFLP linkage maps. Genetics 121, 185–99.

Li Y., Byrne K., Miggiano E., Whan V., Moore S., Keys S., Crocos P., Preston N. & Lehnert

S. (2003) Genetic mapping of the Kuruma Prawn Penaeus japonicus using AFLP

markers. Aquaculture 219, 143–56.

Li Y., Dierens L., Byrne K., Miggiano E., Lehnert S., Preston N. & Lyons R. (2006a) QTL detection of production traits for the Kuruma prawn Penaeus japonicus (Bate) using AFLP markers. Aquaculture 258, 198–210.

Li Z., Li J., Wang Q., He Y. & Liu P. (2006b) AFLP-based genetic linkage map of marine

shrimp Penaeus (Fenneropenaeus) chinensis. Aquaculture. 261, 463–72.

Lyons R.E., Dierens L., Preston N.P., Crocos P., Coman G. & Li Y. (2007) Identification and

characterization of QTL markers for growth traits in Kuruma shrimp P. japonicus.

Aquaculture 272, S284–5.

Maneeruttanarungroj C., Pongsomboon S., Wuthisuthimethavee S. et al. (2006)

Development of polymorphic expressed sequence tag-derived microsatellites for the

extension of the genetic linkage map of the black tiger shrimp (Penaeus monodon).

Animal Genetics 37, 363–8.

McCarthy M.I., Abecasis G.R., Cardon L.R., Goldstein D.B., Little J., Ioannidis J.P. &

Hirschhorn J.N. (2008) Genome-wide association studies for complex traits:

consensus, uncertainty and challenges. Nature Review Genetics 9, 356–69.

Meehan D., Xu Z., Zuniga G. & Alcivar-Warren A. (2003) High frequency and large number

of polymorphic microsatellites in cultured shrimp, Penaeus (Litopenaeus) vannamei

[Crustacea: Decapoda]. Marine Biotechnology (NY) 5, 311–30.

Meuwissen T.H., Hayes B.J. & Goddard M.E. (2001) Prediction of total genetic value using

genome-wide dense marker maps. Genetics 157, 1819–29.

Moen T., Hayes B., Baranski M., Berg P.R., Kjøglum S., Koop B.F., Davidson W.S., Omholt

S.W. & Lien S. (2008) A linkage map of the Atlantic salmon (Salmo salar) based on

EST-derived SNP markers. BMC Genomics 9, 223.

Moore S.S., Whan V., Davis G.P., Byrne K., Hetzel D.J.S. & Preston N. (1999) The development and application of genetic markers for the Kuruma prawn Penaeus japonicus. Aquaculture 173, 19–32.

Pérez F., Ortiz J., Zhinaula M., Gonzabay C., Calderón J. & Volckaert F.A. (2005)

Development of EST-SSR markers by data mining in three species of shrimp:

Litopenaeus vannamei, Litopenaeus stylirostris, and Trachypenaeus birdy. Marine

Biotechnology (NY) 7, 554–69.

Preechaphol R., Leelatanawit R., Sittikankeaw K., Klinbunga S., Khamnamtong B.,

Puanglarp N. & Menasveta P. (2007) Expressed sequence tag analysis for

identification and characterization of sex-related genes in the giant tiger shrimp

Penaeus monodon. Journal of Biochemistry and Molecular Biology 40, 501–10.

Robalino J., Almeida J.S., McKillen D. et al. (2007) Insights into the immune transcriptome

of the shrimp Litopenaeus vannamei: tissue-specific expression profiles and

transcriptomic responses to immune challenge. Physiological Genomics 29, 44–56.

Rockman M.V. & Kruglyak L. (2009) Recombinational landscape and population genomics

of Caenorhabditis elegans. PLoS Genetics 5, e1000419.

Seaton G., Haley C.S., Knott S.A., Kearsey M. & Visscher P.M. (2002) QTL Express:

mapping quantitative trait loci in simple and complex pedigrees. Bioinformatics 18,

339–40.

Shifman S., Bell J.T., Copley R.R., Taylor M.S., Williams R.W., Mott R. & Flint J. (2006) A

high-resolution single nucleotide polymorphism genetic map of the mouse genome.

PLoS Biology 4, e395.

Snelling W.M., Casas E., Stone R.T., Keele J.W., Harhay G.P., Bennett G.L. & Smith T.P.

(2005) Linkage mapping bovine EST-based SNP. BMC Genomics 6, 74.

Staelens J., Rombaut D., Vercauteren I., Argue B., Benzie J. & Vuylsteke M. (2008) High-

density linkage maps and sex-linked markers for the black tiger shrimp (Penaeus

monodon). Genetics 179, 917–25.

Stillman J.H., Colbourne J.K., Lee C.E. et al. (2008) Recent advances in crustacean

genomics. Integrative and Comparative Biology 48, 852–68.

Tassanakajon A., Klinbunga S., Paunglarp N. et al. (2006) Penaeus monodon gene discovery

project: the generation of an EST collection and establishment of a database. Gene

384, 104–12.

The International Hapmap Consortium. (2007) A second generation human haplotype map of

over 3.1 million SNPs. Nature 449, 851–61.

Vingborg R.K., Gregersen V.R., Zhan B. et al. (2009) A robust linkage map of the porcine

autosomes based on gene-associated SNPs. BMC Genomics 10, 134.

Voorrips R.E. (2002) MAPCHART: software for the graphical presentation of linkage maps

and QTLs. The Journal of Heredity 93, 77–8.

Wilson K., Li Y., Whan V. et al. (2002) Genetic mapping of the black tiger shrimp Penaeus

monodon with amplified fragment length polymorphisms. Aquaculture 204, 297–309.

Zhang L., Yang C., Zhang Y., Li L., Zhang X., Zhang Q. & Xiang J. (2007) A genetic

linkage map of Pacific white shrimp (Litopenaeus vannamei): sex-linked

microsatellite markers and high recombination rates. Genetica 131, 37–49.

Figure 1. Sex-averaged linkage maps for Pacific white shrimp, linkage groups LG01 to LG45 (map panels not reproduced here). SNP accession numbers were used. *Another SNP from the same contig was mapped; **two other SNPs from the same contig were mapped.

Figure 2. Levels of LD between SNP pairs located on the same linkage group plotted versus genetic distance in the F2 population. SNPs with minor allele frequencies <0.10 were excluded.

Table 1. Properties of linkage groups for Pacific white shrimp.

Linkage  ♀ map length (cM)  ♂ map length (cM)  ♀:♂    Sex-average map length
group    (no. markers)      (no. markers)      ratio  (cM) (no. markers)
1a       196.6 (25)         83.7 (18)          2.35   171.3 (25)
1b                          4.6 (7)
2        118.0 (16)         130.9 (15)         0.90   155.6 (16)
3a       87.1 (9)           58.3 (6)           1.49   112.1 (9)
3b                          29.6 (3)
4        107.7 (12)         80.1 (12)          1.34   92.2 (12)
5        38.1 (15)          39.5 (11)          0.96   90.5 (15)
6a       79.6 (7)           68.3 (4)           1.17   90.5 (7)
6b                          33.8 (3)
7        96.0 (10)          86.2 (10)          1.11   87.7 (10)
8        69.2 (14)          94.3 (14)          0.73   85.3 (14)
9        81.9 (13)          88.8 (13)          0.92   82.0 (13)
10       76.4 (12)          102.4 (12)         0.75   81.6 (12)
11       96.6 (12)          71.0 (12)          1.36   81.4 (12)
12       58.1 (9)           100.6 (9)          0.58   81.1 (9)
13       66.8 (11)          78.2 (11)          0.85   71.0 (11)
14       47.1 (23)          89.3 (23)          0.53   66.2 (23)
15       57.2 (6)           53.7 (6)           1.07   64.9 (6)
16       52.3 (6)           18.1 (6)           2.89   54.6 (6)
17       32.9 (10)          73.4 (10)          0.45   51.1 (10)
18       57.4 (7)           30.9 (7)           1.86   47.9 (7)
19       48.7 (5)           24.2 (5)           2.01   45.8 (5)
20       42.6 (10)          54.0 (10)          0.79   44.9 (10)
21       34.8 (5)           44.2 (5)           0.79   41.2 (5)
22       11.5 (8)           60.4 (8)           0.19   40.6 (8)
23       45.4 (5)           35.4 (5)           1.28   39.3 (5)
24       40.6 (11)          33.2 (11)          1.22   38.8 (11)
25       54.7 (19)          27.4 (19)          2.00   38.1 (19)
26       14.2 (7)           60.7 (7)           0.23   36.2 (7)
27       70.4 (8)           13.6 (8)           5.18   36.0 (8)
28       46.7 (6)           46.9 (6)           1.00   34.2 (6)
29       51.1 (16)          19.0 (16)          2.69   32.9 (16)
30       35.6 (11)          34.2 (11)          1.04   32.3 (11)
31       0 (6)              31.4 (6)           0.00   31.4 (6)
32       16.5 (10)          51.8 (10)          0.32   29.4 (10)
33       19.0 (6)           58.4 (6)           0.33   28.0 (6)
34       13.4 (12)          29.4 (12)          0.46   26.5 (12)

Table 1. (continued)

Linkage  ♀ map length (cM)  ♂ map length (cM)  ♀:♂    Sex-average map length
group    (no. markers)      (no. markers)      ratio  (cM) (no. markers)
37       18.5 (3)           3.5 (3)            5.29   18.1 (3)
38       0.0 (3)            13.2 (3)           0.00   13.2 (3)
39       16.3 (11)          4.6 (11)           3.54   11.2 (11)
40       17.3 (3)           7.4 (3)            2.34   11.1 (3)
41       0 (3)              9.8 (3)            0.00   9.5 (3)
42       2.3 (6)            7.2 (6)            0.32   8.1 (6)
43       17.6 (5)           0.9 (5)            19.56  7.0 (5)
44       2.1 (5)            3.4 (5)            0.62   2.9 (5)
45       0 (3)              0 (3)              -      0 (3)
Total    2071.1 (418)       2130.2 (413)       0.97   2262.3 (418)

APPENDIX B. THE PIG GENOME PROJECT HAS PLENTY TO SQUEAL ABOUT

A paper published in Cytogenetic and Genome Research. Volume 134:9-18.

B. Fan, D. M. Gorbach, M. F. Rothschild

ABSTRACT

Pig genetics and genomics research has progressed significantly in recent years owing to the integration of advanced molecular biology techniques, bioinformatics and computational biology, and the collaborative efforts of researchers in the swine genomics community. Progress on expanding the linkage map has slowed down, but the efforts have created a higher-resolution physical map integrating the clone map and BAC end

sequence. The number of QTL mapped is still growing and most of the updated QTL

mapping results are available through PigQTLdb. Additionally, expression studies using

high-throughput microarrays and other gene expression techniques have made significant

advancements. The number of identified non-coding RNAs is rapidly increasing and their exact regulatory functions are being explored. A publishable draft (build 10) of the swine

genome sequence was available for the pig genomics community by the end of December

2010. Build 9 of the porcine genome is currently available with Ensembl annotation; manual

annotation is ongoing. These drafts provide useful tools for such endeavors as comparative

genomics and SNP scans for fine QTL mapping. A recent community-wide effort to create a

60K porcine SNP chip has greatly facilitated whole-genome association analyses, haplotype

block construction and linkage disequilibrium mapping, which can contribute to whole-

genome selection. The future ‘systems biology’ that integrates and optimizes the information

from all research levels can enhance the pig community’s understanding of the full

complexity of the porcine genome. These recent technological advances and where they may

lead are reviewed.

INTRODUCTION

Archaeological evidence from figurines and bones revealed that the domestication of

the pig (Sus scrofa) occurred as early as 9,000 years ago in the Middle East, and some

evidence indicated the domestication of the pig could date back even earlier in China

[Department of Animal Science Oklahoma State University, 1995; Clutton-Brock, 1999].

After transformation of pigs from hunted wild animals to fully managed domestic ones,

relationships between humans and pigs became closer, yet more diverse and complex

[Albarella et al., 2007]. Humans have developed a variety of pig breeds for meat production

and a source of fat during the co-evolution of both species. To date, there are 541 reported

pig breeds world-wide [Rischkowsky and Pilling, 2007]; however, only 6 of them including

Large White (Yorkshire), Landrace, Duroc, Hampshire, Berkshire and Piétrain are dominant in the commercial pork industry [Rothschild and Ruvinsky, 2010].

Enormous progress has been made in swine breeding through classic genetics and quantitative genetics approaches. Many studies identified candidate genes and quantitative trait loci (QTL) significantly associated with economically important traits and some of them have been utilized in the commercial pig industry [Rothschild et al., 2007]. It is

predicted that even more progress can be achieved after pig genomics information is better

understood and then integrated into marker-assisted selection. Coordinated efforts to better understand the pig genome were initiated in the early 1990s with gene identification and mapping efforts. After entering the new millennium, a historical ‘White Paper’ outlining the

roles pigs play in agriculture and as biological models for humans was announced, with the

objective to sequence the whole swine genome [Rohrer et al., 2002]. The annotated build 9 of

the porcine genome was made available in September 2009. The publishable build 10 was

completed by the end of December 2010 with Ensembl annotation to come a few months

later. At the same time, significant progress has been made on gene expression using high-

throughput arrays and functional studies of the pig genome [Tuggle et al., 2007].

Additionally, the 60K porcine SNP chip [Ramos et al., 2009] released in January of 2009 has

facilitated whole-genome association analyses and research into whole-genome selection.

Swine genome science has come of age and the advances in swine genomics promise to

greatly benefit the pig industry and consumers in the future.

LINKAGE MAPPING AND GENE MAPPING PROGRESS

The first 2 pig linkage maps were generated by the PiGMaP and USDA-MARC genome mapping projects around 15 years ago [Archibald et al., 1995; Rohrer et al., 1996].

The USDA-MARC linkage map comprised ~1,200 microsatellite markers, and since its first publication additional genetic markers such as microsatellites, amplified fragment length polymorphisms (AFLPs), single nucleotide polymorphisms (SNPs) and expressed sequence tags (ESTs) have been identified and merged (http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9823; http://www.marc.usda.gov/genome/swine/swine.html). The ArkDB database combines information from various linkage maps, including the USDA-MARC map, and has information on nearly 1,600 genes and 3,300 markers (http://www.thearkdb.org/arkdb/do/getChromosomeDetails?accession=ARKSPC00000001).

Two recently developed SNP linkage maps include a primary map composed of

~1,800 in silico identified variants [Li et al., 2008] and a map consisting of 66 SNPs derived

from bone-related genes based on Berkshire x Yorkshire pigs [Onteru et al., 2008]. The steps

to expand the linkage map have slowed down in recent years, partially due to the advances in

the genome sequencing project and high-density porcine SNP chip development. Instead, efforts have moved to developing the pig-human comparative map and integrating the linkage, physical and cytogenetic maps.

Driven by the pig genome sequencing project, a new human-pig comparative map mainly consisting of ESTs and BAC-end sequences was constructed as a first step towards

sequencing the pig genome [Meyers et al., 2005]. A large-scale porcine BAC physical map

was furthermore developed and used for the selection of BACs to construct a minimal tiling

path for genome sequencing and targeted gap filling [Rogatcheva et al., 2008]. Eventually, an

updated physical map integrating the clone map and BAC end sequence data was reported

[Humphray et al., 2007], which not only greatly contributed to the assembly of the pig genome but also facilitated electronic positional cloning of genes and human-pig comparative genomics studies.

As one of the principal and most accurate mapping techniques, FISH is still being used for porcine gene mapping in a few studies. A 7,000-rad radiation hybrid (RH) panel

(ImpRH) [Yerle et al., 1998], which contains nearly 6,000 markers including microsatellites

and over 2,000 ESTs, and a somatic cell hybrid panel [Yerle et al., 1996] have been broadly

utilized for gene mapping in recent years. An advanced 12,000-rad RH map was also developed, and it allows for even greater precision for mapping within and across species

[Yerle et al., 2002]. The utilization of these valuable tools has led to the rapid increase in the number of genes mapped to pig chromosomes during the last decade.

QTL AND CANDIDATE GENES

Cost-effective pork production requires an efficient growth rate, a low feed-to-gain ratio, high levels of reproductive success and survivability, and good pig health. Consumers are also looking for pork products with good carcass merit and high meat quality. Using pig resource families founded by commercial and exotic pig breeds,

researchers have identified quantitative trait loci (QTL) affecting these traits. A large number

of QTL have been reported on nearly all chromosomes for growth, carcass and meat quality

traits and several chromosomes for reproduction and feet and leg structure. The QTL

affecting immune response and disease resistance traits are far less numerous, but this is an area where gene expression studies are particularly valuable. An extensive summary of pig

QTL is available at the PigQTLdb (http://www.animalgenome.org/QTLdb/pig.html) [Hu et al., 2005], which combines all the published QTL information, integrates linkage and physical maps, and is an excellent repository for updated QTL mapping results. Currently, there are 6,344 QTL reported in the PigQTLdb, which represent 593 different traits.

Some valuable lessons learned from recent QTL mapping studies include: (1) Many traits result from a complex interaction of several different factors. A QTL study that focuses on a single phenotype measurement may result in an incomplete understanding of the multitude of factors that combine to produce the trait of interest. Instead, a systematic study of factors that may contribute to the analyzed trait can help illustrate the underlying causes of

a specific phenotype with better accuracy. For example, ovulation rate, follicle-stimulating

hormone levels, uterine length, feed intake and environmental factors may add value to

QTL affecting litter size. (2) Joint and QTL analyses which combine existing results from

different studies may better utilize resources. Due to funding limitations, sample sizes are

frequently smaller than ideal. By combining the data from multiple QTL studies, possible

experimental limitations can be corrected while increasing the power for detecting real QTL

[Kim et al., 2005; Muñoz et al., 2009]. Development of new approaches will also improve

multi-factorial and complex trait analysis for QTL [Rothschild et al., 2007].

With the identification of a large number of pig SNPs and the application of high-

throughput genotyping platforms, fine mapping within specific QTL areas is being carried

out. On the end of SSC17 where QTL for meat quality traits were mapped, around 50

additional SNPs were mined. Combined linkage disequilibrium association and QTL analyses found several genes, such as CTSZ and AURKA, to be significantly associated with meat color and lactate-related traits [Ramos, 2006]. On SSC2, fine mapping and haplotype analyses confirmed that the CAST gene was one of the important candidate genes influencing meat quality [Meyers et al., 2007]. Fine mapping was also

performed on SSC8 to identify 2 candidate genes responsible for ovulation rate [Campbell et

al., 2003]. Due to the availability of the newly developed high-density 60K porcine SNP

chip, large-scale association studies are discovering many new QTL.

A few studies in swine have started to focus on non-traditional QTL effects, such as

eQTL and imprinted QTL. The study of pig eQTL, which measures and treats each

observation of gene expression from a microarray as a new trait, shows promise for traits

with a strong environmental influence on gene expression. A simulation study on selective

transcriptional profiling and data analysis strategies for eQTL mapping in outbred pig

populations was conducted, which suggested a strategy for future eQTL mapping analyses

[Cardoso et al., 2008]. Another non-traditional type of QTL study involves imprinted QTL,

of which many have been found. For example, imprinted QTL were found on SSC2 [Van

Laere et al., 2003] on which the IGF2 gene is located, and on SSC6 and SSC14 [de Koning

et al., 2000; Holl et al., 2004]. The discovery of imprinted QTL provides an opportunity to

further understand the genetic control of traits of interest in pigs.

Linkage disequilibrium (LD) plays a vital role in QTL mapping, so genotyping of high-density SNPs in genes or chromosomal regions of interest is an essential prerequisite to discover the LD structure within a region. This information allows causative mutations,

LD blocks and selective sweeps in a population to be discovered. For example, as evidence that LD levels can be low even within a single gene, an absence of LD was observed between

2 mutations (Arg236His and Asp298Asn) in the MC4R gene, and both mutations were significantly associated with fatness and growth traits [Fan et al., 2009a]. Hence, high-density SNPs may give further clues about causative mutations in certain genes. Similarly, large-scale SNP analyses need to be validated in different populations and breeds due to the variability in some gene sequences and LD levels among different breeds. For instance, studies of the sequence variation of 3 different genes showed no evidence of reduction in genetic variability among different pig

breeds for FABP4 and IGF2 genes, but a significant reduction in genetic variability for

FABP5 in different breeds [Ojeda et al., 2008a, b]. Chromosome-wise LD analyses showed a difference in LD among different commercial pig breeds, as well as in European and Chinese indigenous pig breeds [Du et al., 2007; Amaral et al., 2008]. The LD blocks extended up to 400 kb for European breeds but only 10 kb for Chinese breeds. The average extent of

LD in the pig was somewhat greater than in humans, where it ranged from 10–30 kb. Simulations and case studies have demonstrated that the extent of LD is correlated with the distance between markers, and it depends on the genetic structures of breeds/lines and mating systems

[Nsengimana et al., 2004]. More information on LD levels should be obtained in the near

future by the swine HapMap project’s utilization of the new 60K SNP chip.
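
As a rough illustration of how the pairwise LD discussed above can be quantified, the following sketch computes the disequilibrium coefficient D and the r² statistic for two biallelic SNPs from phased haplotype counts. The function name `ld_r2` and the example counts are hypothetical and are not taken from any of the cited studies; real analyses typically have to estimate haplotype frequencies from unphased genotypes first.

```python
def ld_r2(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2) for two biallelic loci from phased haplotype counts.

    n_AB, n_Ab, n_aB, n_ab are counts of the four possible haplotypes
    (alleles A/a at the first SNP, B/b at the second SNP).
    """
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / n                      # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / n              # allele frequency of A
    p_B = (n_AB + n_aB) / n              # allele frequency of B
    D = p_AB - p_A * p_B                 # deviation from independence
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r2 is undefined when a locus is monomorphic")
    return D, D * D / denom


# Hypothetical counts: complete LD versus no LD.
D_full, r2_full = ld_r2(50, 0, 0, 50)
D_none, r2_none = ld_r2(25, 25, 25, 25)
print(r2_full, r2_none)
```

Plotting r² values of this kind against the distance between marker pairs is one common way to describe how quickly LD decays in a breed or line.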

Candidate gene analyses have been employed to investigate a variety of economic traits. To date, significant association has been demonstrated for candidate genes with litter size (ESR1, PRLR, RBP4, FSHB), growth (MC4R), muscle mass (IGF2, MSTN), meat quality

(HAL, RN, PRKAG3, CAST), disease resistance (K88, FUT1, SLA, NRAMP) and coat color

(KIT, MC1R). The commercial pig industry is actively using many of these gene markers in combination with traditional performance information to improve production by marker-assisted selection (MAS).

While the traits listed above have received most of the attention, many traits of economic importance to the swine industry have been largely overlooked. As QTL studies have become more prevalent, there is more interest in trying to determine QTL for non-traditional traits. For example, recent studies using a high-throughput genotyping technique revealed that the genes CCR7 and CPT1A were associated with sow longevity traits [Mote et al., 2009], the genes CALCR and COL1A2 were significantly associated with leg action [Fan et al., 2009b], and the genes ELF5, KIF18A, COL23A1 and NPTX1 were associated with scrotal hernias in pigs [Du et al., 2009].

DATABASES AND ONLINE RESOURCES

With the rapid advances of the porcine genome sequence and other related projects, a tremendous amount of data is being generated every day. Thus, large databases to store this information and make it publicly available are becoming more essential, as are online resources that enable analysis of this new data. Significant pig bioinformatics efforts have

been undertaken by the Roslin Institute, Sanger Institute, TGAC, NCBI, NAGRP

Bioinformatics and Pig Genome Coordination Programs, and other research institutes around the world. Information is shown in table 1 about the online database resources beneficial for the pig genomics community.

FUNCTIONAL GENOMICS

In recent years, a number of studies have been carried out to profile expression of the pig genome. The number of porcine expressed sequence tags (ESTs) has expanded dramatically, with over 1,600,000 ESTs deposited in GenBank (http://www.ncbi.nlm.nih.gov/sites/entrez).

There are over 180 porcine cDNA libraries which have each contributed more than 1,000 sequences to UniGene. These libraries came from almost 50 categories of different tissues, such as muscle, brain, intestine, ovary, blood and embryonic tissue. In UniGene Build 40, the mRNAs (8,377), high-throughput cDNAs (19,562) and ESTs (1,250,988) have been clustered into 51,270 UniGene sets (http://www.ncbi.nlm.nih.gov/UniGene/UGOrg.cgi?TAXID=9823). Overall, data on pig gene expression is being produced at a rapid rate.

The combination of expression data and the completed genome sequence will certainly contribute to the better understanding of the physiological complexity of the pig transcriptome and uncover genes and gene pathways that control traits of economic importance. Many studies have shown that the differences that exist in the porcine transcriptome are dependent upon breed and developmental stage, among other factors, so there is much left to learn through expression studies. Recently developed techniques such as microarray, quantitative real-time PCR, serial analysis of gene expression (SAGE) and ChIP-chip (chromatin immunoprecipitation on a microarray chip) are being utilized for expression studies in livestock, and these tools can greatly enhance the swine community’s understanding of the porcine genome.

Microarray studies in particular have been used extensively to determine the

biological basis for such traits as reproduction, muscle development, feed intake and host

response to infection in pigs [Tuggle et al., 2007]. Cooperative efforts led by the US Pig

Genome Coordinator and a committee of interested scientists have produced a second-

generation 70-mer cDNA or oligo spotted array (http://www.pigoligoarray.org/). The array is a 20,400-element oligonucleotide microarray with probes designed based on comparison of

pig ESTs to known vertebrate proteins. Affymetrix also sells a 23,937-probe porcine

microarray.

Another technique for expression profiling is SAGE, which is a sequencing-based method. The sequenced SAGE tags are a digital representation of cellular gene expression

and have been used to identify genes differentially expressed at various stages of

development. Successful examples using SAGE profiling for 2 critical embryo development

stages and 3 skeletal development stages in 2 pig breeds have been reported [Blomberg and

Zuelke, 2004; Tang et al., 2007].

ChIP-chip technology received its name because it determines the results from

chromatin immunoprecipitation through use of a microarray. ChIP-chip is used to determine

which DNA regions are bound to different proteins and can be used to identify promoter

regions, enhancers, repressors, and silencers. Though this technology is not currently utilized

in pigs, it may come soon after completion of the genome sequence.

Recently, numerous studies demonstrated that non-protein-coding RNAs, including

small nucleolar RNA (snoRNA), microRNA (miRNA), short interfering RNA (siRNA),

piwi-interacting RNA (piRNA), and natural antisense RNA (NAT), function in gene

expression regulation at many levels, including chromatin architecture, RNA editing, and

RNA stability and translation. Porcine miRNAs were reported from the Sino-Danish 0.66x coverage pig genome sequence, but were not extensively examined

[Wernersson et al., 2005]. More recent studies have identified and begun characterization of miRNAs by means of direct cloning and sequencing, in silico prediction and northern hybridization validation [Huang et al., 2008; Reddy et al., 2009]. These studies suggest that some miRNAs are involved in reproduction, muscle growth and development. Further work on miRNA target site identification, functional mechanisms and biological significance is under way.

RNA interference technology is a method for selectively silencing genes by creating double-stranded RNA. The pilot work using an RNAi technique in pigs was carried out for pig transgenesis studies [Clark et al., 2007]. The successful knock-down of target genes such as porcine endogenous retroviruses (PERVs) by siRNA transformation into cells was observed [Miyagawa et al., 2005; Palmer et al., 2006]. Cong et al. [2010] also showed the possibility of using Salmonella choleraesuis as a vector for siRNA administration to swine, but a standard, efficient platform for RNAi experiments in pigs has yet to be developed.

SEQUENCING PROGRESS

In addition to the recently sequenced livestock animals of biological significance, such as cow, dog, chicken and horse, the pig will be the next one with its entire genome published. With the endeavors of the Swine Genome Sequencing Consortium (SGSC) and many other research institutes, tremendous progress has been made in sequencing the pig genome [Archibald et al., 2010] and developing the related scientific projects such as the porcine HapMap project and a high-density SNP chip. The first swine genome sequence draft

with an overall depth of 4x coverage and 6x in some locations (build 9) became available to

the pig genomics community in April of 2009 (http://www.ensembl.org/Sus_scrofa/Info/Index/). An updated sequence alignment (build 10) incorporating 24x coverage with Illumina

whole genome shotgun (WGS) reads from BGI, 1x coverage of WGS capillary reads from

National Livestock Research Institute-Korea, and additional short-read sequences from

Sanger Institute was available by the end of December 2010 [International Swine Genome

Sequencing Consortium 2010].

The primary strategy for sequencing the pig genome was to combine the WGS

approach with a minimal tiling path (MTP) of bacterial artificial chromosomes (BACs). The

MTP was constructed from previous landmarks, BAC end sequences, and restriction

fingerprinting of over 260,000 BACs derived from 4 BAC libraries that have been used for

positional cloning and QTL mapping studies [Schook et al., 2005]. It was estimated that the

length of the entire pig genome was about 2.6 Gb, and the original plan was to sequence

16,000–17,000 BAC clones and generate a draft 6x coverage of the genome from a

3x coverage of BACs comprising the MTP and a 3x coverage of WGS sequencing.

Sequencing work and assembly are ongoing at the Wellcome Trust Sanger Institute and the

Genome Analysis Centre, respectively, through this international collaboration composed of many different laboratories.
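
The coverage figures in the sequencing plan above follow from simple arithmetic: fold coverage is the total number of sequenced bases divided by the genome length. The sketch below works through the planned 3x BAC plus 3x WGS design using the 2.6 Gb genome estimate from the text. The function names are hypothetical, and the calculation ignores read overlap, quality trimming and uneven sampling, so it is only a back-of-the-envelope check.

```python
GENOME_SIZE_BP = 2.6e9  # estimated pig genome length quoted in the text (~2.6 Gb)


def bases_needed(fold_coverage, genome_size_bp=GENOME_SIZE_BP):
    """Total sequenced bases required to reach a given average fold coverage."""
    return fold_coverage * genome_size_bp


def fold_coverage(total_bases, genome_size_bp=GENOME_SIZE_BP):
    """Average fold coverage obtained from a given number of sequenced bases."""
    return total_bases / genome_size_bp


# Planned design from the text: 3x from BACs on the MTP plus 3x from WGS reads.
bac_bases = bases_needed(3)
wgs_bases = bases_needed(3)
total_fold = fold_coverage(bac_bases + wgs_bases)

print(f"BAC bases: {bac_bases / 1e9:.1f} Gb")
print(f"WGS bases: {wgs_bases / 1e9:.1f} Gb")
print(f"Combined coverage: {total_fold:.1f}x")
```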

In contrast to the methods used for human and mouse genome sequencing, a high-resolution physical map integrating the fingerprint map and BAC end sequence data was constructed before the initiation of whole-genome sequencing. This map comprised 172 high

continuity clone contigs distributed over all 18 autosomes and the 2 sex chromosomes, X and

Y, representing ~2.6 Gb [Humphray et al., 2007]. A total of 98.3% of this integrated physical

map has been selected in 16,273 BAC clones for sequencing. To date (January 24th, 2011), a

total of 16,193 of these clones have been sequenced, which contributes 3,013 Mb (180.7 Mb

finished quality) (http://www.sanger.ac.uk/Projects/S_scrofa/) [International Swine Genome

Sequencing Consortium 2010]. In addition, efforts are being made to close the remaining

gaps by incorporating next-generation sequence reads from several sources. Information on

the current status of the genome sequencing project, as well as on each chromosome, is

available from the Sanger Institute website (http://www.sanger.ac.uk/Projects/S_scrofa) and

NAGRP Pig Genome Coordination Program website (http://www.animalgenome.org/pigs/genomesequence).

Particular regions on a few chromosomes were selected for deeper sequencing (6–8x

coverage) due to the potential biological effects of these regions and the availability of

funding. The Institute for Pig Genetics in the Netherlands funded the additional sequencing

on all of chromosome 4, while the EU SABRE project funded additional sequencing on

chromosomes 7 and 14. Chromosome 17 was also made a high priority for sequencing.

Further work on the sex chromosomes has been funded and the Wellcome Trust Sanger

Institute plans to further annotate those chromosomes. The sequences from chromosomes 1–18 and X have been released into the ENSEMBL website (http://www.ensembl.org/Sus_scrofa/Info/Index/). The pig X chromosome sequencing project is also making excellent progress through the sequencing pipeline at the Sanger Institute, with sequence data now available for 879 clones.

Manual annotation of 3 interesting regions from the pig genome was completed, which included 2.4 Mb of the major histocompatibility complex (MHC) region on

SSC7p1.1–q1.1, ~8 Mb of chromosome 17, and part of a ~300-kb region on chromosome 6 orthologous to the human leukocyte receptor complex (LRC). The annotated data are publicly available via the Vertebrate Genome Annotation (VEGA) website (http://vega.sanger.ac.uk/Sus_scrofa/index.html). The work on assembly and annotation of the remaining

sequenced regions is ongoing. Even after build 10 of the genome is published, more targeted

sequencing will still be needed to fill the remaining gaps for annotation to be complete.

Using the new genome sequence, the genetic effects observed in pigs when they are

studied as models of human diseases can be more easily translated into predictions for

humans. Though we currently do not know the level of similarity between orthologous

regions in pigs and humans, the previous draft 0.66x coverage of the genome completed by

the Sino-Danish Pig Genome Project [Wernersson et al., 2005] and the annotated region on

SSC17 [Hart et al., 2007] demonstrated that the pig was much closer to human than mouse,

as shown from the types of syntenic data. Pigs also seem to have an isochore structure most

similar to that in humans.

One discovery from human genome variation studies that is ongoing in pigs with the

new genome sequence is the identification of copy number variants (CNVs) [Fadista et al.,

2010]. CNVs represent those DNA segments longer than 1 kb present at variable copy

number between different individuals, and they might be responsible for phenotypic variation

in disease-associated traits. Few CNVs had been uncovered in pigs until recently, when a

custom tiling oligonucleotide array was used for screening unrelated individuals on the

regions of SSC4, 7, 14 and 17 already sequenced and assembled. This screen identified 37

CNV regions [Fadista et al., 2008]. A feasible approach for CNV genotyping similar to the

human CNV array should be developed in the future.

SNP CHIP

Recent developments in the pig genome sequence and in next-generation sequencing technologies have driven the production of a high-density porcine SNP chip. A

60K chip has been developed and distributed by the International Swine SNP Consortium and Illumina for the pig community, and it will serve as an extremely useful tool for a variety of research fields including animal genetic evaluation, whole-genome association, LD mapping, population genetics and evolutionary genomics studies.

The porcine SNP chip was designed as an Illumina Infinium iSelect™ SNP

BeadChip with a selection of 64,232 SNPs, which were chosen from a total of over 549,000

SNPs [Ramos et al., 2009]. The SNPs genotyped by this SNP chip include already validated

SNPs from the public dbSNP database and those mined from the deposited porcine shotgun sequences [Kerstens et al., 2009], as well as SNPs identified de novo using high-throughput sequencing techniques from reduced representation libraries (RRL) [Wiedmann et al., 2008;

Ramos et al., 2009]. Decisions about which SNPs to include on the SNP chip were primarily based on each SNP’s position in the genome to ensure even coverage across chromosomes.

Other criteria considered include the estimated minor allele frequency (MAF) and whether the SNP passed other quality control measures and standards for Illumina Infinium iSelect

SNP BeadChip preparation. Based on build 9 of the porcine genome, the median distance between SNPs is 28 kb, though stretches as long as 1.7 Mb exist. The long gaps could be a result of current issues with the genome assembly, and their lengths will likely change in the upcoming build. Most of the SNPs reside on the autosomes and X chromosome.
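
To make the spacing statistics above concrete, the sketch below shows one way a median inter-SNP distance and a longest gap could be derived from marker positions on a single chromosome. The function name `gap_summary` and the positions are invented for illustration, chosen only so that the output resembles the 28 kb median and 1.7 Mb maximum quoted in the text; they are not actual chip coordinates.

```python
from statistics import median


def gap_summary(positions_bp):
    """Return (median_gap, max_gap) in bp between adjacent SNPs on one chromosome."""
    pos = sorted(positions_bp)
    if len(pos) < 2:
        raise ValueError("need at least two SNP positions")
    # Distance from each SNP to the next one along the chromosome.
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    return median(gaps), max(gaps)


# Hypothetical positions: evenly spaced SNPs followed by one long assembly gap.
example_positions = [10_000, 38_000, 66_000, 94_000, 1_794_000]
median_gap, max_gap = gap_summary(example_positions)
print(median_gap, max_gap)
```

The median is used rather than the mean because a few very long gaps, such as those caused by assembly problems, would otherwise dominate the summary.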

A pilot study using the 60K SNP chip was carried out at Wageningen University using a panel of 158 individuals representing the original discovery panel. From this study,

62,121 SNPs (97.5%) were successfully scored. Minor allele frequencies were fairly high with 58,994 SNPs being polymorphic of which 57,109 had MAF > 0.05. The average MAF for all scorable SNPs was 0.274 [Ramos et al., 2009]. When the panel was tested on other

Sus species, 76–99% of the SNPs were monomorphic [Groenen et al., 2009].
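The summary statistics reported for the pilot study (number of SNPs scored, number polymorphic, number with MAF > 0.05, and mean MAF) can be derived from a genotype table as sketched below. This is an illustrative sketch, not the pipeline used at Wageningen; the 0/1/2 genotype coding, the use of `None` for a failed call, and the tiny example panel are all assumptions.

```python
def maf(genotypes):
    """Minor allele frequency from genotypes coded 0/1/2 (copies of allele B).

    None marks a failed call. Returns None if no animal was called.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    freq_b = sum(called) / (2 * len(called))
    return min(freq_b, 1 - freq_b)


def summarize(panel, maf_threshold=0.05):
    """Count scored, polymorphic and common SNPs and average the MAF."""
    mafs = [m for m in (maf(g) for g in panel.values()) if m is not None]
    return {
        "scored": len(mafs),
        "polymorphic": sum(1 for m in mafs if m > 0),
        "above_threshold": sum(1 for m in mafs if m > maf_threshold),
        "mean_maf": sum(mafs) / len(mafs) if mafs else None,
    }


# Hypothetical four-animal panel.
panel = {
    "snp1": [0, 1, 2, 1],              # MAF 0.5
    "snp2": [0, 0, 0, 0],              # monomorphic
    "snp3": [None, None, None, None],  # failed assay, not scored
    "snp4": [2, 2, 2, 1],              # MAF 0.125
}
summary = summarize(panel)
print(summary)
```

Note that the mean here is taken over all scored SNPs, including monomorphic ones, which matches the wording "average MAF for all scorable SNPs".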

A POSSIBLE FUTURE FOR SWINE GENOMICS

Large-scale whole-genome association analyses are already underway with the new

60K SNP chip in pigs. These analyses will move the field beyond the conventional one-trait-one-marker association model and accelerate progress in the search for genetic variation underlying the inheritance of important production traits and complex genetic diseases [Georges, 2007;

Rothschild and Plastow, 2008; Andersson, 2009]. The vast amounts of data produced will allow for haplotype block construction and linkage disequilibrium mapping analyses to determine the important chromosomal regions and facilitate positional cloning. In addition, selective sweeps and positive selection events on breed formation history can be identified through high-density SNP genotyping. These new chips will make it much easier to combine information from multiple studies to complete meta-analyses, since the markers used in each study will be consistent.

Another method that requires high-density markers is genomic selection, which integrates genotypic, phenotypic and pedigree data for breeding value prediction and marker-assisted selection on a genome-wide scale. Different methods for genomic selection have been developed over the past decade, and simulations show it can increase the accuracy of estimates of genetic merit while reducing the generation interval [Meuwissen, 2007]. Genomic estimated breeding values (GEBV) have already been incorporated into the selection model for

the USDA’s genetic evaluation of dairy cattle in the US, and similar utilization of genomic information to evaluate pigs will be feasible in the near future.
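The prediction step of genomic selection can be illustrated with a short sketch. Once marker effects have been estimated in a training population (the estimation itself is not shown and differs among the methods mentioned above), a candidate's genomic breeding value is commonly computed as the sum of its genotypes weighted by those effects. The marker effects, animal names and genotypes below are hypothetical.

```python
def gebv(genotypes, marker_effects):
    """Genomic breeding value as the sum over SNPs of genotype (0/1/2) x estimated effect."""
    if len(genotypes) != len(marker_effects):
        raise ValueError("one effect is required per genotyped SNP")
    return sum(g * e for g, e in zip(genotypes, marker_effects))


# Hypothetical effects for three SNPs, assumed to come from a training population.
marker_effects = [0.5, -0.2, 0.1]

# Hypothetical selection candidates with genotypes coded as allele counts.
candidates = {
    "boar_A": [2, 1, 0],
    "boar_B": [0, 0, 2],
    "boar_C": [1, 0, 1],
}

values = {name: gebv(g, marker_effects) for name, g in candidates.items()}
# Rank candidates from highest to lowest predicted merit.
ranking = sorted(values, key=values.get, reverse=True)
print(ranking)
```

Because the prediction needs only a genotype, candidates can be ranked at birth, which is how genomic selection shortens the generation interval.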

The rapid development of structural genomics and functional genomics is making

‘systems biology’ feasible in pigs. The integration of datasets from various levels of the genome, transcriptome, proteome, epigenome and others will contribute to better understanding of individual genes, genetic pathways and regulatory networks to identify significant variation in genes controlling important traits of interest in pigs [Brazma et al.,

2006].

Interest in the pig as a biological model for human biology has continued for several years. Because of the size and physiological similarity of pigs to humans, the pig has become a model of choice for a vast array of disciplines such as nutrition, digestive physiology, cardiovascular disease, diabetes, skin formation and healing, maternal mortality, melanoma and infectious disease [Lunney, 2007; Quilter et al., 2008]. One ongoing project utilizing pigs as a biological model for human disease involves measuring the phenotypic responses of thousands of pigs to Porcine Reproductive and Respiratory Syndrome Virus (PRRSV). The

International PRRS Host Genetics Consortium is working to map QTL for response to

PRRSV, in hopes of discovering loci that can make pigs more resistant to PRRSV and gaining knowledge that could be applied to infectious disease response in humans. In particular, these results could provide information about viral persistence and transmission pathways [Lunney, 2007].

CONCLUSIONS

The rapid advances in structural and functional genomics of the pig provide a good opportunity to better understand the genetic variation underlying economically important and

complex phenotypes and may also be of value for understanding human health. The high synteny between the pig and human genome sequences, as revealed by several newly developed high-resolution comparative physical maps, suggests that the pig may be the best

biomedical model to study many human diseases. Use of the new high-density porcine SNP

chip in association analyses will identify many interesting gene effects, but collaboration and

joint analyses between different research groups will become more critical. The development of novel statistical methods for linkage disequilibrium mapping, fine mapping and whole-genome selection is urgent, so as to speed the utilization of existing and forthcoming pig genomics information. The discovery and characterization of genome sequence variants such

as SNPs, CNVs and indels, as well as non-coding RNA sequences like miRNA, piRNA and

NATs, are necessary. The organization, integration, and optimization of results from all

research levels using a ‘systems biology’ approach will provide a bountiful harvest for the

pig genomics community. Overall the pig genomics community has made great strides in

recent years and will reap the considerable rewards in terms of new knowledge applicable to

both agriculture and human health.

ACKNOWLEDGEMENTS

The authors acknowledge the partial support provided by the National Pork Board,

Iowa Pork Producers Association, U.S. Pig Genome Coordination Program, the College of

Agriculture and Life Sciences at Iowa State University, State of Iowa and Hatch funding and

the activities of the International Swine Sequencing Consortium. Support for D.M. Gorbach

is provided by USDA National Needs Graduate Fellowship Competitive Grant No. 2007-

38420-17767 from the National Institute of Food and Agriculture.

REFERENCES

Albarella U, Dobney K, Ervynck A, Rowley-Conwy PE: Pigs and Humans: 10,000 Years of

Interaction (Oxford University Press, Oxford 2007).

Amaral AJ, Megens HJ, Crooijmans RP, Heuven HC, Groenen MA: Linkage disequilibrium

decay and haplotype block structure in the pig. Genetics 179:569–579 (2008).

Andersson L: Genome-wide association analysis in domestic animals: a powerful approach

for genetic dissection of trait loci. Genetica 136:341–349 (2009).

Archibald AL, Haley CS, Brown JF, Couperwhite S, Mcqueen HA, et al: The PiGMaP

consortium linkage map of the pig (Sus scrofa). Mamm Genome 6:157–175 (1995).

Archibald AL, Boland L, Churcher C, Fredholm M, Groenen MAM, et al: Pig genome

sequence – analysis and publication strategy. BMC Genomics 11:438 (2010).

Blomberg LA, Zuelke KA: Serial analysis of gene expression (SAGE) during porcine

embryo development. Reprod Fert Develop 16:87–92 (2004).

Brazma A, Krestyaninova M, Sarkans U: Standards for systems biology. Nat Rev Genet

7:593–605 (2006).

Campbell EM, Nonneman D, Rohrer GA: Fine mapping a quantitative trait locus affecting

ovulation rate in swine on chromosome 8. J Anim Sci 81:1706–1714 (2003).

Cardoso FF, Rosa GJ, Steibel JP, Ernst CW, Bates RO, Tempelman RJ: Selective

transcriptional profiling and data analysis strategies for expression quantitative trait

loci mapping in outbred F2 populations. Genetics 180:1679–1690 (2008).

Clark KJ, Carlson DF, Fahrenkrug SC: Pigs taking wing with transposons and recombinases.

Genome Biol 8 Suppl 1:S13 (2007).

Clutton-Brock J: A Natural History of Domesticated Mammals, ed 2 (Cambridge University

Press, Cambridge; Natural History Museum, New York 1999).

Cong W, Jin H, Jiang C, Yan W, Liu M, et al: Attenuated Salmonella choleraesuis-mediated

RNAi targeted to conserved regions against foot-and-mouth disease virus in guinea

pigs and swine. Vet Res 41:30 (2010).

de Koning DJ, Rattink AP, Harlizius B, van Arendonk JA, Brascamp EW, Groenen MA:

Genome-wide scan for body composition in pigs reveals important role of imprinting.

Proc Natl Acad Sci USA 97:7947–7950 (2000).

Department of Animal Science Oklahoma State University: Breeds of Livestock: Swine

http://www.ansi.okstate.edu/breeds/swine (1995).

Du FX, Clutter AC, Lohuis MM: Characterizing linkage disequilibrium in pig populations.

Int J Biol Sci 3:166–178 (2007).

Du ZQ, Zhao X, Vukasinovic N, Rodriguez F, Clutter AC, Rothschild MF: Association and

haplotype analyses of positional candidate genes in five genomic regions linked to

scrotal hernia in commercial pig lines. Plos One 4:e4837 (2009).

Fadista J, Nygaard M, Holm LE, Thomsen B, Bendixen C: A snapshot of CNVs in the pig

genome. Plos One 3:e3916 (2008).

Fadista J, Zhan B, Hedegaard J, Bendixen C: Structural Variation in the Pig Genome Using

Next-Generation Sequencing (32nd Conference of the International Society for

Animal Genetics, Edinburgh, Scotland 2010).

Fan B, Onteru SK, Plastow GS, Rothschild MF: Detailed characterization of the porcine

MC4R gene in relation to fatness and growth. Anim Genet 40:401–409 (2009a).

Fan B, Onteru SK, Mote BE, Serenius T, Stalder KJ, Rothschild MF: Large-scale association

study for structural soundness and leg locomotion traits in the pig. Genet Sel Evol

41:14 (2009b).

Georges M: Mapping, fine mapping, and molecular dissection of quantitative trait loci in

domestic animals. Annu Rev Genomics Hum Genet 8:131–162 (2007).

Groenen MAM, Crooijmans RPMA, Ramos AM, Amaral AJ, Kerstens H, et al: Design of

the Illumina Porcine 50K+ SNP iSelect Beadchip and Characterization of the Porcine

HapMap Population (Plant and Animal Genome Conference XVII, San Diego, CA,

USA 2009).

Hart EA, Caccamo M, Harrow JL, Humphray SJ, Gilbert JG, et al: Lessons learned from the

initial sequencing of the pig genome: comparative analysis of an 8 Mb region of pig

chromosome 17. Genome Biol 8:R168 (2007).

Holl JW, Cassady JP, Pomp D, Johnson RK: A genome scan for quantitative trait loci and

imprinted regions affecting reproduction in pigs. J Anim Sci 82:3421–3429 (2004).

Hu ZL, Dracheva S, Jang W, Maglott D, Bastiaansen J, et al: A QTL resource and

comparison tool for pigs: PigQTLDB. Mamm Genome 16:792–800 (2005).

Huang TH, Zhu MJ, Li XY, Zhao SH: Discovery of porcine microRNAs and profiling from

skeletal muscle tissues during development. Plos One 3:e3225 (2008).

Humphray SJ, Scott CE, Clark R, Marron B, Bender C, et al: A high utility integrated map of

the pig genome. Genome Biol 8:R139 (2007).

International Swine Genome Sequencing Consortium: Improving the High-Quality Draft

Swine Genome Reference: The Biology of Genomes (Cold Spring Harbor, NY, USA

2010).

Kerstens HH, Kollers S, Kommadath A, Del Rosario M, Dibbits B, et al: Mining for single

nucleotide polymorphisms in pig genome sequence data. BMC Genomics 10:4

(2009).

Kim JJ, Rothschild MF, Beever J, Rodriguez-Zas S, Dekkers JC: Joint analysis of two breed

cross populations in pigs to improve detection and characterization of quantitative

trait loci. J Anim Sci 83:1229–1240 (2005).

Li XP, Hu ZL, Moon SJ, Do KT, Ha YK, et al: Development of an in silico coding gene SNP

map in pigs. Anim Genet 39:446–450 (2008).

Lunney JK: Advances in swine biomedical model genomics. Int J Biol Sci 3:179–184

(2007).

Meuwissen T: Genomic selection: marker assisted selection on a genome wide scale. J Anim

Breed Genet 124:321–322 (2007).

Meyers SN, Rogatcheva MB, Larkin DM, Yerle M, Milan D, et al: Piggy-BACing the human

genome. II. A high-resolution, physically anchored, comparative map of the porcine

autosomes. Genomics 86:739–752 (2005).

Meyers SN, Rodriguez-Zas SL, Beever JE: Fine-mapping of a QTL influencing pork

tenderness on porcine chromosome 2. BMC Genet 8:69 (2007).

Miyagawa S, Nakatsu S, Nakagawa T, Kondo A, Matsunami K, et al: Prevention of PERV

infections in pig to human xenotransplantation by the RNA interference silences

gene. J Biochem 137:503–508 (2005).

Mote BE, Koehler KJ, Mabry JW, Stalder KJ, Rothschild MF: Identification of genetic

markers for productive life in commercial sows. J Anim Sci 87:2187–2195 (2009).

Muñoz G, Ovilo C, Silió L, Tomás A, Noguera JL, Rodríguez MC: Single- and joint-

population analyses of two experimental pig crosses to confirm quantitative trait loci

on Sus scrofa chromosome 6 and leptin receptor effects on fatness and growth traits. J

Anim Sci 87:459–468 (2009).

Nsengimana J, Baret P, Haley CS, Visscher PM: Linkage disequilibrium in the domesticated

pig. Genetics 166:1395–1404 (2004).

Ojeda A, Estelle J, Folch JM, Perez-Enciso M: Nucleotide variability and linkage

disequilibrium patterns at the porcine FABP5 gene. Anim Genet 39: 468–473

(2008a).

Ojeda A, Huang LS, Ren J, Angiolillo A, Cho IC, et al: Selection in the making: a worldwide

survey of haplotypic diversity around a causative mutation in porcine IGF2 . Genetics

178:1639–1652 (2008b).

Onteru SK, Fan B, Rothschild MF: SNP detection and comparative linkage mapping of 66

bone-related genes in the pig. Cytogenet Genome Res 122:122–131 (2008).

Palmer ML, Lee SY, Carlson D, Fahrenkrug S, O’Grady SM: Stable knockdown of CFTR

establishes a role for the channel in P2Y receptor-stimulated anion secretion. J Cell

Physiol 206:759–770 (2006).

Quilter CR, Gilbert CL, Oliver GL, Jafer O, Furlong RA, et al: Gene expression profiling in

porcine maternal infanticide: a model for puerperal psychosis. Am J Med Genet B

Neuropsychiatr Genet 147B:1126–1137 (2008).

Ramos AM: Molecular Dissection of Genomic Regions Influencing Important Economic

Traits in the Pig: Animal Science (Iowa State University, Ames, IA, USA 2006).

Ramos AM, Crooijmans RP, Affara NA, Amaral AJ, Archibald AL, et al: Design of a high

density SNP genotyping assay in the pig using SNPs identified and characterized by

next generation sequencing technology. Plos One 4:e6524 (2009).

Reddy AM, Zheng Y, Jagadeeswaran G, Macmil SL, Graham WB, et al: Cloning,

characterization and expression analysis of porcine microRNAs. BMC Genomics

10:65 (2009).

Rischkowsky B, Pilling D: The State of the World’s Animal Genetic Resources for Food and

Agriculture (FAO, Rome 2007).

Rogatcheva MB, Chen K, Larkin DM, Meyers SN, Marron BM, et al: Piggy-BACing the

human genome. I. Constructing a porcine BAC physical map through comparative

genomics. Anim Biotechnol 19:28–42 (2008).

Rohrer GA, Alexander LJ, Hu ZL, Smith TPL, Keele JW, Beattie CW: A comprehensive

map of the porcine genome. Genome Res 6:371–391 (1996).

Rohrer G, Beever JE, Rothschild MF, Schook L, Gibbs R, Weinstock G: Porcine genomic

sequencing initiative. http://www.animalgenome.org/pigs/community/WhitePaper/

2002.html (2002).

Rothschild MF, Plastow GS: Impact of genomics on animal agriculture and opportunities for

animal health. Trends Biotechnol 26:21–25 (2008).

Rothschild MF, Ruvinsky A: The Genetics of the Pig, ed 2. (CABI, Wallingford,

Oxfordshire, UK 2010).

Rothschild MF, Hu ZL, Jiang Z: Advances in QTL mapping in pigs. Int J Biol Sci 3:192–197

(2007).

Schook LB, Beever JE, Rogers J, Humphray S, Archibald A, et al: Swine Genome

Sequencing Consortium (SGSC): a strategic roadmap for sequencing the pig genome.

Comp Funct Genomics 6:251–255 (2005).

Tang ZL, Li Y, Wan P, Li XP, Zhao SH, et al: LongSAGE analysis of skeletal muscle at

three prenatal stages in Tongcheng and Landrace pigs. Genome Biol 8:R115

(2007).

Tuggle CK, Wang YF, Couture O: Advances in swine transcriptomics. Int J Biol Sci 3:132–

152 (2007).

Van Laere AS, Nguyen M, Braunschweig M, Nezer C, Collette C, et al: A regulatory

mutation in IGF2 causes a major QTL effect on muscle growth in the pig. Nature

425:832–836 (2003).

Wernersson R, Schierup MH, Jorgensen FG, Gorodkin J, Panitz F, et al: Pigs in sequence

space: A 0.66X coverage pig genome survey based on shotgun sequencing. BMC

Genomics 6:70 (2005).

Wiedmann RT, Smith TP, Nonneman DJ: SNP discovery in swine by reduced representation

and high throughput pyrosequencing. BMC Genet 9:81 (2008).

Yerle M, Echard G, Robic A, Mairal A, Dubut-Fontana C, et al: A somatic cell hybrid panel

for pig regional gene mapping characterized by molecular cytogenetics. Cytogenet

Cell Genet 73:194–202 (1996).

Yerle M, Pinton P, Robic A, Alfonso A, Palvadeau Y, et al: Construction of a whole-genome

radiation hybrid panel for high-resolution gene mapping in pigs. Cytogenet Cell

Genet 82:182–188 (1998).

Yerle M, Pinton P, Delcros C, Arnal N, Milan D, Robic A: Generation and characterization

of a 12,000-rad radiation hybrid panel for fine mapping in pig. Cytogenet Genome

Res 97:219–228 (2002).

Table 1. The online resources and databases useful for the pig genomics community

Category | Online resource/database | Description | URL
Genome | US pig genome mapping site | Comprehensive | http://www.animalgenome.org/pigs/
Genome | Pig genome resource at NCBI | Comprehensive | http://www.ncbi.nlm.nih.gov/genome/guide/pig/
Genome | Swine Genome Sequencing Consortium (SGSC) | Comprehensive | http://www.piggenome.org/
Sequencing | Porcine genome sequencing project at Sanger Institute | Physical mapping and genome sequencing | http://www.sanger.ac.uk/Projects/S_scrofa/
Sequencing | Sino-Danish pig genome sequencing project | Genome sequences and EST sequences | http://piggenome.dk/
Sequencing | Pre-Ensemble (Sus scrofa) | Preliminary assemblies for chromosomes 1-18 and X | http://pre.ensembl.org/Sus_scrofa/Info/Index
Sequencing | VEGA (Sus scrofa) | Preliminary annotation of the sequenced region on SSC6, 7 and 17 | http://vega.sanger.ac.uk/Sus_scrofa/index.html
SNP | dbSNP (Sus scrofa) | Database of SNPs and other genetic variation | http://www.ncbi.nlm.nih.gov/SNP/snp_batchSearch.cgi?org=9823&type=SNP
Linkage map | ArkDB | Linkage mapping, RH mapping, physical mapping | http://www.thearkdb.org/arkdb/do/getChromosomeDetails?accession=ARKSPC00000001
Linkage map | USDA/MARC Linkage Map | Genetic linkage map | http://www.marc.usda.gov/genome/swine/swine.html
Physical map | Cytogenetic Map | Pig-human comparative map | https://www-lgc.toulouse.inra.fr/pig/cyto/cyto.htm
Physical map | IMpRH maps | IMpRH mapping | https://www-lgc.toulouse.inra.fr/pig/RH/Menuchr.htm
Expression | DFCI Pig Gene Index | ESTs annotation and cluster | http://compbio.dfci.harvard.edu/tgi/cgi-bin/tgi/gimain.pl?gudb=pig
Expression | Agenae Project | Pig ESTs | http://public-contigbrowser.sigenae.org:9090/Sus_scrofa/index.html
Expression | Pig array | Porcine microarray | http://www.pigoligoarray.org/
Expression | UniGene (Sus scrofa) | Transcribed sequences and gene-based clusters | http://www.ncbi.nlm.nih.gov/UniGene/UGOrg.cgi?TAXID=9823
Disease | Online Mendelian Inheritance in Animals | Porcine genetic disorders | http://omia.angis.org.au/adv_search_results.shtml?field1=sci_name&query1=Sus+scrofa
Disease | MHC database | Swine SLA sequences | http://www.ebi.ac.uk/ipd/mhc/sla/index.html
Disease | IMGT® | Immunogenetics | http://imgt.cines.fr/textes/IMGTveterinary/#Pig
QTL | PigQTLdb | QTL repository and integration of linkage and physical maps | http://www.animalgenome.org/QTLdb/pig.html
QTL | GridQTL | QTL mapping | http://www.gridqtl.org.uk/
Non-coding RNA | miRBase | Published miRNA sequences and annotation | http://microrna.sanger.ac.uk/cgi-bin/targets/v5/genome.pl
Non-coding RNA | Rfam database | Non-coding RNAs annotation | http://rfam.sanger.ac.uk/
Bioinformatics | NAGRP Bioinformatics Coordination Program | Bioinformatics tools | http://www.animalgenome.org/bioinfo/
Bioinformatics | Geocities | Bioinformatics tool list | http://www.geocities.com/bioinformaticsweb/tools.html
Bioinformatics | BLAST (Sus scrofa) | BLAST Pig Sequences | http://www.ncbi.nlm.nih.gov/projects/genome/seq/BlastGen/BlastGen.cgi?taxid=9823

APPENDIX C. DEVELOPMENT AND APPLICATION OF HIGH-DENSITY SNP ARRAYS IN GENOMIC STUDIES OF DOMESTIC ANIMALS

A paper published in Asian-Australasian Journal of Animal Sciences. Volume 23(7):833-847.

Bin Fan, Zhi-Qiang Du, Danielle M. Gorbach, Max F. Rothschild

ABSTRACT

In the past decade, there have been many advances in whole-genome sequencing in

domestic animals, as well as the development of “next-generation” sequencing technologies

and high-throughput genotyping platforms. Consequently, these advances have led to the

creation of the high-density SNP array as a state-of-the-art tool for genetics and genomics analyses of domestic animals. The emergence and utilization of SNP arrays will have significant impacts not only on the scale, speed, and expense of SNP genotyping, but also on theoretical and applied studies of quantitative genetics, population genetics and molecular evolution. The most promising applications in agriculture could be genome-wide association studies (GWAS) and genomic selection for the improvement of economically important traits. However, some challenges still face these applications, such as incorporating linkage disequilibrium (LD) information from HapMap projects, data storage, and especially appropriate statistical analyses on the high-dimensional, structured genomics data. More efforts are still needed to make better use of the high-density SNP arrays in both academic studies and industrial applications.

INTRODUCTION

The first decade of the 21st century has been a golden time for the advancement of genomics, driven by the completion of the Human Genome Project (HGP). Various

methodologies and technologies have been developed during and after the process of

building the human genetic blueprint that have been directly transferred into the studies of

domestic animal genomics (Andersson, 2009; Goddard and Hayes, 2009; Rothschild et al.,

2010). The search for genetic underpinnings of human diseases perplexed researchers for

many years. Only recently did the genetic factors underlying various human diseases begin to

be revealed, especially with the help of genome-wide association studies (GWAS) using SNP arrays.

Single nucleotide polymorphisms (SNPs) are bi-allelic genetic markers that are easy to evaluate and interpret and are widely distributed within genomes. With proper coverage and density over the whole genome, SNPs can capture the linkage disequilibrium

(LD) information embedded in the genome, which could be used to pinpoint genes underlying human diseases. For domestic animals, these tools can contribute to i) better understanding of species’ evolution, domestication and breed formation, and developing new theories of population genetics; ii) dissecting the genetic mechanisms of complex agricultural traits; and iii) improving selection methods for genetic improvement of animal production.
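As a concrete illustration of what it means for SNPs to capture LD, the sketch below computes the widely used r² statistic between two bi-allelic SNPs from phased haplotypes. The 0/1 allele coding and the example haplotypes are assumptions made for illustration; array data are unphased genotypes, so in practice haplotypes would first have to be inferred.

```python
def r_squared(haplotypes):
    """LD r^2 between two bi-allelic SNPs.

    haplotypes: list of (a, b) pairs, the 0/1 alleles carried at SNP 1 and SNP 2
    on each phased chromosome.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n            # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n            # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                               # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom


# Alleles always travel together: complete LD.
complete_ld = [(1, 1), (1, 1), (0, 0), (0, 0)]
# All four allele combinations equally frequent: no LD.
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(r_squared(complete_ld), r_squared(no_ld))
```

A marker in high r² with an ungenotyped causal variant serves as a proxy for it, which is why array density relative to the extent of LD determines mapping power.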

High-density SNP arrays were built for important farm animals, first for those with reference genomes and more recently for others without reference genomes, following the advent and application of massively parallel sequencing technologies. The preparation and utilization of SNP arrays are having considerable impacts on the theory and practice of animal breeding and genetics, and these arrays will play important roles in the years to come.

In this review, the whole-genome sequencing and HapMap studies of several important domestic animals are briefly summarized. Then, details about the development of

SNP arrays and their applications in various genetics and genomics research projects are also

reviewed. Lastly, lessons learned from the reported studies and prospects for future work are

discussed.

ANIMAL GENOME SEQUENCING AND HapMap PROJECTS

The whole-genome sequencing strategies for most domestic animals were taken directly from

human genome sequencing, i.e., combining both whole-genome shotgun (WGS) and BAC-

to-BAC sequencing (Green, 2001). Based on their significance in agriculture and as

biomedical models, chickens, dogs, cattle, horses and pigs have had their genomes

sequenced, as well as some other important animals (Table 1). Due to the rapid development

of “next-generation” sequencing technologies and the availability of reference genomes,

these strategies have been modified for different species. For those well-studied species in which high levels of genomics knowledge and sequence coverage are required, such as chicken, dog, cattle, horse and pig, the dual approaches of WGS and BAC sequencing were applied.

These sequences provide comprehensive information for comparative genomics studies on the evolution and function of important genes and genomic regions (International

Chicken Genome Sequencing Consortium, 2004; Lindblad-Toh et al., 2005; Dalrymple et al.,

2007; The Bovine Genome Sequencing and Analysis Consortium, 2009; Wade et al., 2009;

Groenen et al., 2010a). The comparative studies among the genomes of human and domestic animals have also demonstrated a high level of conservation and orthology for protein coding genes. However, huge differences were found in non-coding regions, especially intergenic

repetitive regions, which may be one of the major forces driving evolution. The HapMap

studies also revealed abundant genetic variability within and between domestic breeds. The

majority of the variation was discovered by large-scale genotyping of SNPs and insertions or

deletions of DNA fragments with variable sizes, such as copy number variation (CNV),

which could in part contribute to the phenotypic diversity of domestic animals.

The additional information derived from linkage mapping, radiation hybrid (RH) mapping, fluorescence in situ hybridization (FISH) mapping and expressed sequence tags (ESTs) was used to assist in genome assembly and annotation. For those with 'light' coverage genomes, such as dog (1.5×) and cat (2×), WGS with “next-generation” sequencing technologies was utilized, and the sequences were assembled using human and other closely related species as references. More recently, due to the reduced cost of sequencing, both deep sequencing and individual genome sequencing have been attempted in the cow and chicken (Eck et al., 2009; Rubin et al., 2010), along with the 1,000 Genomes Project in human.

With the completion of whole-genome sequencing of domestic animals, HapMap projects were developed. Since domestic animals have rich sources of phenotypic diversity, which can be interrogated by SNPs across the genome, HapMap studies can be helpful to characterize the complexity of a genome and disentangle the genetic bases of complex traits

(International Chicken Polymorphism Map Consortium, 2004; Lindblad-Toh et al., 2005;

The Bovine HapMap Consortium, 2009; Wade et al., 2009; Groenen et al., 2010a).

The advantages of the HapMap studies include i) production of a large number of

SNPs for design and preparation of high-density SNP arrays; ii) clarification of the genetic relationships among diverse breeds and the phylogenetic relationships between domestic animals and their wild ancestors; iii) prediction of the potentially significant historical events

that occurred during domestication and breed formation, such as bottleneck effects and

selective sweeps; and iv) identification of the potentially important candidate genomic regions associated with distinct morphology, disease and other quantitative traits. The implications of HapMap studies will be demonstrated in the later sections of this review.

DESIGN AND DEVELOPMENT OF HIGH-DENSITY SNP ARRAYS

SNP discovery

A very large number of SNPs are essential for the design and construction of arrays.

Different methods and resources can be used for SNP discovery, including analysis of predicted SNPs generated from genome sequencing and HapMap studies, completing reduced representation library (RRL) sequencing, downloading SNP information from dbSNP of NCBI or collections of SNPs from individual research institutes or lab groups.

The completion of whole-genome sequencing and HapMap projects uncovered a large number of genetic variants across the genomes of domestic animals, most of which were SNPs. In the chicken, ~2.8 million SNPs were identified (International Chicken

Polymorphism Map Consortium, 2004). There were more than 2.5 million potential SNPs in the dog genome, with one SNP per 0.9 kb between breeds and one SNP per 1.5 kb within breeds (Lindblad-Toh et al., 2005). In cattle, ~2.2 million draft SNPs were detected with one

SNP per kb (The Bovine HapMap Consortium, 2009). In the horse genome, ~1.1 million draft SNPs were discovered with one SNP per 2 kb (Wade et al., 2009).

Although many SNPs were predicted during genome sequencing projects, SNP prediction software could confuse sequencing errors with true SNPs, meaning further validation is needed. The candidate SNPs for array design should be validated and have high minor allele frequency (MAF) in the testing populations. Matukumalli et al. (2009) found an uneven distribution of SNPs across the genome based on an analysis of cattle draft SNPs.

Additionally, it was determined that the nucleotide conversion rate of SNPs was usually low

and MAF was not estimated accurately because of the limited sample size of animals used in

the HapMap studies. In the commercially released BovineSNP50 array, around three-fifths of

SNPs were from the filtered draft SNPs from genome sequencing (Matukumalli et al., 2009).

Kerstens et al. (2009) used a similar pipeline to obtain 104,525 SNPs from 1.2 Gb of draft swine genome sequence and confirmed that 134 of 163 filtered SNPs were polymorphic in several tested pig populations.

Another effective approach to identify large numbers of candidate SNPs is RRL sequencing, which was first introduced for creating a human SNP map (Altshuler et al.,

2000). This approach could reduce the complexity of the genome by several orders of magnitude, help discover SNPs that are extensively dispersed across the genome, and can even be performed without a priori knowledge of the genome sequence. The RRL sequencing procedure is briefly described in Figure 1.
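The way an RRL reduces genome complexity can be sketched in silico: digest a sequence at every occurrence of a restriction site and keep only fragments within a chosen size window. The sketch below assumes the HaeIII recognition site GGCC with a blunt cut between GG and CC; the toy sequence and the size window are invented for illustration, and real libraries are size-selected from pooled genomic DNA rather than from a single short string.

```python
def digest(sequence, site="GGCC", cut_offset=2):
    """Cut `sequence` at every occurrence of `site`.

    cut_offset is the position within the site where the enzyme cuts
    (2 for a blunt GG^CC cut).
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments


def size_select(fragments, min_len, max_len):
    """Keep fragments whose length falls inside the selected window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Toy sequence with two GGCC sites.
toy_genome = "AAAGGCCTTTTTTGGCCAA"
fragments = digest(toy_genome)
library = size_select(fragments, min_len=6, max_len=12)
fraction_retained = sum(len(f) for f in library) / len(toy_genome)
print(fragments, library, round(fraction_retained, 2))
```

Because the same sites are cut in every animal, the same subset of the genome is sampled repeatedly, so overlapping reads from different individuals can be compared to call SNPs.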

Due to each species’ unique genome sequence, the most suitable restriction enzymes for RRL sequencing are variable. In the human genome, BglII cut sites are commonly distributed (Altshuler et al., 2000). Van Tassell et al. (2008) constructed several RRLs that were generated from DNA of eight commercial dairy and beef breeds that were digested by the HaeIII restriction enzyme, and identified 62,042 putative SNPs by deep-sequencing and filtering procedures. Around two-fifths of the SNPs on the BovineSNP50 array were from the

RRL sequencing approach (Matukumalli et al., 2009). In pigs, Wiedmann et al. (2008) identified 115,572 putative SNPs by sequencing RRLs that were built from seven predominant commercial pig breeds and digested by HaeIII. Amaral et al. (2009) also detected 17,489 pig SNPs by RRL sequencing of pooled DNA from five Large

White×Pietrain crossbred boars digested by the DraI enzyme. Ramos et al. (2009) prepared

19 RRLs from four popular commercial pig breeds and a wild boar using three restriction enzymes (AluI, HaeIII and MspI), and eventually obtained 372,886 high-confidence SNPs. In total, the SNPs obtained from RRL sequencing comprised about

94% of the 64,232 SNPs used in the commercially released PorcineSNP60 array (Ramos et al., 2009).

The dbSNP database of NCBI (http://www.ncbi.nlm.nih.gov/projects/SNP/) is also a

SNP resource for array design. However, the unknown certainty level of each SNP polymorphism and limited genomic distribution of the SNPs might reduce their usefulness and the probability of being selected for a SNP array. On the PorcineSNP60 array, around

5,100 were from dbSNP and other private collections (Ramos et al., 2009).

Illumina’s iSelect technology

Illumina’s BeadArray, based on single-base extension or allele-specific primer extension (http://www.illumina.com), and Affymetrix's GeneChip, based on molecular inversion probe hybridization (www.affymetrix.com), are the two biggest and most competitive SNP chip genotyping platforms. The two platforms take different approaches, but both can perform high-throughput genotyping of large numbers of samples. In comparison to the GeneChip, the BeadArray is cheaper and more flexible in probe design

(Perkel, 2008). Currently, the majority of the commercially released SNP arrays for domestic animals are constructed using the BeadArray platform with Illumina's iSelect Infinium technology.

A bead chip is a micro-electro-mechanical system in which the wells that hold the beads are created by combining photolithography and plasma etching on silicon wafers. The beads are randomly dispersed and assembled into wells on a silicon wafer (Steemers and Gunderson, 2007). The location of each bead on the array can be identified through a decoding process that uses a 29-base tag sequence linked to the bead. Each bead carries a number of 50-mer locus-specific primers following the tag sequence, which anneal to the genomic sequences flanking the target SNPs (Figure 2). After direct hybridization of the genomic DNA to the SNP array probes, each SNP locus is scored by an enzyme-based extension assay using fluorescently labeled nucleotides. The labels are visualized by staining with an immunohistochemistry assay to increase the signal intensity (Steemers and Gunderson, 2007). The two different primer extension assays are allele-specific primer extension (ASPE) and single-base extension (SBE), which are called Infinium I and II assays, respectively. Infinium II can reduce the required number of synthesized beads by nearly half compared with Infinium I, and thus makes the bead chip more economical. Therefore, the probes for the majority of SNPs on the chip follow the Infinium II design (Figure 2).

Criteria for SNP selection

Whole-genome sequencing and HapMap projects provided each species with draft SNPs. Stringent quality control (QC) criteria were then applied to filter these draft SNPs: i) each allele of the SNP must be present in at least two sequence reads; ii) no repetitive elements may occur within 100 bp of the SNP; iii) the SNP must be predicted by a minimum of six sequence reads; and iv) the predicted SNP must not overlap complex regions (e.g. duplicated sequences) (Matukumalli et al., 2009).
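The four QC criteria can be expressed as a simple filter. The following is a minimal sketch, not the pipeline used by the cited authors: the `CandidateSNP` record, its field names and the example values are hypothetical, and criterion iii) is assumed to refer to the total number of reads covering both alleles.

```python
from dataclasses import dataclass


@dataclass
class CandidateSNP:
    """Hypothetical record for one draft SNP (field names are illustrative)."""
    name: str
    reads_allele_a: int       # sequence reads supporting the first allele
    reads_allele_b: int       # sequence reads supporting the second allele
    nearest_repeat_bp: int    # distance to the closest repetitive element
    in_complex_region: bool   # e.g. overlaps a duplicated sequence


def passes_qc(snp, min_reads_per_allele=2, min_total_reads=6, repeat_window_bp=100):
    """Apply criteria i) to iv) from the text to a single candidate SNP."""
    # Assumption: "predicted by six reads" means total reads over both alleles.
    total_reads = snp.reads_allele_a + snp.reads_allele_b
    return (
        snp.reads_allele_a >= min_reads_per_allele       # i)
        and snp.reads_allele_b >= min_reads_per_allele   # i)
        and snp.nearest_repeat_bp > repeat_window_bp     # ii)
        and total_reads >= min_total_reads               # iii)
        and not snp.in_complex_region                    # iv)
    )


# Made-up candidates, each failing at most one criterion.
candidates = [
    CandidateSNP("snp_ok", 4, 3, 250, False),
    CandidateSNP("snp_low_depth", 2, 2, 250, False),
    CandidateSNP("snp_near_repeat", 5, 5, 40, False),
    CandidateSNP("snp_duplicated", 5, 5, 250, True),
]
retained = [snp.name for snp in candidates if passes_qc(snp)]
```

Only the first made-up candidate satisfies all four criteria; the others fail on read depth, repeat proximity and complex-region overlap, respectively.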

Candidate SNPs that passed this preliminary step were considered for placement on the SNP array. Additional criteria were then applied, covering SNP distribution across the genome and the properties of each SNP. An even physical distribution of SNPs across the genome, with reasonable intervals between neighboring SNPs (except on the Y or W chromosomes in mammals and birds, respectively), was prioritized. As for the properties of each SNP, a high MAF determined from sequencing of representative samples, a high quality score and good validation status were required, as was a high design score from Illumina's assay design (Matukumalli et al., 2009; Ramos et al., 2009).

In addition, the bead density, the redundancy of beads per bead type and the final expense (cost-effectiveness) influence the number of SNPs assembled into the commercially released SNP arrays (Steemers and Gunderson, 2007). For several important farm animals including horse, cattle and pig, the number of SNPs in the first generation chips was slightly more than 50K, ensuring that at least 50K SNPs would work. The first version of the canine SNP array contained 20K SNPs, and a high-density chip with 170K SNPs has now been released. A high-density 500K cattle SNP array is also being developed (http://www.illumina.com).

APPLICATIONS OF THE SNP ARRAY

Genomic selection

Genomic selection (also termed genomic prediction or genomic evaluation) is perhaps the most fundamental change to breeding and genetics in agriculture to result directly from the application of SNP arrays. In this respect it differs from human studies, which have mainly concentrated on searching for disease genes or on genealogy. Genomic selection is an advanced form of marker assisted selection (MAS) that uses all markers across the whole genome (Meuwissen et al., 2001; Goddard and Hayes, 2009; Calus, 2010). The MAS strategy has been advocated for the past two decades, but its use in practice has been limited by the lack of genetic markers linked to QTL with significant effects and by the expense of genotyping. Meuwissen et al. (2001) proposed the original concept of genomic selection, i.e., predicting breeding values of animals using information from thousands of SNPs across the genome (genomic estimated breeding value, GEBV), assuming the availability of abundant SNPs scattered throughout the genome and LD between SNPs and QTL. With the new SNP arrays, more SNP effects need to be predicted than there are phenotyped animals available for predicting them. Consequently, Bayesian analysis methods were initially tested to address this problem. Two kinds of Bayesian approaches were developed to predict GEBV using dense SNPs in this pilot study. In both Bayes models, the effect of each SNP was considered to be independent and random, the variance of SNP effects was assumed to be either constant or locus specific, and SNP effects were then estimated by a Bayesian procedure with a prior distribution for this variance. On simulated data, these Bayesian methods had higher prediction accuracy than least squares (LS) and conventional best linear unbiased prediction (BLUP). In recent years, different statistical approaches for genomic selection have been developed, derived from either non-parametric Bayesian models or parametric methods including genomic best linear unbiased prediction (GBLUP) and mixed regression models (Gianola et al., 2006; Aulchenko et al., 2007; Verbyla et al., 2009; Calus, 2010).
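The problem of estimating more SNP effects than there are phenotyped animals can be illustrated with a shrinkage estimator. The sketch below is not one of the Bayesian methods of Meuwissen et al. (2001); it swaps in a simpler ridge-regression (SNP-BLUP style) model with one common shrinkage parameter, solved by Gauss-Seidel iteration. The function name, the genotype coding (-1, 0, 1), the toy data and the value of `lam` are all assumptions made for illustration.

```python
def snp_blup_gauss_seidel(X, y, lam, n_iter=2000):
    """Solve (X'X + lam*I) b = X'y for SNP effects b by Gauss-Seidel.

    X: list of rows (animals) of genotype codes, y: phenotype deviations,
    lam: shrinkage parameter (assumed ratio of residual to SNP variance).
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # y - Xb, with b initialised to zero
    xtx = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Right-hand side with SNP j's current contribution added back.
            rhs = sum(X[i][j] * resid[i] for i in range(n)) + xtx[j] * b[j]
            new_bj = rhs / (xtx[j] + lam)
            delta = new_bj - b[j]
            for i in range(n):
                resid[i] -= X[i][j] * delta
            b[j] = new_bj
    return b


# Toy data: 4 phenotyped animals, 6 SNPs (more effects than records).
X = [
    [1, 0, -1, 1, 0, 1],
    [0, 1, 1, -1, 0, 0],
    [-1, 1, 0, 0, 1, -1],
    [1, -1, 0, 1, -1, 0],
]
y = [1.2, -0.4, -0.9, 0.7]

effects = snp_blup_gauss_seidel(X, y, lam=2.0)
# GEBV of each animal is the sum of its genotype codes times the SNP effects.
gebv = [sum(x * b for x, b in zip(row, effects)) for row in X]
```

Ordinary least squares has no unique solution here because there are six unknowns and four records; the shrinkage term makes the system solvable, which is the role played by the prior on SNP effect variance in the Bayesian methods.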

Based mostly on studies using simulated data (Goddard, 2009; Hayes et al., 2009b), several major factors that influence the accuracy of genomic selection have been recognized: i) the extent of LD between SNPs and QTL; ii) the size of the training population (the individuals both phenotyped and genotyped for building statistical models and predicting SNP effects); iii) the heritability or genetic basis of the analyzed trait; and iv) the distribution of QTL effects. Meuwissen et al. (2001) found that the prediction accuracy of genomic selection could reach 85% when $r^2$ between adjacent SNPs was greater than 0.2, which requires a high density and even distribution of SNPs across the genome. For traits with low heritability that are controlled by many QTL with small effects, a larger training population is necessary (Goddard and Hayes, 2009; Calus, 2010). Given the heritability of the targeted traits and the desired prediction accuracy in a genomic selection program, the number of animals required in the training population can be estimated (Figure 3).

Additionally, other methods have been proposed to improve the accuracy of genomic selection, such as estimation of missing genotypes, distinguishing the actual SNPs in LD with QTL from those only tracing relationships between animals, and developing novel approaches considering dominance and epistatic effects. Due to consistency of LD across populations, the prediction accuracy will likely remain at a high level whenever the training populations are at least partially related to the validation populations (animals for which GEBVs are being predicted without phenotype data), suggesting that SNP effects obtained from crossbred populations are suitable for genomic selection in pure breeds (Ibáñez-Escriche et al., 2009; Toosi et al., 2010).

Another challenge is to carry out genomic selection in livestock at both national and global scales. Genomic selection has been adopted for genetic evaluation of dairy cattle in the United States of America (VanRaden et al., 2009) and is being considered in the International Cattle Genetic Evaluation Project (http://www-interbull.slu.se). With the availability of higher density SNP chips that can help find more common haplotypes between breeds, improved statistical approaches and computer programs, and joint sharing of phenotypes and SNP genotypes among research groups, breeding companies will be able to apply genomic selection for livestock across the globe (VanRaden and Sullivan, 2010).

Genome-wide association studies

Both candidate gene and QTL mapping strategies have been extensively utilized in domestic animals for the discovery of genetic markers suitable for MAS. However, the limitations of these approaches are becoming apparent. The biological mechanisms of quantitative traits and diseases are complicated and are still being explored. Determining candidate genes according to their putative physiological roles is often difficult, and the candidate gene approach may miss novel genes and pathways associated with some traits. The regions with identified QTL are generally large and require further fine mapping, and QTL mapping results are often inconsistent among different resource families (Rothschild et al., 2007). GWAS (also termed whole-genome association studies, WGAS) is one of the most promising approaches to overcome these limitations.

Although GWAS have been carried out in domestic animals using the commercially available SNP arrays, most of them were on disease related traits because case-control study strategies could be easily utilized for association analyses (Karlsson et al., 2007; Andersson, 2009; Feugang et al., 2009; Snelling et al., 2009; Wood et al., 2009; Wilbe et al., 2010). For quantitative traits such as growth rate, lean meat percentage, intramuscular fat content and milk production, some researchers have used single marker mixed models or mixed regression models for association analyses (Abasht et al., 2009; Settles et al., 2009). Other researchers have used the posterior probability derived from a Bayesian approach originally designed for genomic selection (Fan et al., 2009; Gorbach et al., 2009; Onteru et al., 2009), where the SNPs with the highest posterior probability (i.e., the frequency with which a SNP is included in the model for GEBV prediction) are most likely to be linked to QTL (Meuwissen et al., 2001; Fernando and Garrick, 2009; Verbyla et al., 2009). A number of GWAS in several important domestic animals have been completed to date with significant results (Table 3).
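The posterior probability described above, the frequency with which a SNP is included in the model, can be computed directly from sampled inclusion indicators. This is a minimal sketch that assumes the indicators have already been drawn by some sampler; the function name and the toy indicator matrix are hypothetical and do not reproduce the output format of any particular program.

```python
def inclusion_frequency(indicator_samples):
    """Fraction of sampled models in which each SNP was included.

    indicator_samples: one list per sampled model, holding a 1 if the SNP
    was in the model for that sample and a 0 otherwise.
    """
    n_samples = len(indicator_samples)
    n_snps = len(indicator_samples[0])
    return [
        sum(sample[j] for sample in indicator_samples) / n_samples
        for j in range(n_snps)
    ]


# Toy example: 4 sampled models over 3 SNPs.
samples = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
freq = inclusion_frequency(samples)  # [0.75, 0.25, 0.0]

# Rank SNP indices from most to least frequently included.
ranking = sorted(range(len(freq)), key=lambda j: freq[j], reverse=True)
```

In this toy example the first SNP appears in three of the four sampled models and would be the strongest candidate for linkage to a QTL.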

Whole-genome LD patterns

Construction of high-resolution LD maps, calculation of the extent of LD at the population level, and characterization of haplotype block structures are crucial for fine mapping and genomic selection (Georges, 2007; Goddard and Hayes, 2009). In most cases, the extent of LD between loci varies between populations, including lines, breeds and even different populations within a breed (Amaral et al., 2008; The Bovine HapMap Consortium, 2009; Wade et al., 2009; Megens et al., 2010), and this inconsistency between groups of animals may have a significant impact on fine mapping, genomic selection and GWAS.

The extent of LD in a population also plays an important role in helping a researcher decide the SNP density needed for a particular study. Differences in population structure and evolutionary forces affect how much LD exists in a population. For populations with longer range LD, there is less value in moving to a higher density SNP array because most QTL may already be in LD with markers on a smaller array. If LD has a relatively short range, then not all QTL may be in LD with markers on a smaller array, so that use of a larger SNP array may be worth the extra cost. The extent of LD in a given population can be easily calculated from any SNP array study to predict the best array size to use in future studies.
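As an illustration of how the extent of LD can be calculated, the sketch below computes the commonly used $r^2$ statistic for one pair of biallelic SNPs. It assumes phased haplotypes coded as 0/1 pairs; SNP arrays yield unphased genotypes, so in practice a phasing step or a genotype-based estimator would be needed first. The function name and the toy haplotypes are hypothetical.

```python
def ld_r2(haplotypes):
    """r^2 between two biallelic SNPs from phased haplotypes.

    haplotypes: list of (a, b) pairs, each allele coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    return d * d / denom


# Eight toy haplotypes: six carry matching alleles, two are recombinant.
haps = [(1, 1), (1, 1), (0, 0), (0, 0), (1, 0), (0, 1), (1, 1), (0, 0)]
r2 = ld_r2(haps)  # 0.25
```

Averaging such pairwise values within bins of physical distance gives the LD decay curve that guides the choice of array density.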

The findings from the cattle and horse HapMap projects have demonstrated that the decay of LD between SNPs slows beyond 100 kb, and that haplotype blocks are smaller across breeds than within breeds. It has been suggested that ~100K SNPs may be sufficient for association mapping within and across breeds (Amaral et al., 2008; The Bovine HapMap Consortium, 2009; Wade et al., 2009).

Additionally, effective population size can be derived from the extent of LD within a given interval length using $r^2 = 1/(4Nc + 1)$, or $r^2 = 1/(4Nc + 2)$ when mutation is considered in the model, where $N$ is the effective population size $1/(2c)$ generations in the past and $c$ is the recombination rate, measured as the distance in Morgans between the examined markers (Sved, 1971). Although values of $N$ varied between populations, rapid decreases in $N$ were observed in recent generations in the populations examined (de Roos et al., 2008; Kim and Kirkpatrick, 2009; Villa-Angulo et al., 2009; Qanbari et al., 2010b), which implies that domestic animals have undergone inbreeding and extensive selection in the past two centuries, both well known occurrences.
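Rearranging the two expressions above for $N$ gives $N = (1/r^2 - 1)/(4c)$ without mutation and $N = (1/r^2 - 2)/(4c)$ with mutation. The sketch below applies this rearrangement; the function names and the example values of $r^2$ and $c$ are illustrative only.

```python
def effective_population_size(r2, c, with_mutation=False):
    """Solve r2 = 1 / (4*N*c + alpha) for N, with alpha = 1 or 2."""
    alpha = 2 if with_mutation else 1
    if c <= 0 or not 0 < r2 < 1 / alpha:
        raise ValueError("need c > 0 and 0 < r2 < 1/alpha")
    return (1 / r2 - alpha) / (4 * c)


def generations_ago(c):
    """Approximate age, in generations, of the estimate for distance c."""
    return 1 / (2 * c)


# Example: mean r2 of 0.2 for marker pairs 0.01 Morgans (about 1 cM) apart.
n_no_mut = effective_population_size(0.2, 0.01)                   # about 100
n_mut = effective_population_size(0.2, 0.01, with_mutation=True)  # about 75
age = generations_ago(0.01)                                       # about 50
```

Repeating the calculation for marker pairs at different distances traces how the effective population size has changed over time, since closely spaced markers reflect older generations.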

In general, the amount of LD between any two markers decreases as the physical distance between those markers increases. Forces such as selection, however, can cause markers that are far apart physically (or even on different chromosomes) to be in high LD with one another. High LD over long stretches or between unlinked markers complicates fine mapping. One feasible approach for discovering SNPs that are widely applicable for selection is to carry out LD mapping in multiple breeds, so that SNPs in high LD with QTL across populations can confirm the associations (Goddard and Hayes, 2009).

Population genetics

Selective sweeps: Domestic animals have experienced both natural and artificial selection during domestication and breed formation. These selection pressures have led to increased allele frequencies of some mutations in a few specific genomic regions because these mutations made the animals more adaptable or gave them favorable characteristics based on human demands. Over time other polymorphisms may have decreased in frequency or vanished, and a single haplotype containing multiple genes may have become the only one or the most prominent in the population. This has been termed a selective sweep or positive selection (Andersson and Georges, 2004).

Several statistical methods have been proposed for detecting selective sweeps (Sabeti et al., 2007). The integrated haplotype score (iHS), developed from integrated extended haplotype homozygosity (EHH), detects selective sweeps by identifying genomic regions with increased local LD. The fixation coefficient (Fst) can be used to predict selective sweeps by comparing Fst values among populations. The composite likelihood ratio test (CLR) is based on the comparison of the maximum composite likelihoods under models with and without selective sweeps.
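Of the three statistics, Fst is the simplest to illustrate. The sketch below uses a basic heterozygosity-based form, Fst = (H_T - H_S) / H_T, for one biallelic SNP in two equally weighted populations. This is an assumption chosen for clarity; the cited studies may have used other estimators, and the function name and allele frequencies are hypothetical.

```python
def fst_two_populations(p1, p2):
    """Fst = (H_T - H_S) / H_T for one biallelic SNP in two populations.

    p1, p2: frequency of the same allele in each population.
    Assumes equal weighting of the two populations.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)   # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                      # SNP monomorphic in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# Made-up allele frequencies for three SNPs in two breeds.
breed_a = [0.9, 0.3, 1.0]
breed_b = [0.1, 0.3, 0.0]
fst_values = [fst_two_populations(a, b) for a, b in zip(breed_a, breed_b)]
# Approximately [0.64, 0.0, 1.0]
```

SNPs or windows with unusually high values relative to the genome-wide distribution are the candidates for regions that have been under divergent selection between the populations.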

The above methods have been utilized for selective sweep detection in the Bovine HapMap Project (The Bovine HapMap Consortium, 2009). Based on the iHS method, specific haplotype frequencies in the genomic regions containing MSTN (relevant to muscle development), ABCG2 (relevant to milk yield and composition) and KHDRBS3 (relevant to intra-muscular fatness) might have resulted from selective sweeps. Genomic regions relevant to behavior, immune response and feed efficiency were discovered based on Fst estimates (The Bovine HapMap Consortium, 2009). Both iHS and CLR approaches have revealed that one region including SPOK1 was subject to a selective sweep in beef and dairy cattle. A total of 12 putative selective sweep regions associated with residual feed efficiency, beef yield and intra-muscular fatness were discovered when additional data sets were included (Barendse et al., 2009). In addition, a set of genes including GHR, MC1R, FABP3, CLPN3, SPERT, HTR2A5, ABCE1, BMP4 and PTGER2 were possibly subject to selective sweeps (Flori et al., 2009; Qanbari et al., 2010a).

In the previously described chicken HapMap Project, most SNPs were thought to have arisen before domestication. However, Rubin et al. (2010), using massively parallel sequencing, identified a possible selective sweep resulting from domestication and specialization of broiler and layer birds and found one putative region including TSHR, a gene associated with metabolic regulation and photoperiod control of reproduction in vertebrates. The TSHR selective sweep may represent a significant feature of domestic animals, i.e., the absence of the seasonal restriction on reproduction. In broilers, the selective sweep regions contained the genes IGF1, PMCH1 and TBC1D1, which are related to growth, appetite and metabolic regulation.

In pigs, putative selective sweep regions on SSC1 and SSC3 have been observed (Groenen et al., 2010a). The regions containing the genes IGF2, PRLR and GHR had also possibly undergone selection (Andersson and Georges, 2004; Iso-Touru et al., 2009).

Genetic diversity and genetic relationship analyses: Population genetics studies of domestic animals focus on genetic variability within breeds and genetic distances between breeds. Their purposes are to unravel the possible historic events during domestication and breed formation and to assist in preserving the genetic diversity within endangered indigenous breeds. These studies will be helpful for scientific conservation and preservation measures, and for clarifying the population stratification for genomic selection and GWAS.

Genetic relationships among 19 cattle breeds with different geographical distributions were analyzed using 37,470 SNPs during the Bovine HapMap Project (The Bovine HapMap Consortium, 2009). When the population was divided into two groups (K = 2) using Bayesian approaches, the cattle from the taurine and indicine breeds could be distinguished and crossbred populations showed admixture characteristics. Assuming nine groups (K = 9), most of the analyzed cattle breeds could be classified into separate groups. Recently the phylogenetic relationships among 372 animals from 48 cattle breeds were characterized using the BovineSNP50 array. The results were consistent with the biogeography of breeds but also clearly depicted the admixed nature of many populations and revealed pedigree relationships between individuals (Decker et al., 2009).

Kijas et al. (2009) analyzed the genetic relationships among 403 individuals from 23 sheep breeds and 210 individuals from two wild sheep species with 1,536 SNPs. The genetic variability within both African and Asian sheep breeds was lower than that of European breeds, and genetic distances between individuals from African and Asian breeds were smaller than those of European breeds. The genetic relationships among breeds were consistent with the geographical distribution and history of breed formation. Close phylogeographical structure, high genetic similarity and low differentiation were observed in sheep breeds, which was in agreement with the previous findings from other genetic markers.

vonHoldt et al. (2010) examined genetic relationships among 912 dogs from 85 breeds and 225 grey wolves using 48,000 SNPs. Both the neighbor-joining (NJ) clustering tree based on SNP genotypes of individuals and population clustering based on haplotype similarity showed that single breeds could be distinguished from one another and grouped into Asian, Middle Eastern and Northern groups, which was consistent with the history of breed formation. In addition, domestic dogs had a higher proportion of multi-locus haplotypes unique to Middle Eastern grey wolves, suggesting that domestic dogs may have originated from the Middle East instead of the Far East as previously hypothesized.

Breed clustering is not always as successful as it was in the previous studies. Wade et al. (2009) analyzed the genetic relationships between 11 horse breeds using 1,007 SNPs and found that the relationships between the studied breeds could not be clarified. This result may be due to the close relationships among domestic horse breeds.

Muir et al. (2008) examined genetic variability of chickens representing commercial, experimental and standard breeds using 2,551 SNPs. Based on the proportion of missing alleles and inbreeding coefficients, commercial broiler and layer line birds were found to have lost a significant amount of genetic diversity (~50% or more) from ancestral breeds. It was suggested that genetic diversity could be recovered within lines by crossing multiple pure lines from chicken breeding companies.

The high-density SNP array has also been useful in understanding the phylogenetic relationships of domestic animals. The dog was determined to be most closely related to the grey wolf, followed by the coyote, the golden jackal and the Ethiopian wolf (Lindblad-Toh et al., 2005). For pecoran (higher ruminant) species, 17 novel relationships were identified and another 16 previously proposed nodes within the infraorder were confirmed (Decker et al., 2009).

CNV detection

Copy number variation (CNV) refers to a DNA segment that is 1 kb or larger and has variable numbers of copies in comparison with a reference genome. CNVs generally occur in more than 1% of the population, and they have often been found to be associated with specific diseases in humans. The comparison of the fluorescent signal intensity ratios of alleles at each SNP across the genome based on the Illumina BeadChip platform is one approach for CNV identification (http://www.illumina.com). Matukumalli et al. (2009) predicted 79 CNVs in diverse cattle breeds using the BovineSNP50 array, and ten of them were verified by comparative genome hybridization (CGH) array genotyping results. Fan et al. (unpublished data) predicted 12 CNV regions in pigs with the PorcineSNP60 array and found two large CNV regions of interest on SSC14.
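One way to picture intensity-based CNV screening is to convert each SNP's total signal into a log2 ratio against the intensity expected for two copies and then look for runs of consecutive SNPs that deviate in the same direction. The sketch below is a simplified illustration only: the log2 ratio follows the commonly described Log R Ratio convention, but the threshold, the minimum run length, the function names and the toy values are assumptions and not the procedure used in the cited studies.

```python
import math


def log_r_ratio(observed, expected):
    """log2 of observed over expected total signal intensity at one SNP."""
    return math.log2(observed / expected)


def candidate_cnv_runs(lrr, threshold=0.3, min_snps=3):
    """Return (start, end, kind) for runs of consecutive deviating SNPs.

    lrr: log ratios in genome order; end index is inclusive.
    threshold and min_snps are illustrative values, not validated settings.
    """
    runs = []
    start, kind = 0, None
    for i, value in enumerate(list(lrr) + [0.0]):  # sentinel closes last run
        if value > threshold:
            state = "gain"
        elif value < -threshold:
            state = "loss"
        else:
            state = None
        if state != kind:
            if kind is not None and i - start >= min_snps:
                runs.append((start, i - 1, kind))
            start, kind = i, state
    return runs


# Toy profile: three consecutive SNPs with reduced intensity, one isolated gain.
profile = [0.0, 0.05, -0.6, -0.7, -0.5, 0.1, 0.4, 0.0]
runs = candidate_cnv_runs(profile)  # [(2, 4, "loss")]
```

Requiring several adjacent SNPs to agree guards against calling a CNV from a single noisy probe, which is why the isolated value of 0.4 in the toy profile is not reported.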

Other applications

High-density SNP arrays have been used for relationship and paternity testing and tracing the geographic origins of animal products (Fisher et al., 2009; Kijas et al., 2009; Weller et al., 2010). The arrays were also utilized for constructing high-resolution linkage maps, improving physical mapping orders, exploring the potential relationships between genomic sequence features and recombination rates, and carrying out linkage disequilibrium and linkage analysis (LDLA) mapping in particularly interesting regions (Arias et al., 2009; Groenen et al., 2010b).

FUTURE PROSPECTS

Even though SNP arrays have been widely applied in animal breeding and genomics, they still have some limitations with regard to the coverage and annotation of probes on the arrays. The coverage of the arrays for certain species is still low and uneven, and some genomic regions have very few SNPs. Further population information from HapMap studies could potentially solve this issue in combination with new SNP arrays. In addition, some of the currently released commercial SNP arrays still carry a number of unmapped SNPs. For example, ~1,800 SNPs on the BovineSNP50 array were unassigned based on Btau4.0, and ~8,000 SNPs on the PorcineSNP60 array were unmapped based on Sus scrofa build 9. GWAS have uncovered some associations with unmapped SNPs, and a few of them could be localized by LD estimates with the mapped SNPs. Additionally, the physical locations of some mapped SNPs have been corrected with the production of new genome assemblies (Ramos et al., 2009). Therefore, it is necessary to continuously improve the genome assemblies and assign these SNPs to the correct physical positions.

Another issue is the annotation of the findings from SNP arrays. According to the reported GWAS, most of the trait-associated SNPs (TASs) were located in genes without obvious biological significance for the analyzed phenotypes, or they were located in intergenic regions or introns of certain genes. Similarly, in GWAS in humans, the TASs were not always in or near putative candidates relevant to the diseases (Manolio et al., 2009). Furthermore, a statistical summary indicated that TASs were intronic (45%), intergenic (43%), exonic and nonsynonymous (9%), exonic and synonymous (2%), or in 5’ or 3’ untranslated regulatory regions (2%) (Hindorff et al., 2009). These unexpected results may have several explanations: i) the TASs may lie in genes that have not yet been annotated or may be unmapped SNPs, demonstrating that further annotation of the current genome assemblies is necessary; ii) the sample size (especially for lowly heritable traits) and the genetic backgrounds of the studied populations may affect the association analyses, so multiple pure breeds and larger sample sizes may help; iii) the large (~40 kb) average interval between SNPs and the uneven SNP distributions of the current arrays are major limitations for haplotype block analyses and fine mapping, so higher density SNP panels may be worth developing, depending on the extent of LD in the analyzed populations; and iv) the robustness of statistical methods for GWAS could be improved.

As more and more studies using SNP arrays become available, effective storage of the original data and curation of results could become other important issues. With the emergence of large GWAS and population genetics studies, it will be feasible to build databases related to genome variation and/or candidate genes as public repositories, facilitating the comparison of data across studies. In humans, several genome variation and GWAS databases have been developed (http://projects.tcag.ca/variation/; http://www.genome.gov/26525384; https://gwas.lifesciencedb.jp/cgi-bin/gwasdb/gwas_top.cgi). Ogorevc et al. (2009) constructed a gene database on cattle milk production and mastitis traits, but the capabilities of this database are limited. The designed databases should be comprehensive toolkits, interactive with whole-genome sequence, QTL mapping results and as much other related information as possible (Hu Z-L, personal communication). Such comprehensive databases will contribute to better utilization by the community of researchers doing animal breeding and genetics research.

Lastly, statistical analysis of SNP array data still presents a challenge. The large volume of data generated by SNP arrays is computationally demanding and requires more sophisticated statistical models and efficient analytical methods. Statistical methods for genomic selection and GWAS are continually being developed and improved. The approaches derived from different theories and algorithms will certainly impact the accuracy of the analyses. In addition, most early genomic selection studies were performed with simulated data, which are quite different from real data that often have limited sample sizes and may lack detailed pedigree information. Therefore, novel efforts are still needed in quantitative genetics, population genetics, and bioinformatics to develop advanced and efficient statistical approaches that will improve the applications of high-density SNP arrays in animal scientific research and production.

ACKNOWLEDGEMENTS

The authors thank Zhi-Liang Hu for valuable suggestions on manuscript preparation, and thank Drs. Hans Cheng, John McEwan, James Brian and Curt Van Tassell for providing references. The authors appreciate that Drs. Michael E. Goddard and Frank J. Steemers allowed their published figures to be reprinted. Support for this work was provided by the College of Agriculture and Life Sciences, State of Iowa and Hatch funding, and funding from the Ensminger Endowment. Support for Danielle M. Gorbach is provided by USDA National Needs Graduate Fellowship Competitive Grant No. 2007-38420-17767 from the National Institute of Food and Agriculture.

REFERENCES

Abasht, B., E. Sandford, J. Arango, P. Settar, J. E. Fulton et al. 2009. Extent and consistency

of linkage disequilibrium and identification of DNA markers for production and egg

quality traits in commercial layer chicken populations. BMC Genomics 10 (Suppl

2):S2.

Altshuler, D., V. J. Pollara, C. R. Cowles, W. J. Van Etten, J. Baldwin et al. 2000. An SNP

map of the human genome generated by reduced representation shotgun sequencing.

Nature 407:513-516.

Amaral, A. J., H. J. Megens, H. H. D. Kerstens, H. C. M. Heuven, B. Dibbits et al. 2009.

Application of massive parallel sequencing to whole genome SNP discovery in the

porcine genome. BMC Genomics 10:374.

Amaral, A. J., H. J. Megens, R. P. M. A. Crooijmans, H. C. M. Heuven and M. A. M.

Groenen. 2008. Linkage disequilibrium decay and haplotype block structure in the

pig. Genetics 179:569-579.

Andersson, L. and M. Georges. 2004. Domestic-animal genomics: Deciphering the genetics

of complex traits. Nat. Rev. Genet. 5:202-212.

Andersson, L. 2009. Genome-wide association analysis in domestic animals: a powerful

approach for genetic dissection of trait loci. Genetica 136:341-349.

Arias, J. A., M. Keehan, P. Fisher, W. Coppieters and R. Spelman. 2009. A high density

linkage map of the bovine genome. BMC Genet. 10:18.

Aulchenko, Y. S., D. J. de Koning and C. Haley. 2007. Genomewide rapid association using

mixed model and regression: a fast and simple method for genomewide pedigree-

based quantitative trait loci association analysis. Genetics 177:577-585.

Awano, T., G. S. Johnson, C. M. Wade, M. L. Katz, G. C. Johnson et al. 2009. Genome-wide

association analysis reveals a SOD1 mutation in canine degenerative myelopathy that

resembles amyotrophic lateral sclerosis. Proc. Natl. Acad. Sci. USA 106:2794-2799.

Bannasch, D., A. Young, J. Myers, K. Truvé, P. Dickinson et al. 2009. Localization of canine

Brachycephaly using an across breed mapping approach. PLoS ONE 5:e9632.

Barendse, W., B. E. Harrison, R. J. Bunch, M. B. Thomas and L. B. Turner. 2009. Genome

wide signatures of positive selection: The comparison of independent samples and the

identification of regions associated to traits. BMC Genomics 10:178.

Barendse, W., A. Reverter, R. J. Bunch, B. E. Harrison, W. Barris and M. B. Thomas. 2007.

A validated whole-genome association study of efficient food conversion in cattle.

Genetics 176:1893-1905.

Becker, D., J. Tetens, A. Brunner, D. Bürstel, M. Ganter et al. 2010. Microphthalmia in

Texel sheep is associated with a missense mutation in the paired-like homeodomain 3

(PITX3) gene. PLoS ONE 5:e8689.

Brooks, S. A., N. Gabreski, D. Miller, A. Brisbin, H. E. Brown et al. 2010. Whole-genome

SNP association in the horse: Identification of a deletion in Myosin Va responsible

for Lavender Foal Syndrome. PLoS Genet. 6:e1000909.

Cadieu, E., M. W. Neff, P. Quignon, K. Walsh, K. Chase et al. 2009. Coat variation in the

domestic dog is governed by variants in three genes. Science 326:150-153.

Calus, M. P. L. 2010. Genomic breeding value prediction: methods and procedures. Animal

4:157-164.

Charlier, C., W. Coppieters, H. Rollin, D. Desmecht, J. S. Agerholm et al. 2008. Highly

effective SNP-based association mapping and management of recessive defects in

livestock. Nat. Genet. 40:449-454.

Dalrymple, B. P., E. F. Kirkness, M. Nefedov, S. McWilliam, A. Ratnakumar et al. 2007.

Using comparative genomics to reorder the human genome sequence into a virtual

sheep genome. Genome Biol. 8:R152.

de Roos, A. P., B. J. Hayes, R. J. Spelman and M. E. Goddard. 2008. Linkage disequilibrium

and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics

179:1503-1512.

Decker, J. E., J. C. Pires, G. C. Conant, S. D. McKay, M. P. Heaton et al. 2009. Resolving

the evolution of extant and extinct ruminants with high-throughput phylogenomics.

Proc. Natl. Acad. Sci. USA 106:18644-18649.

Drogemuller, C., E. K. Karlsson, M. K. Hytonen, M. Perloski, G. Dolf et al. 2008. A

mutation in hairless dogs implicates FOXI3 in ectodermal development. Science

321:1462.

Eck, S. H., A. Benet-Pagès, K. Flisikowski, T. Meitinger, R. Fries and T. M. Strom. 2009.

Whole genome sequencing of a single Bos taurus animal for single nucleotide

polymorphism discovery. Genome Biol. 10:R82.

Fan, B., S. K. Onteru, D. Garrick, K. J. Stalder and M. F. Rothschild. 2009. A genome-wide

association study for pig production and feet and leg structure traits using the

PorcineSNP60 BeadChip. Pig Genome III Conference, November 2-4, 2009,

Hinxton, Cambridge, UK. Abstract No. 6.

Fernando, R. L. and D. J. Garrick. 2009. GenSel - User manual for a portfolio of genomic

selection related analyses. Animal Breeding and Genetics, Iowa State University,

Ames.

Feugang, J. M., A. Kaya, G. P. Page, L. Chen, T. Mehta et al. 2009. Two-stage genome-wide

association study identifies integrin beta 5 as having potential role in bull fertility.

BMC Genomics 10:176.

Fisher, P. J., B. Malthus, M. C. Walker, G. Corbett and R. J. Spelman. 2009. The number of

single nucleotide polymorphisms and on-farm data required for whole-herd parentage

testing in dairy cattle herds. J. Dairy Sci. 92:369-374.

Flori, L., S. Fritz, F. Jaffrézic, M. Boussaha, I. Gu et al. 2009. The genome response to

artificial selection: A case study in dairy cattle. PLoS ONE 4:e6595.


Georges, M. 2007. Mapping, fine mapping, and molecular dissection of quantitative trait loci

in domestic animals. Annu. Rev. Genomics Hum. Genet. 8:131-162.

Gianola, D., R. L. Fernando and A. Stella. 2006. Genomic-assisted prediction of genetic

value with semiparametric procedures. Genetics 173:1761-1776.

Goddard, M. E. 2009. Genomic selection: prediction of accuracy and maximisation of long

term response. Genetica 136:245-257.

Goddard, M. E. and B. J. Hayes. 2009. Mapping genes for complex traits in domestic animals

and their use in breeding programmes. Nat. Rev. Genet. 10:381-391.

Gorbach, D. M., W. Cai, J. C. M. Dekkers, J. M. Young, D. J. Garrick et al. 2009. Whole-

genome analyses for genes associated with residual feed intake and related traits

utilizing the PorcineSNP60 BeadChip. Pig Genome III Conference. November 2-4.

Hinxton, UK. Abstract No. 11.

Green, E. D. 2001. Strategies for the systematic sequencing of complex genomes. Nat. Rev.

Genet. 2:573-583.

Groenen, M. A. M., A. Amaral, H. J. Megens, G. Larson, A. L. Archibald et al. 2010a. The

Porcine HapMap Project: genome-wide assessment of nucleotide diversity, haplotype

diversity and footprints of selection in the pig. The International Plant & Animal

Genome XVIII Conference. January 9-13, San Diego, California. W609.

Groenen, M. A. M., P. Wahlberg, M. Foglio, H. H. Cheng, H. J. Megens et al. 2010b. A

high-density SNP-based linkage map of the chicken genome reveals sequence

features correlated with recombination rate. Genome Res. 19:510-519.

Hayes, B. J., P. J. Bowman, A. J. Chamberlain, K. Savin, C. P. van Tassell et al. 2009a. A


validated genome wide association study to breed cattle adapted to an environment

altered by climate change. PLoS ONE 4:e6676.

Hayes, B. J., P. J. Bowman, A. J. Chamberlain and M. E. Goddard. 2009b. Genomic

selection in dairy cattle: Progress and challenges. J. Dairy Sci. 92:433-443.

Hindorff, L. A., P. Sethupathy, H. A. Junkins, E. M. Ramos, J. P. Mehta, F. S. Collins and T.

A. Manolio. 2009. Potential etiologic and functional implications of genome-wide

association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106:9362-

9367.

Ibáñez-Escriche, N., R. L. Fernando, A. Toosi and J. C. M. Dekkers. 2009. Genomic

selection of purebred for crossbred performance. Genet. Sel. Evol. 41:12.

International Chicken Genome Sequencing Consortium. 2004. Sequence and comparative

analysis of the chicken genome provide unique perspective on vertebrate evolution.

Nature 432:695-716.

International Chicken Polymorphism Map Consortium. 2004. A genetic variation map for

chicken with 2.8 million single-nucleotide polymorphisms. Nature 432:717-722.

Iso-Touru, T., J. Kantanen, M. H. Li, Z. Gizejewski and J. Vilkki. 2009. Divergent evolution

in the cytoplasmic domains of PRLR and GHR genes in Artiodactyla. BMC Evol.

Biol. 9:172.

Karlsson, E. K., I. Baranowska, C. M. Wade, N. S. Hillbertz, M. C. Zody et al. 2007.

Efficient mapping of mendelian traits in dogs through genome-wide association. Nat.

Genet. 39:1321-1328.

Kerstens, H. H. D., S. Kollers, A. Kommadath, M. del Rosario, B. Dibbits et al. 2009.


Mining for single nucleotide polymorphism in pig genome sequence data. BMC

Genomics 10:4.

Kijas, J. W., D. Townley, B. P. Dalrymple, M. P. Heaton, J. F. Maddox et al. 2009. A

genome wide survey of SNP variation reveals the genetic structure of sheep breeds.

PLoS ONE 4:e4668.

Kim, E. S. and B. W. Kirkpatrick. 2009. Linkage disequilibrium in the North American

Holstein population. Anim. Genet. 40:279-288.

Kirkness, E. F., V. Bafna, A. L. Halpern, S. Levy, K. Remington et al. 2003. The dog

genome: Survey sequencing and comparative analysis. Science 301:1898-1903.

Lindblad-Toh, K., C. M. Wade, T. S. Mikkelsen, E. K. Karlsson, D. B. Jaffe et al. 2005. Genome

sequence, comparative analysis and haplotype structure of the domestic dog. Nature

438:803-819.

Manolio, T. A., F. S. Collins, N. J. Cox, D. B. Goldstein, L. A. Hindorff et al. 2009. Finding

the missing heritability of complex disease. Nature 461:747-753.

Matukumalli, L. K., C. T. Lawley, R. D. Schnabel, J. F. Taylor, M. F. Allan et al. 2009.

Development and characterization of a high density SNP genotyping assay for cattle.

PLoS ONE 4:e5350.

Megens, H. J., R. P. M. A. Crooijmans, J. W. M. Bastiaansen, H. H. D. Kerstens, A. Coster et

al. 2009. Comparison of linkage disequilibrium and haplotype diversity on macro-

and microchromosomes in chicken. BMC Genet. 10:86.

Meuwissen, T. H. E., B. J. Hayes and M. E. Goddard. 2001. Prediction of total genetic value

using genome-wide dense marker maps. Genetics 157:1819-1829.

Muir, W. M., G. K. S. Wong, Y. Zhang, J. Wang, M. A. M. Groenen et al. 2008. Genome-


wide assessment of worldwide chicken SNP genetic diversity indicates significant

absence of rare alleles in commercial breeds. Proc. Natl. Acad. Sci. USA 105:17312-

17317.

Murdoch, B. M., M. L. Clawson, W. W. Laegreid, P. Stothard, M. Settles et al. 2010. A 2cM

genome-wide scan of European Holstein cattle affected by classical BSE. BMC

Genet. 11:20.

Ogorevc, J., A. Razpet and P. Dovc. 2009. Database of cattle candidate genes and genetic

markers for milk production and mastitis. Anim. Genet. 40:832-851.

Onteru, S. K., B. Fan, D. Garrick, K. J. Stalder and M. F. Rothschild. 2009. Whole genome

analyses for pig reproductive traits using the PorcineSNP60 BeadChip. Pig Genome

III Conference. November 2-4, 2009, Hinxton, Cambridge, UK. Abstract No. 5.

Pant, S. D., F. S. Schenkel, C. P. Verschoor, Q. You, D. F. Kelton et al. 2010. A principal

component regression based genome wide analysis approach reveals the presence of a

novel QTL on BTA7 for MAP resistance in Holstein cattle. Genomics 95:176-182.

Parker, H. G., B. M. VonHoldt, P. Quignon, E. H. Margulies, S. Shao et al. 2009. An

expressed fgf4 retrogene is associated with breed-defining chondrodysplasia in

domestic dogs. Science 325:995-998.

Perkel, J. 2008. SNP genotyping: six technologies that keyed a revolution. Nat. Methods

5:447-453.

Pontius, J. U., J. C. Mullikin, D. R. Smith, Agencourt Sequencing Team, K. Lindblad-Toh et

al. 2007. Initial sequence and comparative analysis of the cat genome. Genome Res.

17:1675-1689.

Qanbari, S., E. C. G. Pimentel, J. Tetens, G. Thaller, P. Lichtner et al. 2010a. A genome-wide


scan for signatures of recent selection in Holstein cattle. Anim. Genet. Published

Online doi:10.1111/j.1365-2052.2009.02016.x

Qanbari, S., E. C. G. Pimentel, J. Tetens, G. Thaller, P. Lichtner et al. 2010b. The pattern of

linkage disequilibrium in German Holstein cattle. Anim. Genet. Published Online

doi:10.1111/j.1365-2052.2009.02011.x

Ramos, A. M., R. P. M. A. Crooijmans, N. A. Affara, A. J. Amaral, A. L. Archibald et al.

2009. Design of a high density SNP genotyping assay in the pig using SNPs

identified and characterized by next generation sequencing technology. PLoS ONE

4:e6524.

Rothschild, M. F., Z. L. Hu and Z. H. Jiang. 2007. Advances in QTL mapping in pigs. Int. J.

Biol. Sci. 3:192-197.

Rothschild, M. F., D. M. Gorbach, B. Fan, S. K. Onteru, Z. Q. Du et al. 2010. Applications of

new porcine genomic tools to trait discovery and understanding genomic architecture.

The International Plant & Animal Genome XVIII Conference. January 9-13, San

Diego, California. W614.

Rubin, C. J., M. C. Zody, J. Eriksson, J. R. S. Meadows, E. Sherwood et al. 2010. Whole-

genome resequencing reveals loci under selection during chicken domestication.

Nature 464:587-591.

Sabeti, P. C., P. Varilly, B. Fry, J. Lohmueller, E. Hostetter et al. 2007. Genome-wide

detection and characterization of positive selection in human populations. Nature

449:913-918.

Settles, M., R. Zanella, S. D. McKay, R. D. Schnabel, J. F. Taylor et al. 2009. A whole

genome association analysis identifies loci associated with Mycobacterium avium

subsp. paratuberculosis infection status in US Holstein cattle. Anim. Genet. 40:655-662.

Sherman, E. L., J. D. Nkrumah and S. S. Moore. 2010. Whole genome single nucleotide

polymorphism associations with feed intake and feed efficiency in beef cattle. J.

Anim. Sci. 88:16-22.

Snelling, W. M., M. F. Allan, J. W. Keele, L. A. Kuehn, T. McDaneld, T. P. L. Smith, T. S.

Sonstegard, R. M. Thallman and G. L. Bennett. 2010. Genome-wide association study

of growth in crossbred beef cattle. J. Anim. Sci. 88:837-848.

Steemers, F. J. and K. L. Gunderson. 2007. Whole genome genotyping technologies on the

BeadArray™ platform. Biotechnol. J. 2:41-49.

Sved, J. A. 1971. Linkage disequilibrium and homozygosity of chromosome segments in

finite populations. Theor. Popul. Biol. 2:125-141.

The Bovine Genome Sequencing and Analysis Consortium. 2009. The genome sequence of

Taurine cattle: A window to ruminant biology and evolution. Science 324:522-528.

The Bovine HapMap Consortium. 2009. Genome-wide survey of SNP variation uncovers the

genetic structure of cattle breeds. Science 324:528-532.

Toosi, A., R. L. Fernando and J. C. M. Dekkers. 2010. Genomic selection in admixed and

crossbred populations. J. Anim. Sci. 88:32-46.

Van Tassell, C. P., T. P. L. Smith, L. K. Matukumalli, J. F. Taylor, R. D. Schnabel et al.

2008. SNP discovery and allele frequency estimation by deep sequencing of reduced

representation libraries. Nat. Methods 5:247-252.

VanRaden, P. and P. G. Sullivan. 2010. International genomic evaluation methods for dairy

cattle. Genet. Sel. Evol. 42:7.


VanRaden, P. M., C. P. Van Tassell, G. R. Wiggans, T. S. Sonstegard, R. D. Schnabel et al.

2009. Reliability of genomic predictions for North American Holstein bulls. J. Dairy

Sci. 92:16-24.

Verbyla, K. L., B. J. Hayes, P. J. Bowman and M. E. Goddard. 2009. Accuracy of genomic

selection using stochastic variable selection in Australian Holstein Friesian dairy

cattle. Genet. Res. Camb. 91:307-311.

Villa-Angulo, R., L. K. Matukumalli, C. A. Gill, J. Choi, C. P. Van Tassell and J. J.

Grefenstette. 2009. High-resolution haplotype block structure in the cattle genome.

BMC Genet. 10:19.

vonHoldt, B. M., J. P. Pollinger, K. E. Lohmueller, E. Han, H. G. Parker et al. 2010. Genome-wide

SNP and haplotype analyses reveal a rich history underlying dog domestication.

Nature 464:898-902.

Wade, C. M., E. Giulotto, S. Sigurdsson, M. Zoli, S. Gnerre et al. 2009. Genome sequence,

comparative analysis, and population genetics of the domestic horse. Science

326:865-867.

Weller, J. I., G. Glick, Y. Zeron, E. Seroussi and M. Ron. 2010. Paternity validation and

estimation of genotyping error rate for the BovineSNP50 BeadChip. Anim. Genet.

Published online doi:10.1111/j.1365-2052.2010.02035.x

Wernersson, R., M. H. Schierup, F. G. Jørgensen, J. Gorodkin, F. Panitz et al. 2005. Pigs in

sequence space: A 0.66× coverage pig genome survey based on shotgun sequencing.

BMC Genomics 6:70.

Wiedmann, R. T., T. P. L. Smith and D. J. Nonneman. 2008. SNP discovery in swine by

reduced representation and high throughput pyrosequencing. BMC Genet. 9:81.


Wilbe, M., P. Jokinen, K. Truvé, E. H. Seppala, E. K. Karlsson et al. 2010. Genome-wide

association mapping identifies multiple loci for a canine SLE-related disease

complex. Nat. Genet. 42:250-254.

Wood, S. H., X. Y. Ke, T. Nuttall, N. McEwan, W. E. Ollier et al. 2009. Genome-wide

association analysis of canine atopic dermatitis and identification of disease related

SNPs. Immunogenetics 61:765-772.

Zhao, X., H. T. Blair, S. Onteru, S. A. Piripi, M. F. Rothschild et al. 2010. A genome wide

association study and fine mapping for Chondrodysplasia of Texel sheep. The

International Plant & Animal Genome XVIII Conference. January 9-13, San Diego,

California. P579.


Figure 1. The strategy of reduced representation library (RRL) sequencing for SNP discovery. The details about each procedure can be obtained from Altshuler et al., 2000; Van Tassell et al., 2008; Wiedmann et al., 2008.


Figure 2. Illustration of the Infinium II assay, which was developed based on single-base extension (SBE) for SNP genotyping (Steemers and Gunderson, 2007. Copyright Wiley-VCH Verlag GmbH & Co. KGaA. Reproduced with permission). In this genotyping system, A and T nucleotides are labeled in one color, and C and G in another. The polymorphisms A>T and G>C therefore cannot be detected.
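The two-color readout described in this caption can be expressed as a small check. The following is a minimal sketch, assuming only the color grouping stated in the caption (A and T in one channel, C and G in the other); the channel labels and the function name `is_detectable` are illustrative and are not part of any Illumina software.

```python
# Dye channel per nucleotide, following the caption: A and T share one color,
# C and G share the other. The channel labels themselves are placeholders.
CHANNEL = {"A": "color_1", "T": "color_1", "C": "color_2", "G": "color_2"}


def is_detectable(allele_1: str, allele_2: str) -> bool:
    """Return True when the two alleles are read in different color channels."""
    return CHANNEL[allele_1.upper()] != CHANNEL[allele_2.upper()]


# Alleles that fall in the same channel cannot be told apart by color alone.
for a, b in [("A", "G"), ("C", "T"), ("A", "T"), ("G", "C")]:
    status = "detectable" if is_detectable(a, b) else "not detectable"
    print(f"{a}>{b}: {status}")
```

Under this grouping, A>G and C>T are distinguishable, whereas A>T and G>C are not, which matches the limitation noted in the caption.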


Figure 3. Estimation of the number of individuals required in a reference population (Goddard and Hayes, 2009. Copyright Nature Publishing Group. Reproduced with permission). a) Number of individuals required in a reference population to obtain an accuracy of 0.7 for genomic estimated breeding values (GEBVs). The number of animals required decreases as the heritability of the trait increases. b) Accuracy of the predicted GEBVs for individuals without genotypes in a validation population, assuming Ne = 100. Prediction accuracy increases with the number of individuals in the reference population.


Table 1. A summary of the sequenced whole genomes of important domestic animals

| Species (Latin name) | Sequenced animal | Sequencing strategy (Fold coverage) | Genome length (Assembly) | No. of coding genes* | Sequencing organization | Release year | URL | Reference |
|---|---|---|---|---|---|---|---|---|
| Chicken (Gallus gallus) | A female inbred red jungle fowl | Whole-genome shotgun/BAC and other clones (6.6×) | 1.05 Gb (WASHUC2) | 16,450 | Washington University Genome Sequencing Center | 2004 | http://genome.wustl.edu/genomes/view/gallus_gallus ; http://www.ensembl.org/Gallus_gallus/Info/Index/ | International Chicken Genome Sequencing Consortium. 2004 |
| Dog (Canis familiaris) | A male poodle | Whole-genome shotgun (1.5×) | ~2.3-2.47 Gb | ~18,473-24,567 | The Institute for Genomic Research/The Center for Advancement of Genomics | 2003 | http://www.ncbi.nlm.nih.gov/nuccore/36796739?report=genbank | Kirkness et al. 2003 |
| Dog (Canis familiaris) | A female boxer | Whole-genome shotgun/BAC and other clones (7.5×) | 2.38 Gb (CanFam2.0) | 15,900 | Broad Institute/MIT Center for Genome Research | 2005 | http://www.broadinstitute.org/mammals/dog ; http://www.ensembl.org/Canis_familiaris/Info/Index/ | Lindblad-Toh et al. 2005 |
| Bovine (Bos taurus) | A Hereford cow | Whole-genome shotgun/BAC and other clones (7.1×) | 2.91 Gb (Btau4.0) | 20,684 | Baylor HGSC | 2009 | http://genomes.arc.georgetown.edu/drupal/bovine ; http://www.hgsc.bcm.tmc.edu/project-species-m-Bovine.hgsc?pageLocation=Bovine ; http://www.ensembl.org/Bos_taurus/Info/Index/ | The Bovine Genome Sequencing and Analysis Consortium. 2009 |
| Horse (Equus caballus) | A female thoroughbred | Whole-genome shotgun/BAC and other clones (6.8×) | 2.47 Gb (EquCab 2) | 17,254 | Broad Institute/MIT Center for Genome Research | 2009 | http://www.broadinstitute.org/mammals/horse ; http://www.ensembl.org/Equus_caballus/Info/Index | Wade et al. 2009 |
| Pig (Sus scrofa) | Blood samples from five breeds** | Whole-genome shotgun (0.66×) | ~2.1 Gb | - | The Sino-Danish pig genome sequencing project | 2005 | http://www.piggenome.dk/ | Wernersson et al. 2005 |
| Pig (Sus scrofa) | A single Duroc sow | Minimal tile-path BAC by BAC (6×) | 2.26 Gb (Sscrofa9) | 12,678 | Wellcome Trust Sanger Institute | 2009 | http://www.piggenome.org/ ; http://www.sanger.ac.uk/Projects/S_scrofa/ ; http://www.ensembl.org/Sus_scrofa/Info/Index/ | - |
| Sheep (Ovis aries) | Blood samples from six breeds*** | Whole-genome shotgun (3×) | 2.78 Gb (OAR1.0) | - | AgResearch/Baylor HGSC/CSIRO/University of Otago | 2008 | http://www.sheephapmap.org/ ; http://www.livestockgenomics.csiro.au/sheep/ ; https://isgcdata.agresearch.co.nz/ | - |
| Cat (Felis catus) | A female Abyssinian cat | Whole-genome shotgun (1.87×) | 1.64 Gb (CAT) | 13,271 | Agencourt Bioscience/Broad Institute | 2006 | http://www.ensembl.org/Felis_catus/Info/Index/ | Pontius et al. 2007 |
| Rabbit (Oryctolagus cuniculus) | A female Thorbecke New Zealand White rabbit | Whole-genome shotgun (7×) | 2.67 Gb (OryCun2) | 14,346 | Broad Institute | 2009 | http://www.ensembl.org/Oryctolagus_cuniculus/Info/Index/ ; http://www.broadinstitute.org/science/projects/mammals-models/rabbit/rabbit-genome-sequencing-project | - |
| Turkey (Meleagris gallopavo) | A female unknown Turkey | BAC/other large clone shotgun (-) | 1.08 Gb (UMD2) | 11,145 | Virginia Bioinformatics Institute/USDA Beltsville/University of Maryland | 2009 | http://www.ensembl.org/Meleagris_gallopavo/Info/Index/ | - |

* Including known and projected protein-coding genes (http://www.ensembl.org/info/about/species.html, up to May 1, 2010).
** ErHuaLian, Duroc, Landrace, Yorkshire and Hampshire.
*** Romney, Texel, Scottish Blackface, Merino, Poll Dorset and Awassi.


Table 2. Illumina's BeadChips developed for important domestic animals

| Species | BeadChip name | No. of SNPs (Approximate) | No. of mapped SNPs (Assembly) | Average interval between SNPs (kb) | Average MAF across tested populations | Release status |
|---|---|---|---|---|---|---|
| Chicken | Multiple chips* | - | - | - | - | Open with restriction |
| Dog | CanineSNP20 | 22,362 | 22,000 (CanFam2.0) | 125 | 0.27 | Commercially available |
| Dog | CanineHD | 170,000 | 170,000 (CanFam2.0) | 14.3 | 0.23 | Commercially available |
| Cattle | BovineSNP50 | 54,001 | 52,255 (Btau4.0) | 51.5 | 0.25 | Commercially available |
| Cattle | BovineHD | >500,000 | - | - | - | Being developed |
| Cattle | Bovine3K** | 3,000 | 3,000 (Btau4.0) | - | - | Being developed |
| Horse | EquineSNP50 | 54,602 | 54,602 (EquCab2.0) | 43.2 | 0.21 | Commercially available |
| Pig | PorcineSNP60 | 64,232 | 55,446 (Sscrofa9) | 40.7 | 0.27 | Commercially available |
| Sheep | OvineSNP50 | 54,241 | - | 46 | ~0.3 | Commercially available |

* Multiple chips were produced, including a 60K SNP array. ** Selected from SNPs on the BovineSNP50, with the potential use for selecting breeding cattle prior to purchase in the dairy industry.


Table 3. Genome-wide association studies with reported candidate genes in domestic animals

| Species | Population (Size) | Phenotype | Analysis method (Software) | Candidate gene(s) | Replication | Reference |
|---|---|---|---|---|---|---|
| Dog | Boxer (19) | White coat color | Case-control (Plink) | MITF | Yes | Karlsson et al., 2007 |
| Dog | Rhodesian Ridgeback (21) | Ridgeless | Case-control (Plink) | FGF3, FGF4, FGF19 | No | Karlsson et al., 2007 |
| Dog | Pembroke Welsh Corgi (55) | Degenerative myelopathy | Case-control (Plink) | SOD1 | Yes | Awano et al., 2009 |
| Dog | Chinese Crested dogs (19) | Canine ectodermal dysplasia | Case-control (Plink) | FOXI3 | Yes | Drogemuller et al., 2008 |
| Dog | Golden Retriever (48) | Atopic dermatitis | Case-control (Plink) | RAB3C, RAB7A, SORCS2 etc. | Yes | Wood et al., 2009 |
| Dog | ~80 breeds (>1,000) | Coat phenotypes | Case-control (Plink) | RSPO2, FGF5, KRT71 etc. | Fine mapping | Cadieu et al., 2009 |
| Dog | 76 breeds (835) | Chondrodysplasia | Case-control (Plink) | FGF4 | Fine mapping | Parker et al., 2009 |
| Dog | Multiple breeds (51) | Brachycephalic head type | Case-control (Plink) | THBS2, SMOC-2 | Fine mapping | Bannasch et al., 2009 |
| Dog | Unknown population (138) | Systemic lupus erythematosus | Case-control (Plink) | PPP3CA, HOMER2, PTPN3 etc. | Yes | Wilbe et al., 2010 |
| Cattle | Multiple breeds (~1,500) | Efficient food conversion | Linear regression | PLA2G5, ATP1A1, DAG1 etc. | Yes | Barendse et al., 2007 |
| Cattle | Belgian Blue (26) | Congenital muscular dystony 1 (CMD1) | Case-control (ASSHOM/ASSIST) | ATP2A1 | Fine mapping | Charlier et al., 2008 |
| Cattle | Belgian Blue (31) | Congenital muscular dystony 2 (CMD2) | Case-control (ASSHOM/ASSIST) | SLC6A5 | Fine mapping | Charlier et al., 2008 |
| Cattle | Italian Chianina (12) | Ichthyosis fetalis | Case-control (ASSHOM/ASSIST) | ABCA12 | Fine mapping | Charlier et al., 2008 |
| Cattle | Holstein cows (245) | Johne's disease (JD) | Case-control (Plink) | EDN2, SOD1, PARP1 etc. | No | Settles et al., 2009 |
| Cattle | Black/Red Angus (76) | Black/Red coat color | Case-control (Plink) | MC1R | No | Matukumalli et al., 2009 |
| Cattle | Holstein bulls (5,360) | Dairy-related traits | Linear regression/Bayesian methods | DGAT1, ABCG2, Integrin β2 etc. | No | Cole et al., 2009 |
| Cattle | Holstein bulls (20) | Bull fertility | Two stage phases/Linear regression | ITGB5 | Yes | Feugang et al., 2009 |
| Cattle | Holstein Friesian cows (798) | Sensitivity of milk production to environment | Bayesian method | FGF4, G3PD etc. | Yes | Hayes et al., 2009a |
| Cattle | Multiple breeds/crossbreed (~3,000) | Body weight | Linear regression | SPP1, NCAPG etc. | No | Snelling et al., 2010 |
| Cattle | Holstein cows (481/343) | Bovine spongiform encephalopathy (BSE) | Sib-TDT/Case-control (Plink) | ANKS1B, B3GALT1, EXT1 etc. | No | Murdoch et al., 2010 |
| Cattle | Multiple breeds/crossbreed (464) | Feed intake and feed efficiency | Linear regression | FNDC3B, RUFY3, RAI etc. | No | Sherman et al., 2010 |
| Cattle | Unknown population (232) | Johne's disease (JD) | Principal component regression | TUBA3D, CCDC59, TMTC2 etc. | No | Pant et al., 2010 |
| Pig | Yorkshire (716) | Average daily gain | Bayesian method (GenSel) | MC4R etc. | No | Gorbach et al., 2009 |
| Pig | Large White/Large White×Landrace (815) | Back fat | Bayesian method (GenSel) | MC4R, CHCHD3, ATP6V1H etc. | No | Fan et al., 2009 |
| Pig | Large White/Large White×Landrace (815) | Loin muscle area | Bayesian method (GenSel) | BMP2, IGF2, FST etc. | No | Fan et al., 2009 |
| Pig | Large White/Large White×Landrace (815) | Leg action | Bayesian method (GenSel) | HOXA, TWIST1, SP4 etc. | No | Fan et al., 2009 |
| Pig | Large White/Large White×Landrace (683) | Reproduction | Bayesian method (GenSel) | MEF2C, PTX3, ITG6 etc. | No | Onteru et al., 2009 |
| Pig | Large White/Large White×Landrace (683) | Sow longevity | Bayesian method (GenSel) | ANXA6, ZIC3, ZIC5 | No | Onteru et al. (personal communication) |
| Sheep | Texel (23) | Chondrodysplasia | Identity by descent (IBD) mapping | CADPS2, SLC13A1, NDUFA5 etc. | No | Zhao et al., 2010 |
| Sheep | Texel (46) | Microphthalmia | Case-control (Plink) | PITX3 | Fine mapping | Becker et al., 2010 |
| Horse | Arabian (36) | Lavender Foal Syndrome | Case-control (R) | MYO5A | Fine mapping | Brooks et al., 2010 |


APPENDIX D. LARGE-SCALE SNP ASSOCIATION ANALYSES OF RESIDUAL FEED INTAKE AND ITS COMPONENT TRAITS IN PIGS

A paper published in the Proceedings of the 9th World Congress on Genetics Applied to Livestock Production

D.M. Gorbach, W. Cai, J.C.M. Dekkers, J.M. Young, D.J. Garrick, R.L. Fernando, M.F. Rothschild

Introduction

The high genetic correlation between growth and feed intake has increased feed requirements as pigs have been selected for increased growth rate. The largest variable cost in swine production today is feed. Boggess et al. (2009) claimed that around $500 million could be saved annually by swine producers in the US by reducing the average feed:gain ratio from 2.75 to 2.45. Significant variability exists between pigs in the amount of feed intake required to achieve the same rate of growth. Residual feed intake (RFI) is a measure of the difference between what an animal actually consumes and the average amount of feed required for that animal's levels of maintenance and growth. It is believed that animals can be successfully selected for both increased growth and reduced RFI, to reduce the feed cost per unit of gain.

Iowa State University has selected pigs for decreased RFI over six generations. A single foundation population of Yorkshire pigs was divided into two lines by randomly splitting litters to form a select line for low RFI and a randomly selected control line (Cai et al., 2008). Following five generations of random selection, the control line was subsequently selected for increased RFI. Phenotypic RFI was calculated by fitting a quadratic random regression equation as

\[
\mathrm{RFI} = \mathrm{ADFI} - \left( b_{1i}\,\mathrm{onTestWeight} + b_{2i}\,\mathrm{offTestWeight} + b_{3i}\,\mathrm{midTestWeight} + b_{4i}\,\mathrm{ADG} + b_{5i}\,\mathrm{offTestBF} \right),
\]

where ADFI is average daily feed intake, ADG is average daily gain, BF is backfat, and the regression coefficients (b's) were generation and line dependent. An animal model was used to estimate EBV from the phenotypic RFI observations (Cai et al., 2008). Estimated heritability for RFI calculated in this manner was 0.29, and RFI explained 34% of phenotypic variation in ADFI (Cai et al., 2008). These results indicate that financial progress could be made by selecting for RFI, if this could be done using SNPs without the expense of collecting feed intake data.
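As a minimal sketch of the RFI expression above, the function below subtracts the expected feed intake from the observed ADFI for one animal. The coefficient values, the units, and the names `TestRecord` and `phenotypic_rfi` are hypothetical and chosen only for illustration; in the study the coefficients were estimated by random regression and differed by generation and line.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TestRecord:
    """Test-period measurements for one pig (units are assumed, not from the study)."""
    adfi: float             # average daily feed intake, assumed kg/day
    on_test_weight: float   # weight at start of test, assumed kg
    off_test_weight: float  # weight at end of test, assumed kg
    mid_test_weight: float  # weight at mid-test, assumed kg
    adg: float              # average daily gain, assumed kg/day
    off_test_bf: float      # off-test backfat, assumed mm


def phenotypic_rfi(rec: TestRecord, b: Sequence[float]) -> float:
    """Observed ADFI minus the intake expected from weights, gain and backfat.

    b holds (b1, ..., b5) for the animal's generation-by-line class.
    """
    expected_intake = (
        b[0] * rec.on_test_weight
        + b[1] * rec.off_test_weight
        + b[2] * rec.mid_test_weight
        + b[3] * rec.adg
        + b[4] * rec.off_test_bf
    )
    return rec.adfi - expected_intake


# Hypothetical coefficients and record, for illustration only.
b_example = (0.0, 0.0, 0.01, 1.0, 0.02)
pig = TestRecord(adfi=2.0, on_test_weight=40.0, off_test_weight=115.0,
                 mid_test_weight=70.0, adg=0.8, off_test_bf=15.0)
print(round(phenotypic_rfi(pig, b_example), 3))
```

A positive value means the animal ate more than expected for its size, growth and fatness; a negative value means it ate less.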

Material and methods

Genotyping. Tail samples were collected from each animal at birth and used for DNA isolation with the Qiagen (Valencia, CA, USA) DNeasy blood & tissue kit. A total of 730 animals were selected for genotyping from generations 0, 4, 5, and 6 of the Iowa State University RFI selection lines (Figure 1). Genotyped animals included 716 with RFI data, of which 329 were control and 387 were select animals. GeneSeek Inc. (Lincoln, NE, USA) completed the genotyping with the Illumina (San Diego, CA, USA) PorcineSNP60 BeadChip.

Phenotyping. Feed intake was measured on all animals using electronic FIRE feeders (Osborne, KS, USA) donated by PIC (Hendersonville, TN, USA), as described in Cai et al. (2008). Animals were weighed at least every two weeks to compute ADG. At approximately 115 kg, 10th-rib BF was evaluated with an Aloka ultrasound machine (Corometrics Medical Systems Inc., Wallingford, CT, USA).

Statistical analyses. Quality control included the removal of all single nucleotide polymorphisms (SNPs) that were fixed in the entire population or had a quality control (QC) score less than 0.4 in more than 20% of the population. A total of 55,533 SNPs remained for analysis. Bayes C model averaging, as implemented in GenSel (http://bigs.ansci.iastate.edu), was used for data analyses. The regression model used was $Y = X\beta + Zu + e$, where $X$ is an incidence matrix for fixed effects and $Z$ is a matrix of SNP genotypes fitted as random effects. Fixed effects included line, sex, on-test group, pen fitted within group, and on-test age as a covariate, except for BF, where on-test age was replaced with off-test weight. The prior probability that a SNP in $Z$ has zero effect was set to 0.995, corresponding to about 300 non-zero SNP effects fitted in any particular realization of the Markov chain Monte Carlo (MCMC) sampler used for the Bayesian analysis. Following a 10,000-iteration burn-in period, 40,000 MCMC iterations were run. Results were obtained in the form of a post-burn-in posterior distribution for the effect of every SNP fitted simultaneously with other informative SNPs. The posterior mean effect of each SNP across the chains was used to predict the genomic breeding value of every chromosomal fragment consisting of 5 contiguous loci (5-SNP overlapping windows). Each such window's contribution to the additive genetic variance in the population was then derived, a statistic that is a multi-locus analog of $2pq\alpha^2$, the gene-frequency-specific contribution to genetic variance of the substitution effect $\alpha$ of a single locus. That variance was divided by an estimate of the total genetic variance.
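The window statistic described in this paragraph can be sketched as follows. This is an illustrative sketch, not GenSel's implementation: it assumes windows slide one SNP at a time, uses the population variance of the summed marker effects across animals as the window's genetic variance, and takes the variance of the whole-genome sum as the estimate of total genetic variance. The function name `window_variance_proportions` and the toy data are hypothetical.

```python
from statistics import pvariance


def window_variance_proportions(genotypes, effects, window=5):
    """Proportion of total genetic variance attributed to each sliding window.

    genotypes: one list per animal of allele counts (0, 1 or 2) per SNP.
    effects:   posterior mean substitution effect of each SNP.
    """
    # Whole-genome breeding value of each animal and its variance (assumed
    # here to serve as the estimate of total genetic variance).
    total_gebv = [sum(z * a for z, a in zip(row, effects)) for row in genotypes]
    total_var = pvariance(total_gebv)

    proportions = []
    for start in range(len(effects) - window + 1):
        stop = start + window
        # Breeding value of the chromosomal fragment covered by this window.
        window_gebv = [
            sum(z * a for z, a in zip(row[start:stop], effects[start:stop]))
            for row in genotypes
        ]
        proportions.append(pvariance(window_gebv) / total_var)
    return proportions


# Toy data: 4 animals x 6 SNPs; only the last SNP has a non-zero effect.
toy_genotypes = [
    [0, 1, 2, 0, 1, 0],
    [1, 1, 0, 2, 1, 1],
    [2, 0, 1, 1, 0, 1],
    [1, 2, 1, 0, 2, 2],
]
toy_effects = [0.0, 0.0, 0.0, 0.0, 0.0, 1.5]
print(window_variance_proportions(toy_genotypes, toy_effects))

# Single-locus check: with genotype frequencies 1/4, 1/2, 1/4 (p = q = 0.5),
# the variance of z * alpha equals 2pq * alpha^2.
p, alpha = 0.5, 2.0
hwe_genotypes = [0, 1, 1, 2]
single_locus_var = pvariance([z * alpha for z in hwe_genotypes])
print(single_locus_var, 2 * p * (1 - p) * alpha ** 2)

# Expected number of non-zero SNP effects under the stated prior.
print((1 - 0.995) * 55533)
```

In the toy data the first window contains no SNP with an effect and the second contains the only one, so the two windows account for none and all of the genetic variance, respectively. The last line gives roughly 278 SNPs, consistent with the "about 300" stated in the text.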

The most significant regions of the genome for RFI, ADFI, ADG, and BF were examined for genes based on build 9 of the porcine genome. Gene positions were taken from Ensembl (www.ensembl.org), and candidate genes were identified by proximity to the most informative individual SNPs or 5-SNP windows and by gene function.

Results and discussion

Figure 2 shows SNP association results for RFI, ADFI, ADG, and BF. Sets of 250-300 markers explained 33% of phenotypic variation for RFI. This number was close to the estimated heritability (0.29; Cai et al., 2008) plus the litter variance of 4% (Bunter et al., 2010). In the current study, litter was not fitted because few families had more than one genotyped offspring. For ADFI, ADG, and BF, 48%, 43%, and 69%, respectively, of the phenotypic variation was explained by sets of 250-300 markers. Heritabilities for these traits were estimated to be 0.51, 0.42, and 0.68, respectively, in this population (Cai et al., 2008), corresponding closely with the proportions of phenotypic variance explained by markers in these analyses.

Proportions of genetic variance displayed in plots on the left side of Figure 2 reflect the genetic variance explained by single markers divided by the overall genetic variance of the trait explained by the sets of 250-300 markers. These values would be estimates of the true genetic variance accounted for by the SNP if the SNP were in linkage equilibrium with the other SNPs. The existence of linkage disequilibrium (LD) reduces the genetic variance explained by each SNP, as multiple SNPs share the effect of each QTL due to LD. Thus, SNP windows more accurately capture the contributions of QTL to genetic variance.
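A toy calculation illustrates why single-SNP variances understate a QTL when markers are in LD. The sketch below assumes an extreme, hypothetical case: two SNPs in complete LD (identical genotypes) with the QTL effect split equally between them. Real marker data would show partial LD and unequal splits.

```python
from statistics import pvariance

qtl_effect = 1.0
# Allele counts at SNP 1 for eight animals; SNP 2 is assumed identical (complete LD).
genotypes = [0, 1, 1, 2, 0, 2, 1, 1]
# Assumed equal split of the QTL effect between the two linked SNPs.
snp_effects = (qtl_effect / 2, qtl_effect / 2)

# Variance attributed to one SNP when considered on its own.
var_snp1 = pvariance([z * snp_effects[0] for z in genotypes])
# Variance attributed to the window containing both SNPs.
var_window = pvariance([z * snp_effects[0] + z * snp_effects[1] for z in genotypes])

ratio = var_snp1 / var_window
print(ratio)  # prints 0.25
```

Each SNP alone accounts for a quarter of the window variance, and the two single-SNP variances together for only half of it, because the covariance between the linked markers is ignored when variances are computed one marker at a time.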

The largest SNP effect and largest SNP window effects for RFI were located on S. scrofa chromosome (SSC) 2 near 32 megabases (Mb) [coordinate position]. None of the other traits analyzed had large SNP effects in this region of SSC2. Some of the largest genetic effects for ADFI were near MC4R on SSC1 and around 49 Mb [coordinate position] on SSC16 near FGF18. Some of the largest effects for ADG were near 17 Mb [coordinate position] on SSC6 and near MC4R on SSC1. Finally, for BF, the largest effect was from an unmapped SNP that lacked high LD with any currently mapped SNPs. The largest effect on BF from mapped SNPs was around 143 Mb [coordinate position] on SSC13. The SNPs with large effects on RFI have potential to be used in marker-assisted selection (MAS) to reduce the feed intake requirements of pigs without negatively impacting other production traits. Such a reduction would be of considerable economic benefit to hog producers.

Conclusions

This study has identified SNPs that may be useful in marker-assisted selection to predict

ADFI and/or RFI without the expense of gathering feed intake data in pigs. That information

could be used in index selection approaches to simultaneously improve growth and reduce

feed costs.

Acknowledgements

Funding was provided by the Iowa Agricultural Home Economics Experiment Stations, State of Iowa and Hatch funding, and the National Pork Board. Support for D.M. Gorbach is provided by USDA National Needs Graduate Fellowship Competitive Grant No. 2007-38420-17767 from the National Institute of Food and Agriculture.

References

Boggess, M. (2009). Pig Genome III Conference.

Bunter, K.L., Cai, W., Johnston, D.J., et al. (2010). J. Anim. Sci. (in press).

Cai, W., Casey, D.S., and Dekkers, J.C.M. (2008). J. Anim. Sci., 86:287–298.


Full sibs

Figure 1: Generations and parities (P1/P2) of pigs in the ISU selection project with genotyped parities depicted in red.


Figure 2: Proportion of genetic variance explained by each marker or window of 5 consecutive markers in the genome for each trait. Panels: (a) individual SNP effects on RFI; (b) SNP windows' effects on RFI; (c) individual SNP effects on ADFI; (d) SNP windows' effects on ADFI; (e) individual SNP effects on ADG; (f) SNP windows' effects on ADG; (g) individual SNP effects on BF; (h) SNP windows' effects on BF. Each chromosome is a different color with SSC1 on the left, SSC18 in red on the right, followed by SSCX in green and unmapped markers in blue. Note: y-axes differ in scale.


ACKNOWLEDGEMENTS

I would like to thank my major professor, Max Rothschild, who has been more than just an academic advisor. You have advised me on many aspects of my life and helped me to make some major life decisions, whether you realized it at the time or not. I have been extremely grateful for your constant belief in me and encouragement. You were and continue to be a great advisor and friend.

To my family, thank you for always supporting me in all of my endeavors. My parents, Rita and Cromwell Bowen, inspired me to want to learn from a very early age and always provided the necessary tools for advancing my education. Thank you for always helping out immediately whenever I say I want or need something; I couldn’t ask for better parents. To my brother, David, you taught me how to get along with anyone, which I really appreciate. I always know I can turn to you for help, and you’ll be there for me. To my husband, Sasha, thank you for tolerating my ups and downs during these last few years; you are always able to give me useful insights on my situation. Thank you for loving and supporting me continuously.

To my friends, thank you for the great conversations, the encouragement, and the breaks from studying. There are far too many of you to specify here, which I apologize for, but I would like to single out a few of my very close graduate school friends. Erin, thank you for always listening to my joys and concerns; it was wonderful to teach the equine science course with you and share horse stories outside of class. Ceren, you are a unique individual; I loved discussing research with you and feel privileged to call you my friend. To the Rothschild lab members, thank you for the personal, professional, and scientific advice and for the fun times together.


To the professors at Iowa State and Cornell College and the teachers at Rivermont Collegiate, thank you for sharing your wisdom with me. I consider many of you to be more than just teachers and advisors, but also great friends. Thank you for always encouraging me to try to reach my full potential and for helping me to determine what that potential might be.

To the countless others who have helped me through this journey, thank you. I greatly appreciate everything you have done for me, even if I did not thank you properly at the time. I know I would not have been able to get where I am today without all of you.