Iowa State University Capstones, Theses and Dissertations

Graduate Theses and Dissertations

2021

A well-rounded toolbox: Multiple approaches of animal breeding and genetics to improve livestock production, conservation and food security.

Josue Chinchilla Vargas, Iowa State University


Recommended Citation Chinchilla Vargas, Josue, "A well-rounded toolbox: Multiple approaches of animal breeding and genetics to improve livestock production, conservation and food security." (2021). Graduate Theses and Dissertations. 18476. https://lib.dr.iastate.edu/etd/18476

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository.

A well-rounded toolbox: Multiple approaches of animal breeding and genetics to improve livestock production, conservation, and food security.

by

Josué Chinchilla-Vargas

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Animal Science

Program of Study Committee:
Max F. Rothschild, Co-major Professor
Kenneth J. Stalder, Co-major Professor
Laura L. Greiner
James E. Koltes
Alejandro Ramirez

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University

Ames, Iowa

2021

Copyright © Josué Chinchilla-Vargas, 2021. All rights reserved.

DEDICATION

This work and all the effort behind it is dedicated to my wife, my family, friends and mentors that supported me and pushed me to achieve my goals. It is also dedicated to those that left us during this time.

In memoriam:

Digna Vindas Vindas

Josefa Adelia Claudina “Cuya” Quesada Calderon

Lilliam “Lili” Vargas Vindas


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
ABSTRACT

CHAPTER 1. GENERAL INTRODUCTION
    Dissertation organization
    Literature cited

CHAPTER 2. REVIEW OF LITERATURE
    Livestock and food security
    Indicator traits in animal breeding
    A historical overview of animal breeding and the role of genomics in genetic selection
    SNP genotyping and genome sequencing
    Genome-wide association studies (GWAS)
    Signatures of selection
    The era of precision livestock farming and big data
    Literature cited

CHAPTER 3. MARKER DISCOVERY AND ASSOCIATIONS WITH BETA-CAROTENE CONTENT IN INDIAN DAIRY CATTLE AND BUFFALO BREEDS
    Abstract
    Introduction
    Materials and methods
    Animal care
    High throughput SNP discovery and building of the Sequenom custom panel
    Beta-carotene measurement and SNP genotyping
    Statistical association analyses
    Results and discussion
    SNP discovery and panel performance
    Genotyping of the beta-carotene samples and association analyses
    Conclusions
    Acknowledgements
    Literature cited
    Tables and figures

CHAPTER 4. GENETIC BASIS OF BLOOD-BASED TRAITS AND THEIR RELATIONSHIP WITH PERFORMANCE AND ENVIRONMENT IN BEEF CATTLE AT WEANING
    Abstract
    Introduction
    Materials and methods
    Data description
    Imputation
    Population structure
    Statistical analyses
    Gene ontology and identification of candidate genes
    Results
    Population structure
    Phenotypic correlations
    Genetic correlations
    Estimation of narrow sense heritability (h2)
    Genome-wide association study (GWAS)
    Enrichment analysis
    Discussion
    Most blood-based traits and growth traits are weakly correlated
    Blood-based traits tend to have moderate to high heritability
    Maternal genetic effects do not impact genetic and genomic correlations or heritability
    Genome wide association study results identify few genomic regions with large effects
    Several candidate genes were identified for windows associated with blood-based traits and overlapping with growth traits
    Conclusions
    Availability of data and materials
    Author contributions
    Conflict of interest
    Funding
    Literature cited
    Tables and figures

CHAPTER 5. ESTIMATING BREED COMPOSITION FOR PIGS: A CASE STUDY FOCUSED ON MANGALITSA PIGS AND TWO METHODS
    Abstract
    Introduction
    Materials and methods
    Animal care and welfare
    Animal genotype data sets
    Determining marker subsets for analyses
    Breed composition analyses
    Results and discussion
    Marker subsets for analyses
    Linear regression
    Comparing methods for Mangalitsa breed composition estimation
    Conclusions
    Declaration of competing interests
    Acknowledgements
    Literature cited
    Tables and figures
    Appendix 5.1: Supplemental tables and figures

CHAPTER 6. SIGNATURES OF SELECTION AND GENOMIC DIVERSITY OF MUSKELLUNGE (ESOX MASQUINONGY) FROM TWO POPULATIONS IN NORTH AMERICA
    Abstract
    Background
    Results
    Conclusions
    Background
    Results
    Whole-genome sequencing and variant calling
    Population stratification analyses
    Pooled heterozygosity and genome wide Fst
    Inbreeding and runs of homozygosity (ROH)
    Discussion
    Whole-genome sequencing, alignment to Northern Pike genome and variant calling
    Population stratification
    Signatures of selection and inbreeding
    Conclusions
    Methods
    Individuals and sequencing
    Bioinformatics pipeline
    Population stratification analyses
    Pooled heterozygosity and Fst
    Inbreeding and runs of homozygosity (ROH)
    Declarations
    Ethics approval and consent to participate
    Availability of data and materials
    Consent for publication
    Competing interests
    Funding
    Authors’ Contributions
    Acknowledgements
    Literature cited
    Tables and Figures
    Appendix 6.1 Supplemental tables and figures

CHAPTER 7. GENERAL CONCLUSIONS

LIST OF FIGURES

Figure 3.1 Numbers of SNPs and their distribution derived by the next generation sequencing data across the three species.

Figure 4.1 Genetic (above diagonal) and phenotypic (below diagonal) correlations between traits.

Figure 4.2 Manhattan plot displaying one-megabase windows and the percentage of estimated genetic variance they account for along the genome.

Figure 4.3 Manhattan plot showing percentage of estimated genetic variance explained by each 1-megabase (MB) window for mean corpuscular hemoglobin (MCH).

Figure 4.4 Manhattan plot showing percentage of estimated genetic variance explained by each 1MB window for monocytes (MO).

Figure 4.5 Manhattan plot showing percentage of estimated genetic variance explained by each 1MB window for mean platelet volume (MPV).

Figure 5.1 A. Out-of-bag (OOB) error for each breed, along with the average OOB error across all breeds, when 10 markers were used with both selection strategies. B. OOB error across breeds and marker selection strategy for marker panels up to 1000 markers using random forest. C. Average estimated breed composition for purebred individuals across panels up to 1000 markers and marker selection strategies using random forest.

Figure 5.2 Comparison of estimated Mangalitsa breed composition for known crossbred individuals and individuals with unknown Mangalitsa breed composition in the validation data set using random forest.

Figure 5.3 Average coefficient of determination (R2) for breed composition estimation for pigs in the validation data set across panels and marker selection strategy.

Figure 5.4 Average estimated Mangalitsa breed composition for purebred individuals across panels and marker selection strategies using linear regression.

Figure 5.5 Comparison of estimated Mangalitsa breed composition for known crossbred individuals and individuals with unknown Mangalitsa breed composition in the validation data set using linear regression.

Figure 5.6 A. Out-of-bag (OOB) error for all breeds and general average across marker selection methods when 50 markers were used. B. OOB error for all breeds and general average across marker selection methods when 100 markers were used. C. OOB error for all breeds and general average across marker selection methods when 500 markers were used.

Figure 6.1 Specimen of Muskellunge (Esox masquinongy) caught and released in Iowa from an artificially stocked lake.

Figure 6.2 Average depth of coverage per mega base across all Iowa samples.

Figure 6.3 Distribution of SNPs by ...

Figure 6.4 Principal component analysis (PCA) results. A. Samples from Iowa colored by lake of origin. Big Spirit in blue and Okoboji in red. B. Samples from Iowa colored by sex. Females in orange and males in green. C. Samples from Canada. D. Principal component analysis (PCA) results for Iowa and Canada populations combined. PC1 and PC2 indicate principal component 1 and 2, respectively.

Figure 6.5 A. Cross-validation error value for multiple subpopulation numbers. B. Admixture plot for two subpopulations.

Figure 6.6 Mean pooled heterozygosity (Hp) values for 0.5 mega base windows with a 50% overlap for: A. All individuals. B. Females only. C. Males only.

Figure 6.7 Mean Fst (MFst) values for 0.5 mega base windows with a 50% overlap contrasting males and females in the Iowan population.

Figure 6.8 Mean Fst (MFst) values for 0.5 mega base windows with a 50% overlap contrasting the populations of Iowa and Canada.

Figure 6.9 Distribution of inbreeding coefficients (Froh) for the populations of Iowa and Canada.

Figure 6.10 A. Admixture analysis results for Muskellunge populations from Iowa and Canada with 12 assumed subpopulations.

LIST OF TABLES

Table 3.1 Sequenom panel.

Table 3.2 Number and mean beta-carotene (BC) concentration in milk.

Table 3.3 Fixed effects included in the linear models used to analyze each cattle breed.

Table 3.4 Fixed effects included in the linear models used to analyze each buffalo breed.

Table 3.5 Cattle SNPs with P-values ≤ 0.30 and F-tests for significant SNPs in each gene.

Table 3.6 Buffalo SNPs with P < 0.30 and F-tests for significant SNPs in each gene.

Table 3.7 Significant SNPs for cattle and buffalo and Total effect (STD) using a Bayesian approach (BayesA).

Table 3.8 Significant SNPs for cattle and buffalo and Total effect (STD) using a Bayesian approach (Bayes Cpi).

Table 4.1 Distribution of animals by farm, year and calving season.

Table 4.2 Description of traits analyzed.

Table 4.3 Narrow sense heritability (h2) estimates for blood and growth traits.

Table 4.4 Significant windows overlapping over different traits.

Table 4.5 The ten most significantly enriched terms for biological process for each trait category.

Table 4.6 Ten most significantly enriched terms for function for each trait category.

Table 5.1 Number and usage of pig genotypes according to breed.

Table 5.2 Variables used for prediction per tree for each SNP panel.

Table 5.3 Distribution by chromosome of markers for each marker panel for random markers (left) and Fst1-filtered markers (right).

Table 5.4 Ten markers with the highest Fst1 Score.

Table 5.5 Results of breed composition estimation using 50 Fst1-filtered markers with the linear regression method.

Table 5.6 Description of the QTL associated with the 10 markers with the highest Fst score.

Table 5.7 Average breed composition coefficient for individuals in the validation set obtained through linear regression for all breeds across all panel markers.

Table 6.1 Depth of coverage for raw whole-genome sequence data for Iowa samples.

Table 6.2 Average breadth of coverage across Iowa samples.

Table 6.3 Pooled heterozygosity values for individuals from Iowa.

Table 6.4 Individual and average Froh scores.

Table 6.5 Annotated genes located in windows found significant in pooled heterozygosity analyses.

Table 6.6 List of enriched GO terms related to Hp analyses.

Table 6.7 Annotated genes located in windows found significant in mFst analyses.

Table 6.8 List of enriched GO terms related to regions with high Fst scores.


ACKNOWLEDGMENTS

I would like to thank my committee chairs, Dr. Max F. Rothschild and Dr. Ken J. Stalder, and my committee members, Dr. Laura L. Greiner, Dr. James E. Koltes and Dr. Alex Ramirez, for their guidance and support. In addition, I would like to thank the ABG group at ISU and the Animal Science Department as a whole for making me feel at home for the last five years. People say a village is needed to raise a child; I believe the same to be true for a PhD. I thank all of my professors, who through great discussions and lectures opened a world of knowledge for me. It has been an honor to learn from you; we are truly standing on the shoulders of giants.

I want to offer my appreciation to those who supported me in my research and to all the coauthors, especially Francesca Bertolini and Luke Kramer, whose guidance and patience were vital. I would also like to thank the institutions that funded my PhD program, including the Ensminger fund, the State of Iowa, and Hatch funds.

Finally, I would like to thank my major professors and mentors, Dr. Rothschild and Dr. Stalder. Thank you for giving me the chance to be part of your groups and to learn so much. Thank you for pushing me to give my best and ultimately for believing in me. Your lessons both in life and science are invaluable, and I am forever grateful to both of you.

Labor Omnia Vincit.


ABSTRACT

To meet the demands of the 21st century, the livestock sector needs to increase production efficiently and sustainably while paying close attention to consumers' demands for animal welfare and social responsibility. Breeding and genetics provide a powerful approach to achieve these goals, given that genetic gains are cumulative. Additionally, the fast pace at which technology advances provides geneticists with several molecular tools to tackle a series of productive and environmental challenges. The manuscripts presented in this work represent varied applications of genomics to tackle these issues.

The first manuscript presented an approach to improve food and nutritional security in developing countries through the discovery of genes related to the content of beta-carotene in cow and buffalo milk using a candidate gene approach. Blood samples for DNA and milk samples for beta-carotene (BC) measurement were obtained from 2,291 Indian cows of 5 different breeds (Gir, Holstein cross, Jersey cross, Tharparkar, and Sahiwal) and 2,242 Indian buffaloes (Jafarabadi, Murrah, Pandharpuri, and Surti breeds). Multiple significant SNPs were found using Bayesian and frequentist methodologies, with allele substitution effects ranging from 6.21 (3.13) to 9.10 (5.43) µg of BC per 100 mL of milk. Total gene effects exceeded the mean BC value for all breeds with both analysis methods. Moreover, selecting for specific significant alleles of some gene markers provides a route to effectively increase the BC content in milk in the Indian cattle and buffalo populations.

The second manuscript focused on exploring the usefulness of blood-based traits as indicators of health and performance in beef cattle at weaning and on identifying the genetic basis underlying the different blood parameters obtained from complete blood counts (CBCs). CBCs were recorded at weaning from approximately 570 Angus-based, crossbred beef calves born between 2015 and 2016 and raised on toxic or novel tall fescue. The calves were genotyped using 50k SNPs and the genotypes were imputed to a density of 270k SNPs. Genetic parameters were estimated for 15 blood and 4 production traits. Finally, genome-wide association studies (GWAS) were performed for all traits. Heritability estimates ranged from 0.11 to 0.60, and generally weak phenotypic correlations and strong genetic correlations were observed among blood-based traits only. The GWAS identified ninety-one 1-Mb windows that accounted for 0.5% or more of the estimated genetic variance for at least 1 trait, with 21 windows overlapping across 2 or more traits (explaining more than 0.5% of estimated genetic variance for two or more traits). Five candidate genes were identified in the most interesting overlapping regions related to blood-based traits. Finally, there is evidence of an important overlap of genetic control among similar blood-based traits, which will allow for their use in improvement programs in beef cattle.

The third manuscript aimed to develop an effective set of SNPs to estimate breed composition of pigs, focusing on those with a Mangalitsa background. The manuscript also explored different methods to estimate breed composition. Genotypes from 648 pigs and 11 breeds were used to develop marker panels. Two sets of panels were created. The first set was composed of the 10, 50, 100, 500 and 1000 markers with the highest Fst scores across the pig genome. The second set was composed of randomly selected markers and had the same number of markers as the Fst-derived panels. Linear regression and random forest methods were then used on the marker panels to estimate breed composition of 107 pigs, including 47 individuals known to have Mangalitsa background. The Fst approach appeared to be better at identifying Mangalitsa individuals, while random markers were more accurate at estimating breed composition for non-Mangalitsa individuals. When the results were compared across methods for estimating breed composition, linear regression produced more accurate estimates of breed composition than random forest. Importantly, accuracy of estimation depends on the right set of animals being used as reference for the estimation.

The last manuscript presented was the first to examine the genomics of brood stock Muskellunge (Esox masquinongy) from Iowa and showed marked genetic differences with a Canadian population. The genome of the Northern pike (Esox lucius) was used as a reference genome to align whole-genome sequence from 12 brood individuals from Iowa and publicly available RAD-seq data of 625 individuals from the Saint-Lawrence River in Canada. Analyses were performed using 16,867 high-quality SNPs common between both populations. The Ti/Tv values were 1.09 and 1.29 for samples from Iowa and Canada, respectively. PCA and Admixture analyses showed large genetic differences between Canadian and Iowan populations. Window-based pooled heterozygosity found 6 highly heterozygous windows containing 244 genes in the Iowa population, and Fst comparing the Iowa and Canadian populations found 14 windows with Fst values larger than 0.9 containing 641 genes. Finally, these results prove the validity of using genomes of closely related species to perform genomic analyses when no reference genome assembly is available.

Overall, the manuscripts included in this thesis show the wide variety of applications and methods of genomics available to tackle the most important challenges that the biological fields, especially the livestock sector, will face in the years to come.


CHAPTER 1. GENERAL INTRODUCTION

The 21st century poses a series of additional challenges to the livestock production sector as a whole, including a rising population, climate change, swift loss of biodiversity and a rapid increase in demand for animal products. Thus, the livestock sector faces its original challenge of increasing efficiency with the added difficulty of improving sustainability and animal welfare while remaining a safe and accessible source of animal products. Furthermore, some consumers are becoming more educated and critical about how food is produced and how it impacts the environment, adding another level of complexity to production.

Animal breeding and genetics provide an opportunity to efficiently improve livestock production in a sustainable, cumulative manner (Flint and Woolliams, 2008). Moreover, the fast pace at which genomic and sequencing technologies are advancing, along with ever-increasing computational capacity, provides a growing toolbox for geneticists to discover, examine, dissect and improve traits of interest in multiple livestock species. This allows animal geneticists to have an impact in areas that range from animal production to human medicine. However, the growing list of available tools means that state-of-the-art animal breeding is based on integrating aspects of many sciences and technologies (Flint and Woolliams, 2008), such as quantitative and molecular genetics, statistics, computer science, physiology, nutrition, reproduction, husbandry, engineering, veterinary medicine and ethology. This integration is key to identifying, developing and implementing novel and better traits in the breeding system in order to optimize the evaluation, selection and mating of breeding candidates effectively and efficiently.

As mentioned before, a secondary effect of the rapid growth in technologies related to genetics is the wide set of disciplines in which these technologies are used. Nowadays, technologies such as genomic analysis and genome sequencing play a prominent role in multiple disciplines such as wildlife conservation (Rougemont et al., 2019), animal health and welfare (Kramer et al., 2017), quality and traceability of animal products (Kramer et al., 2016), food security (Walugembe et al., 2020) and preservation of genetic diversity (Bovo et al., 2020).

Given the extraordinary set of challenges that livestock production systems face and the widespread applicability of the technologies related to them, the field of genetics needs individuals with broad backgrounds who are capable of working and communicating effectively with specialists in different disciplines in order to attain its purpose of sustainably improving livestock performance. With this in mind, this dissertation aims to show the multiple applications that breeding and genetics can have in tackling the challenges of food production that humanity faces in the 21st century, as well as the well-rounded education needed in the new generation of professionals, through four manuscripts that describe:

• The identification of polymorphisms with the objective of improving nutrition and food security in developing countries.

• The use of genomic analyses to develop traits aimed at improving health and welfare of livestock.

• Analysis of genomic data and machine learning to improve breed assignment and traceability of livestock and derived products.

• Assessment of genetic diversity and population genomics of wild populations through the analysis of whole-genome sequence data.


Dissertation organization

This dissertation consists of seven chapters. Chapter 1 is the general introduction. Chapter 2 is a literature review that provides a detailed overview of current literature related to the research areas covered in this dissertation. Chapters 3 through 6 consist of four manuscripts that describe different applications of genomics and sequencing technology in cattle, buffalo, swine and fish. Chapter 7 presents implications from the research and conclusions of this dissertation.

Literature cited

Bovo, S., A. Ribani, M. Muñoz, E. Alves, J. P. Araujo, R. Bozzi, M. Čandek-Potokar, R. Charneca, F. Di Palma, G. Etherington, A. I. Fernandez, F. García, J. García-Casco, D. Karolyi, M. Gallo, V. Margeta, J. M. Martins, M. J. Mercat, G. Moscatelli, Y. Núñez, R. Quintanilla, Č. Radović, V. Razmaite, J. Riquet, R. Savić, G. Schiavo, G. Usai, V. J. Utzeri, C. Zimmer, C. Ovilo, and L. Fontanesi. 2020. Whole-genome sequencing of European autochthonous and commercial pig breeds allows the detection of signatures of selection for adaptation of genetic resources to different breeding and production systems. Genet. Sel. Evol. 52:33. doi:10.1186/s12711-020-00553-7. Available from: https://gsejournal.biomedcentral.com/articles/10.1186/s12711-020-00553-7

Flint, A. P. F., and J. A. Woolliams. 2008. Precision animal breeding. Philos. Trans. R. Soc. B Biol. Sci. 363:573–590. doi:10.1098/rstb.2007.2171. Available from: /pmc/articles/PMC2610171/?report=abstract

Kramer, L. M., M. A. A. Ghaffar, J. E. Koltes, E. R. Fritz-Waters, M. S. Mayes, A. D. Sewell, N. T. Weeks, D. J. Garrick, R. L. Fernando, L. Ma, and J. M. Reecy. 2016. Epistatic interactions associated with fatty acid concentrations of beef from angus sired beef cattle. BMC Genomics. 17:891. doi:10.1186/s12864-016-3235-8. Available from: http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3235-8

Kramer, L. M., M. S. Mayes, E. Fritz-Waters, J. L. Williams, E. D. Downey, R. G. Tait, A. Woolums, C. Chase, and J. M. Reecy. 2017. Evaluation of responses to vaccination of angus cattle for four viruses that contribute to bovine respiratory disease complex. J. Anim. Sci. 95:4820–4834. doi:10.2527/jas2017.1793. Available from: /pmc/articles/PMC6292290/?report=abstract

Rougemont, Q., A. Carrier, J. Le Luyer, A. L. Ferchaud, J. M. Farrell, D. Hatin, P. Brodeur, and L. Bernatchez. 2019. Combining population genomics and forward simulations to investigate stocking impacts: A case study of Muskellunge (Esox masquinongy) from the St. Lawrence River basin. Evol. Appl. 12:902–922. doi:10.1111/eva.12765.

Walugembe, M., E. N. Amuzu-Aweh, P. K. Botchway, A. Naazie, G. Aning, Y. Wang, P. Saelao, T. Kelly, R. A. Gallardo, H. Zhou, S. J. Lamont, B. B. Kayang, and J. C. M. Dekkers. 2020. Genetic Basis of Response of Ghanaian Local Chickens to Infection With a Lentogenic Newcastle Disease Virus. Front. Genet. 11:739. doi:10.3389/fgene.2020.00739. Available from: https://www.frontiersin.org/article/10.3389/fgene.2020.00739/full


CHAPTER 2. REVIEW OF LITERATURE

This literature review aims to provide recent and thorough information about pertinent topics addressed in this thesis, such as the role of livestock in food security, genomics and animal breeding. Other topics covered include the use of indicator traits in animal breeding, genome-wide association studies, signatures of selection, genotyping and next generation sequencing. The literature review concludes with a discussion of big data and precision livestock farming.

Livestock and food security

According to the Food and Agriculture Organization (FAO), food security is based on four main pillars: food availability, access to the available food, stability in the food production chain and distribution system, and optimal utilization of the available food (McLeod, 2011). However, to effectively tackle the food security issues faced at present, food production systems need to be developed with resilience and sustainability in mind (McLeod, 2011). Sustainability aims to satisfy the needs of the present generations in a way that guarantees that future generations have sufficient food of adequate nutritional quality to promote their well-being (Pinstrup-Andersen, 2009; Harding, 2010), while resilient food systems are those with the capacity to withstand negative effects from a variety of sources (McLeod, 2011).

Livestock is usually, albeit erroneously, considered incompatible with and even detrimental to sustainable food production systems. As mentioned by Adesogan et al., this concept originates from an overestimation of the environmental footprint of livestock production (Steinfeld et al., 2006; Adesogan et al., 2020). Thus, although livestock play a key role in the food security of small producers in developing countries, their role is often ignored. More than 60 percent of rural households keep livestock (Rege et al., 2011), which provide a variety of resources (McKune et al., 2015) and can even play a role in culture and society (Thornton, 2010).

Adding to livestock’s importance is the role that small species like chickens and goats play in food security for women and children. In Africa, women are traditionally considered caregivers and are at higher risk of being food insecure (Gladwin et al., 2001). Moreover, there is strong evidence that when women receive income directly, infant nutrition and household food security are addressed more effectively than when income is controlled by men (Quisumbing et al., 1995). Here, livestock is a key component of food security given that a large proportion of the rural population, especially women, view smaller livestock species as “current accounts” and larger species as “saving accounts” (Lebbie, 2004).

Livestock also play a key role in food security by using non-arable lands and byproducts of other agricultural operations, transforming these into food sources of high nutrient density and nutritional quality (Otte et al., 2012). Non-arable lands such as drylands cover about 40 percent of the world’s land surface and 54 percent of productive land (Otte et al., 2012). Livestock grazing constitutes the largest land-use system on earth, mainly through exploiting these kinds of land. It is estimated that more than 180 million people, predominantly in the developing world, depend on livestock grazing and pastoral activities for their livelihood (Thornton et al., 2002).

Finally, given the clear importance of livestock for food security in the developing world, special attention should be paid to producing livestock that are adapted to the low-input systems found in developing countries without disregarding their productive capacity. Although traditional breeds are adapted to the environment and management system in which they have been developed (Thornton and Herrero, 2014), they often do not have the productive capacity of modern breeds. Therefore, in order to guarantee the viability of low-input production schemes, the development and deployment of highly productive genotypes adapted to a wide spectrum of environmental conditions is needed. To accomplish this, a varied array of breeding strategies is needed. However, these strategies have to be mixed and/or adapted depending on the species and the specific production and cultural settings encountered (Rege et al., 2011).

Indicator traits in animal breeding

Indicator traits are defined as traits that provide indirect information on traits of interest (Woolliams and Smith, 1988). The effect of an indicator trait on selection response largely depends on the genetic correlation between traits as well as the accuracy of selection for the traits (Woolliams and Smith, 1988). The genetic correlation is usually defined as the correlation between the breeding values of multiple traits (Hill, 2013), and it is mainly caused by linkage disequilibrium and pleiotropic genes.
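The dependence of selection response on genetic correlation and accuracy can be illustrated with the standard quantitative genetics expressions for direct and correlated response. The sketch below is only an illustration under simplifying assumptions: equal selection intensity and generation interval for both strategies, and a single indicator trait. The function names and all numeric inputs are hypothetical and are not taken from this dissertation or from the cited works.

```python
def direct_response(intensity, accuracy_target, sd_additive_target):
    """Expected response per generation when selecting on the target trait itself.

    R = i * r_Y * sigma_A(Y), where r_Y is the accuracy of selection.
    """
    return intensity * accuracy_target * sd_additive_target


def correlated_response(intensity, accuracy_indicator, genetic_correlation,
                        sd_additive_target):
    """Expected change in the target trait when selecting on the indicator trait.

    CR = i * r_X * r_g * sigma_A(Y), where r_X is the accuracy of selection
    for the indicator and r_g is the genetic correlation between the traits.
    """
    return (intensity * accuracy_indicator * genetic_correlation
            * sd_additive_target)


def relative_efficiency(accuracy_indicator, accuracy_target, genetic_correlation):
    """Ratio CR / R at equal selection intensity: r_g * r_X / r_Y."""
    return genetic_correlation * accuracy_indicator / accuracy_target


# Hypothetical inputs chosen only to illustrate the calculation.
intensity = 1.4            # standardized selection intensity (assumed)
accuracy_indicator = 0.7   # accuracy of selection on the indicator trait (assumed)
accuracy_target = 0.4      # accuracy of selection on the target trait (assumed)
genetic_correlation = 0.6  # genetic correlation between the two traits (assumed)
sd_additive_target = 4.0   # additive genetic SD of the target trait (assumed)

direct = direct_response(intensity, accuracy_target, sd_additive_target)
indirect = correlated_response(intensity, accuracy_indicator,
                               genetic_correlation, sd_additive_target)
efficiency = relative_efficiency(accuracy_indicator, accuracy_target,
                                 genetic_correlation)

print(f"direct response:     {direct:.3f}")
print(f"correlated response: {indirect:.3f}")
print(f"relative efficiency: {efficiency:.3f}")
```

Under these assumptions, indirect selection is at least as effective as direct selection whenever the product of the genetic correlation and the indicator's accuracy is at least as large as the accuracy achievable for the target trait.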

The importance of indicator traits becomes evident when the limitations of modern breeding programs are taken into consideration. One of these limitations is the cost and logistics associated with recording phenotypes from selection candidates and/or their relatives in a timely and accurate manner. Indicator traits play a key role in the efficiency of selection schemes because they can be recorded earlier than the actual phenotypes can be measured in the candidates, therefore reducing the generation interval. Indicator traits are particularly useful to improve selection response on traits that are recorded late in life, are sex dependent or that cannot be measured directly in selection candidates due to sanitary or welfare reasons (Dekkers, 2012).

Although indicator traits have not been extensively used by breeders historically, there have been efforts in using physiological measurements in cattle and pigs (Bunter et al., 2005) and blood markers in chickens (Briles et al., 1977) as proxies of growth and disease resistance, respectively (Dekkers, 2012). Presently, the rapid adoption of high-throughput phenotyping technologies in the livestock sector in addition to the rapid cost reductions seen for omic technologies in general (Koltes et al., 2019) have allowed many novel phenotypes to be collected in several agricultural species. Other factors that play in favor of indicator traits are the current challenge of climate change and its associated negative effects on livestock production such as heat stress (Rojas-Downing et al., 2017), the emergence of novel pathogens (Bean et al., 2013) and the increase in the distribution of diseases and vectors along with the increased interactions of livestock with wildlife due to the expansion of the agricultural frontier (Austin, 2021). The intersection of technological development and new challenges has fueled a renewed interest in indicator traits due to their effectiveness at facilitating selection for traits related to environmental stressors and disease resistance.

A historical overview of animal breeding and the role of genomics in genetic selection.

The main goal of animal breeding is to improve the mean performance of populations over time. For this, breeders count on two main tools: selection and mating. Selection is the process of identifying the individuals that will be used to produce the next generation, the magnitude of their contribution to it and determining for how long they will be used as parents (Dekkers et al., 2004). The second tool, mating, refers to deciding how males and females will be allocated in order to achieve the selection goal. In general, there are two main mating strategies. Population uniformity can be increased by mating individuals with contrasting genetic values. Alternatively, individuals of similar genetic value can be mated in order to produce extreme phenotypes. The latter, known as assortative mating, is the most commonly adopted strategy by commercial breeders. A third approach involving different genetic backgrounds is crossbreeding. The idea behind crossbreeding is to exploit breed complementarity by obtaining offspring that have the capacity to perform better than the average of the parental breeds. This strategy is particularly useful to maximize performance for traits that depend on a dominant mode of inheritance, although the improvement is highest in the offspring generation and will be steadily lost in future generations.

Some concept of breeding has existed since the times of ancient civilizations, as seen in Roman texts that discuss the characteristics that were considered as desirable for an animal to be selected for breeding purposes (Lush, 1947). The beginnings of animal breeding as a science trace back to Galton and Pearson in Victorian times (Gianola and Rosa, 2015). Galton noticed that on average, children from tall parents were shorter than their parents, while children from short parents tended to be taller (Galton, 1886), providing the first rough idea of heritability, a concept that would become one of the bases of the modern theory of genetic selection. However, it was Wright, Fisher and Haldane who are considered the founders of modern quantitative genetics. Fisher (Fisher, 1919) introduced the infinitesimal model along with the concept of analysis of variance (ANOVA) that allows for the estimation of the different variance components (Gianola and Rosa, 2015). Wright developed the idea of inbreeding based on following pedigree paths to identify common ancestors shared between parents along with other statistics that include information of population structure (Crow, 2007), while Haldane focused on linkage between genes and calculations to determine recombination rates and genetic distances (Charlesworth, 2017).

It wasn’t until later in the 20th century that Jay L. Lush, considered a pioneer in the use of biometrical tools and statistics in animal breeding, combined previous work by Fisher and Wright (Chapman, 1987) to initiate what is considered the modern era of animal breeding. One of the most important contributions at Iowa State from Lush and his students was the development of selection index theory by L. N. Hazel (Hazel, 1943) that allowed breeders to estimate genetic values of selection candidates on multiple traits simultaneously while using information from several related individuals. This method was later improved by Hazel’s student C. R. Henderson in the 1940s. Henderson’s Best Linear Unbiased Predictor (BLUP) is still considered the best approach to estimating breeding values as it allows their estimation while adjusting for fixed effects present in the data that might bias estimations. This is achieved through the use of a system of equations known as the mixed model equations (MME) that minimizes the mean squared prediction error (best) while constraining the expected value of the prediction to equal the expected value of the quantity being predicted (unbiased) (Schaeffer, 2010).
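To make the mixed model equations mentioned above more concrete, the following is a minimal sketch for the model y = Xb + Zu + e, where the equations take the form [X'X, X'Z; Z'X, Z'Z + A^-1 * lambda] [b; u] = [X'y; Z'y] with lambda equal to the ratio of residual to additive genetic variance. The helper names, the toy data, the use of exact fractions and the assumption of unrelated animals (A equal to the identity matrix) are illustrative choices and are not part of the original text.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination using exact fractions.
    Assumes A is square and non-singular."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[-1] for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def mixed_model_equations(X, Z, y, A_inv, lam):
    """Build and solve the MME; returns (fixed effects, breeding values).
    lam is assumed to be sigma2_e / sigma2_u."""
    Xt, Zt, ycol = transpose(X), transpose(Z), [[v] for v in y]
    XtX, XtZ = matmul(Xt, X), matmul(Xt, Z)
    ZtX, ZtZ = matmul(Zt, X), matmul(Zt, Z)
    q = len(ZtZ)
    lower_right = [[ZtZ[i][j] + lam * A_inv[i][j] for j in range(q)] for i in range(q)]
    lhs = [r1 + r2 for r1, r2 in zip(XtX, XtZ)]
    lhs += [r1 + r2 for r1, r2 in zip(ZtX, lower_right)]
    rhs = [row[0] for row in matmul(Xt, ycol)] + [row[0] for row in matmul(Zt, ycol)]
    solution = solve(lhs, rhs)
    p = len(XtX)
    return solution[:p], solution[p:]


# Hypothetical toy data: two unrelated animals, one record each,
# an overall mean as the only fixed effect, and lambda = 1.
identity = [[1, 0], [0, 1]]
b_hat, u_hat = mixed_model_equations([[1], [1]], identity, [3, 5], identity, 1)
print(b_hat, u_hat)  # mean = 4, breeding values = -1/2 and 1/2
```

In this toy case each animal's deviation from the mean (-1 and +1) is shrunk by half, which is what a variance ratio of 1 implies for a single record per animal.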

Parallel to the refinement of methods used by breeders to estimate breeding values of selection candidates, molecular geneticists were focused on searching for the molecular elements that made inheritance of characteristics possible. Molecular genetics can be traced back to Mendel’s pea experiments and his effort to identify the “factors” that governed inheritance. The main objective of genomics as a discipline is to identify and characterize the molecular elements that cause the phenotypic variation that can be observed throughout populations (Blasco and Toro, 2014). With the discovery of the double helix by Watson and Crick in the 1950s, the development of the molecular genetics field was initiated. In the 1990s, genetic maps of domestic species were originally built with the use of microsatellites. The idea of searching for regions of the genome that contained genes affecting economically important traits with the ultimate goal of using this information to improve breeding signaled the birth of genomics as a field (Georges et al., 2019). Genomics refers to the discipline that focuses on mapping, sequencing and analyzing DNA information with the goal of understanding the structure, function and evolution at a whole-genome level (Blasco and Toro, 2014).

The first application of genomic methods in animal breeding was to develop genetic tests to improve selection for traits that depended on one or a few genes with large effects. This was followed by the application of genomics to selection of quantitative traits through the concept of marker-assisted selection (MAS), whose objective was to increase genetic response (Georges et al., 2019). This process can be thought of as being composed of three main steps. The first and second steps consist of discovering QTLs and identifying the causal genes for the traits of interest. The third step is focused on increasing the frequency of the favorable alleles in the population (Blasco and Toro, 2014). This could be attained by different strategies like selection of the beneficial alleles or introgression of alleles from different breeds or populations depending on the frequency at which these favorable alleles are found in the population to be improved.

A different application of genomic methods to improve populations was the candidate gene approach. This method is based on the idea that if a gene that shows polymorphism has a known biological role associated with a trait, the variation in the gene must be responsible for at least part of the phenotypic variation of the trait (Rothschild and Soller, 1997; Angaji, 2010). Hence, once a candidate gene is found, no QTL mapping is needed. This methodology implies the identification of a candidate gene, amplification of the gene through PCR and examination of structural variation in the gene prior to studying the association of the candidate gene with the phenotypic variation of the trait (Rothschild and Soller, 1997). This approach can be modified to look at candidate genes demonstrated in other species and look for reasonable positional candidate genes in the regions once QTL are identified. These approaches have proved particularly useful in pigs (Rothschild et al., 1996; Fan et al., 2009) although success has been limited in other species.

Most traits of economic interest in livestock breeding are quantitative traits. Quantitative traits are those controlled by a large number of genes and also influenced by the environment. For quantitative traits, advances in genomics meant development of genetic testing and the identification of quantitative trait loci (QTL) with the idea of better informing selection decisions and reducing generation intervals by selecting candidates at a younger age (Soller, 1978; Smith and Simpson, 1986; Dekkers, 2012). As the number of identified QTL increased rapidly, and technology, methodology and computational power improved, MAS evolved into genomic selection as proposed by Meuwissen (Meuwissen et al., 2001) by jumping from the small number of QTL used in MAS to thousands of markers spread throughout the genome.

Along with the massive number of QTLs that were identified, a key development that allowed for the efficient adoption of genomic selection was the affordability of single nucleotide polymorphism (SNP) panels with tens to hundreds of thousands of markers that make it possible to scan a large proportion of the genome. In short, genomic selection is composed of two main steps. First, the effects of several thousand markers are estimated from a reference or training population that has been genotyped and phenotyped and that is closely related to the selection candidates (Blasco and Toro, 2014). Second, the estimated marker effects are used to predict the breeding value of selection candidates based only on their marker genotypes across the genome (Dekkers and Hospital, 2002) without the need to phenotype (Blasco and Toro, 2014).
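The two steps described above can be sketched with a ridge-regression (SNP-BLUP style) model, which is only one of several ways marker effects may be estimated. In this hypothetical example genotypes are coded 0/1/2, phenotypes and genotypes are centered on the training means, and the shrinkage parameter `lam` is assumed to stand for the ratio of residual to marker variance. All function names and data are illustrative.

```python
from fractions import Fraction


def solve(A, b):
    """Gauss-Jordan elimination with exact fractions (A must be non-singular)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[-1] for row in M]


def mean(values):
    return sum(Fraction(v) for v in values) / len(values)


def estimate_marker_effects(geno, pheno, lam):
    """Step 1: estimate all marker effects jointly in the training
    population by solving (W'W + lam * I) a = W'y on centered data."""
    n, p = len(geno), len(geno[0])
    col_mean = [mean([row[j] for row in geno]) for j in range(p)]
    y_mean = mean(pheno)
    W = [[Fraction(row[j]) - col_mean[j] for j in range(p)] for row in geno]
    yc = [Fraction(v) - y_mean for v in pheno]
    lhs = [[sum(W[i][a] * W[i][b] for i in range(n)) + (lam if a == b else 0)
            for b in range(p)] for a in range(p)]
    rhs = [sum(W[i][a] * yc[i] for i in range(n)) for a in range(p)]
    return solve(lhs, rhs), col_mean, y_mean


def predict_gebv(geno, effects, col_mean):
    """Step 2: genomic breeding values of candidates from genotypes only."""
    return [sum((Fraction(g) - m) * a for g, m, a in zip(row, col_mean, effects))
            for row in geno]


# Hypothetical toy data: 5 training animals, 2 SNPs coded 0/1/2.
train_geno = [[0, 1], [1, 0], [2, 1], [1, 2], [2, 2]]
train_pheno = [10, 11, 14, 13, 15]
effects, col_mean, _ = estimate_marker_effects(train_geno, train_pheno, lam=1)

# Candidates have genotypes but no phenotypes.
gebv = predict_gebv([[2, 0], [0, 2]], effects, col_mean)
print([float(v) for v in gebv])
```

Real applications involve thousands of markers and far more animals, so dedicated linear algebra libraries would be used instead of the small exact solver shown here.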

In the last two decades, genomic selection has had a major impact on breeding programs in several livestock species (Hayes et al., 2009a; Robledo et al., 2018; Teng et al., 2019; Yang et al., 2020). Genomic selection allows selection of candidates at a young age and thus reduces generation intervals (Schaeffer, 2006) and improves selection for traits with low heritability (König and Swalve, 2009; Dekkers, 2012) such as fertility and resilience to external stressors like heat stress and disease. Nowadays, as genotyping and sequencing costs continue to go down and computational power keeps increasing, the use of genomic information in selection is shifting its focus from causative QTL to regulatory QTL that are expected to affect gene expression, thus adding another source of genetic variation (Georges et al., 2019).

Even though the most direct application of genomics to selection is through identifying markers to be included in genomic selection, this is the last step of a process known as trait development. Trait development involves discovering traits that might be of economic importance, examining their functionality for breeding purposes, and understanding their genetic basis and their phenotypic and genetic correlations with other traits of interest. The goal of this process is to include these traits in the selection indexes used to achieve the breeding objectives. Finally, throughout the years the field of genomics has evolved and technologies such as genomic analysis and genome sequencing play a prominent role in multiple disciplines such as wildlife conservation (Rougemont et al., 2019), animal health and welfare (Kramer et al., 2017), quality and traceability of animal products (Chinchilla-Vargas et al., 2021), food security (Bertolini et al., 2019) and preservation of genetic diversity (Bovo et al., 2020).

SNP genotyping and genome sequencing

Genotyping simply refers to the detection of genetic differences that may lead to phenotypic variations. Generally, the genetic variation is identified by comparison between individuals or with a reference genome. On the other hand, sequencing is the process of determining the exact order of nucleotides in an individual’s DNA (Behjati and Tarpey, 2013). Even though both tools share principles and were developed in parallel, the costs and technical difficulty of sequencing are significantly higher than those of genotyping and therefore its adoption has been slower (Naqvi and Naqvi, 2007).

Sequencing first appeared in the 1970s with Sanger’s method, which used modified nucleotides that terminate strand synthesis once incorporated. Sequencing methods were further developed in the 1980s with the use of bacterial vectors in shotgun sequencing (Giani et al., 2019). However, the turning point for the development of genome sequencing technologies came with the launching of the Human Genome Project (Venter et al., 2001). Because of the competition between public efforts and private companies to produce the first-ever complete sequence of the human genome, several private initiatives developed for-profit models to outsource sequencing services to research institutions. This boom motivated the development of next generation sequencing (NGS) approaches (Giani et al., 2019). Next generation sequencing encompasses a series of technologies that allow sequencing to be performed at a very high rate by massively parallelizing the process (Sharma et al., 2017).

Next generation sequencing’s approach is based on breaking the DNA into small fragments of approximately 100 to 150 bp, fixing the fragments to a solid medium, using PCR to massively amplify the small fragments that will be sequenced and, finally, aligning these reads by using their overlapping segments to form an overall consensus sequence (França et al., 2002; Sharma et al., 2017; Giani et al., 2019). However, an important disadvantage of these technologies is that, due to the short read size, highly repetitive regions and structural variations such as copy number variations create important challenges to the accurate assembly of the consensus sequence.
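The last step described above, joining short reads through their overlapping segments, can be illustrated with a deliberately simplified greedy merge. This is only a toy sketch: it assumes error-free reads from a single strand, the function names and the reads are hypothetical, and production assemblers rely on far more elaborate graph-based methods.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no overlaps left; remaining contigs stay separate
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical short reads sampled from the sequence ATGCGTACGA.
contigs = greedy_assemble(["ATGCGT", "GCGTAC", "GTACGA"])
print(contigs)  # ['ATGCGTACGA']
```

When the same motif occurs at several places in the genome, many reads overlap equally well with more than one partner, which is why repetitive regions and copy number variations are difficult to resolve with short reads.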

Although highly efficient compared to previous sequencing approaches, NGS cost and technical complexity were a barrier to its widespread use in research, particularly related to livestock and other non-human centered applications, and therefore other strategies to use genetic information in research such as genotyping were developed. One of the discoveries of the Human Genome Project was that of several kinds of structural variations in the genome (Kim and Misra, 2007), with the most abundant form discovered being single-base mutations that are spread throughout the genome. These forms of variation are known as single-nucleotide polymorphisms (SNPs). Importantly, to be distinguished from a rare variant in the population, useful SNPs are to be present in at least 1% of the population (Brookes, 1999). Added to the relatively high frequency in which the alleles of these variants are by definition found in a population is the fact that, on average, there is one SNP every 1000 base pairs, and this makes it possible to scan the whole length of the genome using SNPs (Brookes, 1999; Kim and Misra, 2007). Another important property that made SNPs a very effective vehicle to detect associations between genetic variation and phenotypic variation is linkage disequilibrium (LD). Linkage disequilibrium or gametic phase disequilibrium is seen when two loci show a non-random association between their alleles or, in other words, alleles at two loci are correlated as they segregate together more often than would be expected at random (Hayes and Goddard, 2010). Linkage disequilibrium thus allows SNPs near causal mutations to still capture an important proportion of the relationship between genetic and phenotypic variation. Despite SNPs being a great tool to study associations between genes and phenotypes due to the large number of them spread throughout the genome and occurrences like LD (Brookes, 1999), there is still the need to genotype a large number of individuals with an extensive number of SNPs in order to capture the true genetic variation of a population (Blasco and Toro, 2014).
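The non-random association between alleles at two loci described above is commonly summarized with the disequilibrium coefficient D and the squared correlation r². The sketch below computes both from haplotype counts. It assumes phased data for two biallelic loci that are both still segregating, and the function name and counts are hypothetical.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2) for two biallelic loci from phased haplotype counts.

    D  = freq(AB) - freq(A) * freq(B)
    r2 = D^2 / (pA * (1 - pA) * pB * (1 - pB))
    Both loci must be polymorphic, otherwise r2 is undefined.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B
    r2 = d ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, r2


# Hypothetical sample of 10 haplotypes in which A tends to travel with B.
d, r2 = linkage_disequilibrium(4, 1, 1, 4)
print(f"D = {d:.2f}, r2 = {r2:.2f}")  # D = 0.15, r2 = 0.36

# Alleles combined at random: no disequilibrium.
print(linkage_disequilibrium(25, 25, 25, 25))  # (0.0, 0.0)
```

An r² close to 1 would indicate that a marker SNP is an almost perfect proxy for the second locus, while values near 0 indicate that the marker carries little information about it.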

From a technical point of view, SNP genotyping requires two main steps: the generation of allele-specific products for SNPs of interest followed by a method to differentiate the specific products at each locus in order to infer the nucleotides present at each locus (Kim and Misra, 2007). There are four main methods used to generate allele-specific products at each locus: primer extension, hybridization, ligation, and enzymatic cleavage (Kim and Misra, 2007). Independently of the method used, the goal of this step is to produce a different biochemical product specifically for each allele at the locus. The second step is to analyze the products at each locus to detect the alleles present. The three main methods used for this purpose are mass spectrometry, fluorescence and chemiluminescence (Kim and Misra, 2007). Ultimately, the effectiveness of using SNPs for genomic studies and breeding has caused methods to rapidly evolve, making panels of thousands of SNPs a common tool (Blasco and Toro, 2014).

Today, NGS is rapidly gaining popularity thanks to fast-dropping prices and the development of bioinformatic tools as well as a dramatic increase in computational power that allow sequencing to be performed efficiently. Next generation sequencing has allowed users to achieve a new level of resolution when exploring the genetic basis of phenotypic variation by allowing geneticists to study the vast majority of the genetic sequence. The cost reduction and the technological advancements related to NGS have also allowed breeders to implement genomic selection as well as performing association studies with NGS data and not only a select number of markers. This provides an important advantage as the causal mutations are directly included in the data and although linkage disequilibrium might still play a role in amplifying signals, it is not essential to detect the source of variation (Sharma et al., 2017). However, one of the disadvantages of NGS when dealing with large, complex genomes stems from its reliance on the proper overlapping of short reads in order to infer large segments of sequence, as repetitive regions and structural variations like translocations, insertions and copy number variations pose a challenge to NGS (Bickhart and Liu, 2014). To solve these issues, a new generation of sequencing techniques known as Third Generation Sequencing (TGS) appeared in the last decade. The distinctive characteristic of TGS is that there is no need of a massive amplification step through PCR as single molecules of DNA are sequenced directly (Giani et al., 2019).

Currently, TGS technologies allow for the production of reads markedly longer than those of NGS, capable of spanning several hundred kilobase pairs (kbp). The capacity to produce reads of this length is extremely advantageous as it allows the genome to be covered in a more uniform way in addition to allowing for a markedly increased capacity to sequence areas of the genome that are repetitive and thus might prove problematic for short reads (Sharma et al., 2017; Giani et al., 2019). While the evolution and adoption of new sequencing technologies is rapid, SNP genotyping, NGS and TGS are all widely used in the field and are often combined to achieve the optimal balance between costs and the amount of information needed to achieve the desired accuracy.

Genome-wide association studies (GWAS)

With the take-off of genomics and the capacity to access hundreds of thousands to millions of markers distributed along the genome mainly through genotyping technologies and more recently through sequencing, genome-wide association studies (GWAS) were developed. They were initially developed to aid in the identification of loci associated with disease in humans but were rapidly adopted by the livestock sector with the idea of identifying loci related to traits of economic importance in order to improve genomic selection (Sharmaa et al., 2015). The theoretical principle of GWAS is to use structural genetic variation, most often a very large number of SNPs distributed along the genome, and phenotypic information to associate the variation in the genotype of specific markers or regions with the variation observed in the phenotypes (Sharmaa et al., 2015).

One of the key advantages of GWAS compared to other methods used to detect associations between genotype and phenotype is the capacity of detecting variants with small effects. Moreover, the high density of markers generally used in these studies provides the capability of defining narrow genomic regions that influence phenotypic variation of the traits studied (Hirschhorn and Daly, 2005), making GWAS an ideal technique to discover the major genes involved in complex traits (Zhang et al., 2012). This being said, this method has limitations like the need to have a very large number of individuals to achieve the desired power and its reliance on linkage disequilibrium between markers and causal variants to produce an association signal when SNP data are used. Perhaps the biggest limitation of GWAS is the inherent inability to detect variants or regions associated with a trait if the causal variants or markers linked to these variants are fixed or very close to fixation, given the lack of genetic variation in the area (de Simoni Gouveia et al., 2014).

Structural and molecular geneticists have always been interested in identifying the loci associated with phenotypic variation of traits of interest. Each of these loci is commonly known as a quantitative trait locus (QTL). Before the boom of genomics, microsatellite markers were the main tool used to locate QTL (Lipkin et al., 1998) through linkage analysis studies that required experimental crosses between breeds with opposite characteristics to be produced, most commonly F2 and backcrosses (Sharmaa et al., 2015). Although many QTL were mapped through these methods, the mapping precision tended to be low and the causative gene or mutation was very rarely identified (Schmid and Bennewitz, 2017).

One of the most important contributions of GWAS to agriculture is the essential role they have played in aiding genomic selection to become widely adopted and thus, significantly accelerating the rate of genetic gain particularly for traits with low heritability (Schmid and Bennewitz, 2017). Given that genomic selection is based on the information provided by dense sets of markers spread uniformly throughout the genome (Meuwissen et al., 2001), identifying either linked or causative QTL that explain phenotypic variation of the traits under selection is of utmost importance (Meuwissen et al., 2016). From a perhaps more basic-science approach, GWAS results can increase the knowledge of the biology that underpins trait expression and gene function, which in turn can be applied to further understand and develop novel traits that need to be improved either by being included in the selection process or by developing specific livestock management strategies (Goddard et al., 2016). With this information, the effects of different alleles can be estimated and thus, the role of dominance or epistatic variance in the phenotypic variation can also be estimated (Goddard et al., 2016; Schmid and Bennewitz, 2017).

There are two main methodologies available for performing GWAS, single-marker and multi-marker models (Hayes and Goddard, 2010). In the single-marker methodology, a single SNP is tested as a continuous variable in a mixed linear model (Wood et al., 2014; Yang et al., 2014). In this method, the SNP effect represents the allele substitution effect, defined by Falconer as the change in the mean trait value when one q allele in a random individual is replaced by a p allele (Falconer and Mackay, 1996). The regression ultimately provides a p-value and a regression coefficient that represents the magnitude of the allele substitution effect (Schmid and Bennewitz, 2017). While this approach has low computational requirements, as a single SNP is fitted at a time, and the interpretation of the p-value is relatively straightforward, it has some important disadvantages (Goddard et al., 2016). Since each SNP is tested individually, this approach runs several thousands of tests and therefore the nominal p-values need to be corrected accordingly (Schmid and Bennewitz, 2017). The most common corrections are the Bonferroni correction and the false discovery rate (FDR). In general, a combination of both is used to determine which SNPs are significantly associated with the phenotypic variation of a trait (Fernando et al., 2004). A second important disadvantage is caused by LD between markers that are in close proximity and the causal mutation. Since several SNPs are expected to be in partial LD with the causal mutation, SNPs more than 2 Mb from the QTL can show a significant association, adding noise to the results (Goddard et al., 2016). It is important to remember that livestock populations generally show low effective population sizes (Ne) and thus, LD tends to stretch for long distances when there is no crossbreeding present.
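The single-marker test and the two corrections mentioned above can be sketched as follows. For brevity the regression is an ordinary least-squares fit of phenotype on genotype (coded 0/1/2), which ignores the polygenic and other effects that a mixed linear model would account for. The FDR control shown is the Benjamini-Hochberg step-up procedure, chosen here as a common example and not necessarily the procedure used in the cited studies. Function names, data and p-values are hypothetical.

```python
import math


def single_snp_regression(genotypes, phenotypes):
    """Regress phenotype on allele count; the slope estimates the allele
    substitution effect. Returns (slope, standard error, t statistic).
    A p-value would follow from a t distribution with n - 2 df."""
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - intercept - slope * x) ** 2
              for x, y in zip(genotypes, phenotypes))
    se = math.sqrt(sse / (n - 2) / sxx)
    return slope, se, slope / se


def bonferroni(p_values, alpha=0.05):
    """Declare a test significant if p <= alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up FDR control: find the largest rank k such that
    p_(k) <= k * alpha / m and reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff_rank = rank
    significant = [False] * m
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant


# Hypothetical SNP: six animals, genotype counts and phenotypes.
slope, se, t_stat = single_snp_regression([0, 0, 1, 1, 2, 2],
                                          [10, 12, 13, 15, 16, 18])
print(f"allele substitution effect = {slope:.1f}, t = {t_stat:.2f}")

# Hypothetical nominal p-values from eight single-SNP tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(bonferroni(p_values))          # only the first SNP passes
print(benjamini_hochberg(p_values))  # the first two SNPs pass
```

The example shows the usual pattern: Bonferroni is the stricter of the two, so the FDR procedure tends to retain more SNPs at the same nominal level.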

On the other hand, with the multi-marker approach, large groups of SNPs or even all SNPs are fitted in the model at once as random effects to avoid overparameterization (Meuwissen et al., 2001). An important observation is that fitting all SNPs simultaneously makes the assumption of the genetic variance having a uniform distribution over the genome and thus, all SNPs fitted in the model are estimated to have an effect on the phenotype, which by definition defeats the purpose of performing a GWAS (Goddard et al., 2016). A way to work around the assumption of uniform distribution of variance is to perform the multi-marker GWAS using one of the several Bayesian priors that have been developed along with only fitting a subset of the markers at once (Meuwissen et al., 2001; Verbyla et al., 2009; Verbyla et al., 2010; Erbe et al., 2012; Gianola, 2013). Very often, when the Bayesian methodology is used in GWAS, the genetic variance is estimated for windows of a defined size rather than for individual SNPs (Fernando et al., 2017). In this methodology, as the number of SNPs fitted in the model increases, the effects are proportionally shrunk towards 0 in a phenomenon known as Bayesian shrinkage (Bhattacharya et al., 2012; Schmid and Bennewitz, 2017). Compared to the single-marker approach, these multi-marker methods require a significantly larger amount of computational power. Additionally, their implementation methodology is not as straightforward since with the Bayesian approach parameters like window size, prior choice and proportion of SNPs to be fitted simultaneously (Pi value) can have a large impact on the result and thus, should be chosen carefully (Schmid and Bennewitz, 2017).

Since being adapted for use in livestock, there has been great progress in GWAS methodology and numerous genes and QTL for economically important traits have been identified. However, the inconsistency among the results of GWAS performed for the same trait is one of the main concerns about the accuracy of this technique (Zhang et al., 2012).

It is vital to be aware that GWAS results rely on LD between the SNPs used in the specific analysis and causative genes when interpreting results (Schmid and Bennewitz, 2017). Therefore results are specific for the population that is tested unless the causative mutation is identified directly (Sellner et al., 2007). Since LD is a non-random association of alleles among different loci, factors like selection, mutation, migration, population structure and recombination rate will cause LD to differ between populations (Zhu et al., 2013), which can lead to erroneous assumptions when extrapolating results from one population to another. Moreover, if multiple populations are being used simultaneously to find the association between genotype and phenotype, population structure should be correctly accounted for in the statistical model. As access to methodology, computational power and data increase in the future, special attention should be paid to aspects like experimental design, statistical modeling techniques and data quality control as these play a key role in limiting false positives/negatives and inconsistent results in genome-wide association studies.

Signatures of selection

One of the most important components of molecular breeding and genetics is to understand and identify the genes and mutations that have an effect on the traits of interest. Identifying signatures of selection plays an important role in this regard as it allows researchers to characterize regions of the genome that play a role in traits related to adaptation and performance that have been under selection. All populations, independently of being wild or domestic, are under selection, either natural, artificial or both. The most important component needed for selection to take place is heritable genetic variation (Qanbari and Simianer, 2014).

Natural selection is key to improving the adaptation of populations to the environments they are subjected to by giving the appropriate genotypes better chances of survival and reproductive fitness, thus having an increased capacity to contribute to the gene pool of the next generations (Falconer and Mackay, 1996; Driscoll et al., 2009). There are three directions in which natural selection operates: positive selection, negative/purifying selection and balancing selection. Positive selection will increase the frequency in the population of advantageous genotypes (de Simoni Gouveia et al., 2014) while purifying selection works in the opposite way by removing unfavorable genotypes and mutations (Charlesworth et al., 1993). Finally, balancing selection will maintain the genetic variation around characteristics that do not affect the survivability or reproductive fitness of individuals (Charlesworth et al., 1993).

Artificial selection is a human-driven process in which a set of specific criteria are used to determine what traits will be considered favorably or unfavorably (de Simoni Gouveia et al., 2014). There are two main categories of artificial selection. Unconscious selection refers to selection without any long-term production objective, very probably the one observed as domestication took place where there was no interest in altering the species in a specific direction (Saravanan et al., 2020). Alternatively, methodical selection started when livestock species were selected with the purpose of improving or changing specific characteristics, with consequences that include the formation of breeds and the separation of types within species (e.g. milk versus wool sheep) (de Simoni Gouveia et al., 2014; Saravanan et al., 2020).

Regardless of the specific type of selection applied to a population, the genetic consequences of natural and artificial selection are the same (Gregory, 2008; Driscoll et al., 2009; de Simoni Gouveia et al., 2014), as selection will ultimately cause changes in the regions of the genome that control the traits for which selection pressure is applied (Weigand and Leese, 2018; Saravanan et al., 2020). When a trait is subjected to any kind of intensive selection, the frequency of those variants (alleles) that have a positive effect on the trait will increase in the population (de Simoni Gouveia et al., 2014). Moreover, positive selection on a given variant tends to carry along linked loci that have no effect on the trait under selection, because of the LD between them. This process is known as genetic hitchhiking (Maynard Smith and Haigh, 2008) or a selective sweep (Charlesworth, 2008). There are different classifications of selective sweeps depending on several characteristics such as the origin, type and frequency of the mutation (Saravanan et al., 2019). However, the two main categories are hard and soft sweeps. Sweeps are considered hard when the beneficial allele under selection quickly increases in frequency and becomes fixed in the population. Hard sweeps reduce genetic diversity in the population because loci near the variant under selection also display fixed or nearly fixed alleles due to high levels of LD, which in turn causes long homozygous regions (Pritchard et al., 2010). Soft sweeps are divided into two important categories. Single-origin soft sweeps occur when selection has been applied to one variant for a relatively short time. Hence, the variant under selection has not yet reached fixation; it shows an increase in frequency but still leaves some genetic variation in the population (Hermisson and Pennings, 2017). Finally, multiple-origin soft sweeps tend to happen in large populations. In this case, different lines or subpopulations carry several beneficial alleles, all of which increase in frequency at the same time, so that no specific one reaches fixation in the population (Hermisson and Pennings, 2005).
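The rise in frequency of a favorable allele described above can be illustrated with a minimal sketch. It assumes a deliberately simplified deterministic haploid model with no drift, mutation or linkage, in which the favored allele has relative fitness 1 + s. The function name, the selection coefficient and the starting frequency are hypothetical values chosen for illustration and are not taken from the cited studies.

```python
def allele_frequency_trajectory(p0, s, generations):
    """Frequency of a favored allele over time under a simplified
    deterministic haploid model: p' = p(1 + s) / (1 + p*s).

    p0          -- starting frequency of the favored allele (assumed)
    s           -- selection coefficient in favor of the allele (assumed)
    generations -- number of generations to iterate
    """
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # Mean fitness of the population is 1 + p*s.
        p = p * (1.0 + s) / (1.0 + p * s)
        freqs.append(p)
    return freqs


# Long-term selection: the allele approaches fixation (hard-sweep-like outcome).
long_run = allele_frequency_trajectory(p0=0.01, s=0.1, generations=200)

# Short-term selection: the allele has risen but is still far from fixation.
short_run = allele_frequency_trajectory(p0=0.01, s=0.1, generations=20)
```

Under these assumptions, the same allele is close to fixation after many generations but remains at an intermediate frequency when selection has acted only briefly, which mirrors the distinction drawn above between hard sweeps and single-origin soft sweeps.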

The regional reductions in genetic variation observed over time around variants that have been under positive selection are known as signatures of selection. Signatures of selection are those regions of the genome that contain causal variants that are or have been under either natural or artificial selection (Qanbari and Simianer, 2014). Given that they contain causal variants, the identification of these signatures of selection provides knowledge of the genetic basis of phenotypes (Charlesworth, 2008). Analysis of signatures of selection provides two important advantages when compared to other methods such as GWAS. First, the methodology for analyzing signatures of selection is focused purely on the genetic makeup, and therefore there is no need to directly input phenotypic scores for each individual as part of the analyses. Second, and most importantly, signatures of selection are detectable even when the beneficial allele is fixed in the population (Qanbari and Simianer, 2014; Weigand and Leese, 2018).

The two main families of analyses used to identify signatures of selection compare genomic data either within or between populations (Saravanan et al., 2020). Within-population analyses include runs of homozygosity (ROH) and pooled heterozygosity (Hp). The ROH approach focuses on continuous segments with no heterozygosity to identify signatures of selection (Qanbari and Simianer, 2014). Another very important application of ROH is to estimate an inbreeding score directly from genomic data as the proportion of the genome composed of ROH. This statistic is known as Froh (Peripolli et al., 2017), and an advantage of this approach is that it is not based on the assumption of a base population with zero inbreeding. Very importantly, when both applications are combined, this method provides an estimate of “localized” inbreeding. Another popular method based on reduced local variability is pooled heterozygosity (Hp) (Rubin et al., 2010). This method estimates genetic variation based on allelic counts across sliding windows. Once the Hp scores are normalized, this method allows a researcher to identify areas of the genome that can be considered outliers due to high heterozygosity (highly positive normalized Hp scores) or high homozygosity (highly negative normalized Hp scores) (Qanbari and Simianer, 2014). Importantly, this method can be very effective at locating signatures of selection when pooled DNA samples are available (e.g. Bertolini et al., 2020). This method has been used in swine, chicken and fish to identify several putative variants associated with traits involved in production and product quality (Rubin et al., 2010; Rubin et al., 2012; Bertolini et al., 2020; Bovo et al., 2020). Other methods based on within-population comparisons include Tajima’s D, composite likelihood ratio (CLR), Fay and Wu’s H-statistic, relative extended haplotype homozygosity (rEHH) and integrated haplotype score (iHS) (Saravanan et al., 2020).
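The two within-population statistics described above can be sketched in a few lines. The sketch assumes that ROH segments have already been called and are given as (start, end) base-pair intervals, and that Hp for a window follows the commonly used form 2·ΣnMAJ·ΣnMIN / (ΣnMAJ + ΣnMIN)², followed by a Z-transformation across windows. Function names, input layouts and the use of the population standard deviation are illustrative assumptions, not the exact pipeline of any cited study.

```python
import statistics


def froh(roh_segments, genome_length_bp):
    """Proportion of the genome covered by runs of homozygosity.

    roh_segments     -- list of (start_bp, end_bp) tuples for one animal,
                        assumed non-overlapping
    genome_length_bp -- total length of the genome covered by markers
    """
    total_roh = sum(end - start for start, end in roh_segments)
    return total_roh / genome_length_bp


def pooled_heterozygosity(major_counts, minor_counts):
    """Hp for one window from per-SNP major and minor allele counts."""
    n_maj = sum(major_counts)
    n_min = sum(minor_counts)
    return 2.0 * n_maj * n_min / (n_maj + n_min) ** 2


def z_transform(values):
    """Normalize window scores to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical example: four windows, the last one nearly homozygous.
windows = [
    ([12, 11, 10], [8, 9, 10]),
    ([13, 12, 11], [7, 8, 9]),
    ([11, 12, 10], [9, 8, 10]),
    ([20, 19, 20], [0, 1, 0]),
]
hp_scores = [pooled_heterozygosity(maj, mn) for maj, mn in windows]
zhp_scores = z_transform(hp_scores)
```

In this toy example the last window receives the most negative normalized score, which is the pattern that would flag it as a candidate region of high homozygosity.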

The main method based on between-population comparisons is Wright’s fixation index (Fst). This method assumes that differentiation between populations is caused by genetic drift and thus, any neutral variants would show similar frequencies in both populations. With this method, loci that have been selected in different directions, or that are in LD with those under selection, in the populations being compared will display elevated Fst values (Gianola et al., 2010). Since Fst is based on divergent selection between populations, it is important to carefully select the populations used in the pairwise comparisons in order to identify signatures of selection related to the specific traits of interest. However, it is important to keep in mind that little statistical power is expected from Fst analysis if the populations are too distant (Fariello et al., 2013). The main advantage of Fst over other methods used to identify signatures of selection is that it is SNP-specific and can theoretically reveal the actual genetic variants under selection. However, it is more efficient to look for a number of consecutive SNPs with elevated Fst scores than to analyze each SNP separately, as single-locus Fst values are highly variable and selective sweeps will cause series of SNPs in LD to produce high Fst scores (Qanbari and Simianer, 2014). Therefore, calculating the mean Fst score (mFst) for windows across the genome is more appropriate (Akey et al., 2010). Fst has been extensively used for detecting the impact of selection in domesticated species (Putney et al., 1989; Hayes et al., 2009b; Boyko et al., 2010; Olsson et al., 2011; Dunham et al., 2012; Petersen et al., 2013; Ramey et al., 2013; Walugembe et al., 2019) as well as to identify variants that are effective at discriminating between breeds (Schiavo et al., 2020; Chinchilla-Vargas et al., 2021).
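The windowed mean Fst idea can be sketched as follows. The sketch assumes two populations of equal weight and uses the basic heterozygosity-based form Fst = (HT − HS) / HT per SNP; published analyses often use other estimators with sample-size corrections. The function names, the window size and the allele frequencies are hypothetical and chosen only for illustration.

```python
def snp_fst(p1, p2):
    """Per-SNP Fst for two equally weighted populations.

    p1, p2 -- frequency of the same allele in population 1 and 2
    """
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)   # expected heterozygosity, pooled
    if h_total == 0.0:
        return 0.0                          # monomorphic in both populations
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def windowed_mean_fst(freqs_pop1, freqs_pop2, window=3, step=1):
    """Mean Fst (mFst) over sliding windows of consecutive SNPs."""
    per_snp = [snp_fst(a, b) for a, b in zip(freqs_pop1, freqs_pop2)]
    last_start = len(per_snp) - window
    return [
        sum(per_snp[i:i + window]) / window
        for i in range(0, last_start + 1, step)
    ]


# Hypothetical allele frequencies at six consecutive SNPs.
pop1 = [0.5, 0.5, 0.9, 1.0, 0.95, 0.5]
pop2 = [0.5, 0.5, 0.1, 0.0, 0.05, 0.5]
mfst = windowed_mean_fst(pop1, pop2, window=3, step=1)
peak_window = mfst.index(max(mfst))
```

Averaging over consecutive SNPs dampens the variability of single-locus values, so the window that spans the run of strongly differentiated SNPs stands out more clearly than any single marker would.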

Although the current strategies available to identify selection signatures are effective, there is still a need to refine these methods to avoid false positives and false negatives. These false results arise because factors such as population size, population bottlenecks, strength of selection and variable recombination rates along the genome can produce effects similar to signatures of selection (Haasl and Payseur, 2016; Weigand and Leese, 2018). Therefore, all these factors should be taken into consideration when interpreting results. Combining alternative approaches is an effective way to address false results, and thus signals that overlap across approaches should be considered true positives (Randhawa et al., 2014; Weigand and Leese, 2018).
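One simple way to combine approaches is to keep only the genomic windows flagged by more than one method. The sketch below assumes each method has already produced a list of outlier windows identified by (chromosome, window index); the function name, the identifiers and the threshold of two methods are hypothetical choices for illustration.

```python
from collections import Counter


def consensus_windows(flagged_by_method, min_methods=2):
    """Return windows flagged by at least `min_methods` different methods.

    flagged_by_method -- dict mapping a method name to an iterable of
                         window identifiers, e.g. (chromosome, window_index)
    """
    support = Counter()
    for windows in flagged_by_method.values():
        # set() ensures a method counts at most once per window.
        support.update(set(windows))
    return sorted(w for w, n in support.items() if n >= min_methods)


# Hypothetical outlier windows from three analyses.
flagged = {
    "ROH": [("1", 10), ("2", 4)],
    "Hp": [("1", 10), ("3", 7)],
    "Fst": [("1", 10), ("2", 4)],
}
shared = consensus_windows(flagged, min_methods=2)
```

Raising `min_methods` makes the consensus stricter, trading a lower risk of false positives for a higher risk of missing true signals.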

The era of precision livestock farming and big data

The livestock sector has faced the challenges of climate change, political demands, and a rapid increase in the demand for animal products since the dawn of the 21st century (Berckmans, 2017). Additionally, the general public is rapidly becoming more educated and critical about how food is produced and how it impacts the environment, pressuring industry into improving welfare and sustainability. These challenges have hastened the adoption by the livestock sector of technologies first seen in crop fields, which include environmental sensors, microphones, cameras and wearable devices (Koltes et al., 2019). The large amounts of data generated by these phenotyping platforms, added to the low prices of high-density SNP genotyping and sequencing technologies, have propelled the field of animal science, and specifically the animal breeding field, into the world of big data. In general, data are considered big data when specialized technology and methodology are needed to analyze and store them because of their large volume, the velocity at which they are received and recorded, or the variety of formats in which they are received and stored (De Mauro et al., 2016; Wolfert et al., 2017; Koltes et al., 2019). This definition is known as the four V model (IBM, http://www.ibmbigdatahub.com/infographic/four-vs-big-data), with the fourth element being veracity (Koltes et al., 2019).

Given the plethora of information sources being developed for livestock production systems and the fast rate at which the technology that manages and unifies this information is evolving, the concept of precision livestock farming is now considered a reality and the gold standard that livestock agriculture is trying to reach. Precision livestock farming involves the use of technology and the internet of things (IoT) to monitor the status of the animals and their environment in real time, allowing producers to adapt and adjust management almost immediately when needed, increasing the welfare, productivity and sustainability of the livestock sector as a whole (Guarino et al., 2017).

From an animal breeding and genetics perspective, the adaptation of technologies to livestock management will allow breeders and geneticists to capture a multitude of new phenotypes. These range from distance covered by individuals and interactions between them using video (Psota et al., 2019) to the use of wearable sensors to record indicator traits of dry matter intake and health events (Siberski, 2019). Ultimately, the capacity to capture multiple novel phenotypes, the rapid reduction in costs of sequencing and genotyping, and the swift advancement of bioinformatics and computational power provide the discipline with a chance to thoroughly tackle issues like genotype by environment interaction (GxE) with the idea of ultimately understanding how the multiple organizational levels of the genome determine phenotypes.


Literature cited

Adesogan, A. T., A. H. Havelaar, S. L. McKune, M. Eilittä, and G. E. Dahl. 2020. Animal source foods: Sustainability problem or malnutrition and sustainability solution? Perspective matters. Glob. Food Sec. 25:100325. doi:10.1016/j.gfs.2019.100325.

Akey, J. M., A. L. Ruhe, D. T. Akey, A. K. Wong, C. F. Connelly, J. Madeoy, T. J. Nicholas, and M. W. Neff. 2010. Tracking footprints of artificial selection in the dog genome. Proc. Natl. Acad. Sci. U. S. A. 107:1160–1165. doi:10.1073/pnas.0909918107. Available from: www.pnas.org/cgi/doi/10.1073/pnas.0909918107

Angaji, S. A. 2010. The candidate gene approach in plant anaerobiosis. Available from: http://www.ncbi.nlm.nih.gov/sites/entrez?db=PubMed

Austin, K. F. 2021. Degradation and disease: Ecologically unequal exchanges cultivate emerging pandemics. World Dev. 137:105163. doi:10.1016/j.worlddev.2020.105163.

Bean, A. G. D., M. L. Baker, C. R. Stewart, C. Cowled, C. Deffrasnes, L. F. Wang, and J. W. Lowenthal. 2013. Studying immunity to zoonotic diseases in the natural host-keeping it real. Nat. Rev. Immunol. 13:851–861. doi:10.1038/nri3551.

Behjati, S., and P. S. Tarpey. 2013. What is next generation sequencing? Arch. Dis. Child. Educ. Pract. Ed. 98:236–238. doi:10.1136/archdischild-2013-304340. Available from: /pmc/articles/PMC3841808/

Berckmans, D. 2017. General introduction to precision livestock farming. Anim. Front. 7:6–11. doi:10.2527/af.2017.0102. Available from: https://academic.oup.com/af/article/7/1/6/4638786

Bertolini, F., J. Chinchilla-Vargas, J. R. Khadse, A. Juneja, P. D. Deshpande, K. Bhave, V. Potdar, P. M. Kakramkar, A. R. Karlekar, A. B. Pande, R. L. Fernando, and M. F. Rothschild. 2019. Marker discovery and associations with β-carotene content in Indian dairy cattle and buffalo breeds. J. Dairy Sci. 102:10039–10055. doi:10.3168/jds.2019-16361.

Bertolini, F., A. Ribani, F. Capoccioni, L. Buttazzoni, V. J. Utzeri, S. Bovo, G. Schiavo, M. Caggiano, L. Fontanesi, and M. F. Rothschild. 2020. Identification of a major locus determining a pigmentation defect in cultivated gilthead seabream (Sparus aurata). Anim. Genet. 51:319–323. doi:10.1111/age.12890. Available from: https://pubmed.ncbi.nlm.nih.gov/31900984/

Bhattacharya, A., D. Pati, N. S. Pillai, and D. B. Dunson. 2012. Bayesian shrinkage. Available from: http://arxiv.org/abs/1212.6088

Bickhart, D. M., and G. E. Liu. 2014. The challenges and importance of structural variation detection in livestock. Front. Genet. 5:1–14. doi:10.3389/fgene.2014.00037.

Blasco, A., and M. A. Toro. 2014. A short critical history of the application of genomics to animal breeding. Livest. Sci. 166:4–9. doi:10.1016/j.livsci.2014.03.015.

Bovo, S., A. Ribani, M. Muñoz, E. Alves, J. P. Araujo, R. Bozzi, M. Čandek-Potokar, R. Charneca, F. Di Palma, G. Etherington, A. I. Fernandez, F. García, J. García-Casco, D. Karolyi, M. Gallo, V. Margeta, J. M. Martins, M. J. Mercat, G. Moscatelli, Y. Núñez, R. Quintanilla, Č. Radović, V. Razmaite, J. Riquet, R. Savić, G. Schiavo, G. Usai, V. J. Utzeri, C. Zimmer, C. Ovilo, and L. Fontanesi. 2020. Whole-genome sequencing of European autochthonous and commercial pig breeds allows the detection of signatures of selection for adaptation of genetic resources to different breeding and production systems. Genet. Sel. Evol. 52:33. doi:10.1186/s12711-020-00553-7. Available from: https://gsejournal.biomedcentral.com/articles/10.1186/s12711-020-00553-7

Boyko, A. R., P. Quignon, L. Li, J. J. Schoenebeck, J. D. Degenhardt, K. E. Lohmueller, K. Zhao, A. Brisbin, H. G. Parker, B. M. vonHoldt, M. Cargill, A. Auton, A. Reynolds, A. G. Elkahloun, M. Castelhano, D. S. Mosher, N. B. Sutter, G. S. Johnson, J. Novembre, M. J. Hubisz, A. Siepel, R. K. Wayne, C. D. Bustamante, and E. A. Ostrander. 2010. A Simple Genetic Architecture Underlies Morphological Variation in Dogs. H. E. Hoekstra, editor. PLoS Biol. 8:e1000451. doi:10.1371/journal.pbio.1000451. Available from: https://dx.plos.org/10.1371/journal.pbio.1000451

Briles, W. E., H. A. Stone, and R. K. Cole. 1977. Marek’s disease: Effects of B histocompatibility alloalleles in resistant and susceptible chicken lines. Science (80-. ). 195:193–195. doi:10.1126/science.831269. Available from: https://science.sciencemag.org/content/195/4274/193

Brookes, A. J. 1999. The essence of SNPs. Gene. 234:177–186. doi:10.1016/S0378-1119(99)00219-X.

Bunter, K. L., S. Hermesch, B. G. Luxford, H. U. Graser, and R. E. Crump. 2005. Insulin-like growth factor-I measured in juvenile pigs is genetically correlated with economically important performance traits. In: Australian Journal of Experimental Agriculture. Vol. 45. CSIRO PUBLISHING. p. 783–792. Available from: http://www.publish.csiro.au/?paper=EA05048

Chapman, A. B. 1987. Jay Laurence Lush 1896 - 5/22/82. National Academy of Sciences, Washington D.C.

Charlesworth, B. 2008. A hitch-hiking guide to the genome: A commentary on “The hitch-hiking effect of a favourable gene” by John Maynard Smith and John Haigh. Genet. Res. (Camb). 89:389–390. doi:10.1017/S0016672308009580. Available from: https://doi.org/10.1017/S0016672308009580

Charlesworth, B. 2017. Haldane and modern evolutionary genetics. J. Genet. 96:773–782. doi:10.1007/s12041-017-0833-4. Available from: http://link.springer.com/10.1007/s12041-017-0833-4

Charlesworth, B., M. T. Morgan, and D. Charlesworth. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics. 134:1289–1303. doi:10.1093/genetics/134.4.1289. Available from: /pmc/articles/PMC1205596/?report=abstract

Chinchilla-Vargas, J., F. Bertolini, K. J. Stalder, J. P. Steibel, and M. F. Rothschild. 2021. Estimating breed composition for pigs: A case study focused on Mangalitsa pigs and two methods. Livest. Sci. 244:104398. doi:10.1016/j.livsci.2021.104398. Available from: https://linkinghub.elsevier.com/retrieve/pii/S1871141321000068

Craig Venter, J., M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew, D. H. Huson, J. R. Wortman, Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski, G. Subramanian, P. D. Thomas, J. Zhang, G. L. Gabor Miklos, C. Nelson, S. Broder, A. G. Clark, J. Nadeau, V. A. McKusick, N. Zinder, A. J. Levine, R. J. Roberts, M. Simon, C. Slayman, M. Hunkapiller, R. Bolanos, A. Delcher, I. Dew, D. Fasulo, M. Flanigan, L. Florea, A. Halpern, S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K. Reinert, K. Remington, J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R. Brandon, M. Cargill, I. Chandramouliswaran, R. Charlab, K. Chaturvedi, Z. Deng, V. di Francesco, P. Dunn, K. Eilbeck, C. Evangelista, A. E. Gabrielian, W. Gan, W. Ge, F. Gong, Z. Gu, P. Guan, T. J. Heiman, M. E. Higgins, R. R. Ji, Z. Ke, K. A. Ketchum, Z. Lai, Y. Lei, Z. Li, J. Li, Y. Liang, X. Lin, F. Lu, G. V. Merkulov, N. Milshina, H. M. Moore, A. K. Naik, V. A. Narayan, B. Neelam, D. Nusskern, D. B. Rusch, S. Salzberg, W. Shao, B. Shue, J. Sun, Z. Yuan Wang, A. Wang, X. Wang, J. Wang, M. H. Wei, R. Wides, et al. 2001. The sequence of the human genome. Science (80-. ). 291:1304–1351. doi:10.1126/science.1058040.

Crow, J. F. 2007. . Elsevier.

Dekkers, J. C. M. 2012. Application of Genomics Tools to Animal Breeding. Curr. Genomics. 13:207–212. doi:10.2174/138920212800543057. Available from: /pmc/articles/PMC3382275/?report=abstract

Dekkers, J. C. M., J. P. Gibson, P. Bijma, and J. a. M. van Arendonk. 2004. Design and optimisation of animal breeding programmes: Wageningen University, The Netherlands. 1–16. Available from: https://www.dphu.org/uploads/attachements/books/books_2338_0.pdf

Dekkers, J. C. M., and F. Hospital. 2002. The use of molecular genetics in the improvement of agricultural populations. Nat. Rev. Genet. 3:22–32. doi:10.1038/nrg701. Available from: www.nature.com/reviews/genetics

Driscoll, C. A., D. W. Macdonald, and S. J. O’Brien. 2009. From wild animals to domestic pets, an evolutionary view of domestication. Proc. Natl. Acad. Sci. U. S. A. 106 Suppl:9971–9978. doi:10.1073/pnas.0901586106. Available from: www.nasonline.org/SacklerDarwin.


Dunham, I., A. Kundaje, S. F. Aldred, P. J. Collins, C. A. Davis, F. Doyle, C. B. Epstein, S. Frietze, J. Harrow, R. Kaul, J. Khatun, B. R. Lajoie, S. G. Landt, B. K. Lee, F. Pauli, K. R. Rosenbloom, P. Sabo, A. Safi, A. Sanyal, N. Shoresh, J. M. Simon, L. Song, N. D. Trinklein, R. C. Altshuler, E. Birney, J. B. Brown, C. Cheng, S. Djebali, X. Dong, J. Ernst, T. S. Furey, M. Gerstein, B. Giardine, M. Greven, R. C. Hardison, R. S. Harris, J. Herrero, M. M. Hoffman, S. Iyer, M. Kellis, P. Kheradpour, T. Lassmann, Q. Li, X. Lin, G. K. Marinov, A. Merkel, A. Mortazavi, S. C. J. Parker, T. E. Reddy, J. Rozowsky, F. Schlesinger, R. E. Thurman, J. Wang, L. D. Ward, T. W. Whitfield, S. P. Wilder, W. Wu, H. S. Xi, K. Y. Yip, J. Zhuang, B. E. Bernstein, E. D. Green, C. Gunter, M. Snyder, M. J. Pazin, R. F. Lowdon, L. A. L. Dillon, L. B. Adams, C. J. Kelly, J. Zhang, J. R. Wexler, P. J. Good, E. A. Feingold, G. E. Crawford, J. Dekker, L. Elnitski, P. J. Farnham, M. C. Giddings, T. R. Gingeras, R. Guigó, T. J. Hubbard, W. J. Kent, J. D. Lieb, E. H. Margulies, R. M. Myers, J. A. Stamatoyannopoulos, S. A. Tenenbaum, Z. Weng, K. P. White, B. Wold, Y. Yu, J. Wrobel, B. A. Risk, H. P. Gunawardena, H. C. Kuiper, C. W. Maier, L. Xie, X. Chen, et al. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature. 489:57–74. doi:10.1038/nature11247.

Erbe, M., B. J. Hayes, L. K. Matukumalli, S. Goswami, P. J. Bowman, C. M. Reich, B. A. Mason, and M. E. Goddard. 2012. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J. Dairy Sci. 95:4114–4129. doi:10.3168/jds.2011-5019. Available from: https://pubmed.ncbi.nlm.nih.gov/22720968/

Falconer, D. S., and T. F. C. Mackay. 1996. Introduction to Quantitative Genetics. Fourth. Longman Group Limited, Essex, England.

Fan, B., S. K. Onteru, G. S. Plastow, and M. F. Rothschild. 2009. Detailed characterization of the porcine MC4R gene in relation to fatness and growth. Anim. Genet. 40:401–409. doi:10.1111/j.1365-2052.2009.01853.x. Available from: https://pubmed.ncbi.nlm.nih.gov/19397528/

Fariello, M. I., S. Boitard, H. Naya, M. SanCristobal, and B. Servin. 2013. Detecting signatures of selection through haplotype differentiation among hierarchically structured populations. Genetics. 193:929–941. doi:10.1534/genetics.112.147231. Available from: https://academic.oup.com/genetics/article/193/3/929/5935284

Fernando, R. L., D. Nettleton, B. R. Southey, J. C. M. Dekkers, M. F. Rothschild, and M. Soller. 2004. Controlling the Proportion of False Positives in Multiple Dependent Tests. Genetics. 166:611–619. doi:10.1534/genetics.166.1.611. Available from: https://www.genetics.org/content/166/1/611

Fernando, R., A. Toosi, A. Wolc, D. Garrick, and J. Dekkers. 2017. Application of Whole-Genome Prediction Methods for Genome-Wide Association Studies: A Bayesian Approach. J. Agric. Biol. Environ. Stat. 22:172–193. doi:10.1007/s13253-017-0277-6. Available from: https://link.springer.com/content/pdf/10.1007%2Fs13253-017-0277-6.pdf

Fisher, R. A. 1919. The Correlation between Relatives on the Supposition of Mendelian Inheritance. Trans. R. Soc. Edinburgh. 52:399–433. doi:10.1017/S0080456800012163. Available from: https://pdfs.semanticscholar.org/72de/31338c688d820c30ae5380d5a6b2ad98017a.pdf

França, L. T. C., E. Carrilho, and T. B. L. Kist. 2002. A review of DNA sequencing techniques. Q. Rev. Biophys. 35:169–200. doi:10.1017/S0033583502003797.

Galton, F. 1886. Regression Towards Mediocrity in Hereditary Stature. J. Anthropol. Inst. Gt. Britain Irel. 15:246–263. Available from: http://www.stat.ucla.edu/~nchristo/statistics100C/history_regression.pdf; http://www.jstor.org/stable/2841583

Georges, M., C. Charlier, and B. Hayes. 2019. Harnessing genomic information for livestock improvement. Nat. Rev. Genet. 20:135–156. doi:10.1038/s41576-018-0082-2. Available from: www.nature.com/nrg

Giani, A. M., G. R. Gallo, L. Gianfranceschi, and G. Formenti. 2019. Long walk to genomics: History and current approaches to genome sequencing and assembly. Comput. Struct. Biotechnol. J. 18:9–19. doi:10.1016/J.CSBJ.2019.11.002. Available from: https://www.sciencedirect.com/science/article/pii/S2001037019303277?via%3Dihub

Gianola, D. 2013. Priors in whole-genome regression: The Bayesian alphabet returns. Genetics. 194:573–596. doi:10.1534/genetics.113.151753. Available from: https://www.genetics.org/content/194/3/573

Gianola, D., and G. J. M. Rosa. 2015. One hundred years of statistical developments in animal breeding. Annu. Rev. Anim. Biosci. 3:19–56. doi:10.1146/annurev-animal-022114-110733. Available from: http://www.annualreviews.org/doi/10.1146/annurev-animal-022114-110733

Gianola, D., H. Simianer, and S. Qanbari. 2010. A two-step method for detecting selection signatures using genetic markers. Genet. Res. (Camb). 92:141–155. doi:10.1017/S0016672310000121.

Gladwin, C. H., A. M. Thomson, J. S. Peterson, and A. S. Anderson. 2001. Addressing food security in Africa via multiple livelihood strategies of women farmers. Food Policy. 26:177–207. doi:10.1016/S0306-9192(00)00045-2. Available from: https://www.sciencedirect.com/science/article/pii/S0306919200000452

Goddard, M. E., K. E. Kemper, I. M. MacLeod, A. J. Chamberlain, and B. J. Hayes. 2016. Genetics of complex traits: Prediction of phenotype, identification of causal polymorphisms and genetic architecture. Proc. R. Soc. B Biol. Sci. 283. doi:10.1098/rspb.2016.0569. Available from: http://dx.doi.org/10.1098/rspb.2016.0569


Gregory, T. R. 2008. Artificial Selection and Domestication: Modern Lessons from Darwin’s Enduring Analogy. Evol. Educ. Outreach. 2:5–27. doi:10.1007/s12052-008-0114-z. Available from: https://evolution-outreach.biomedcentral.com/articles/10.1007/s12052-008-0114-z

Guarino, M., T. Norton, Dries Berckmans, E. Vranken, and Daniel Berckmans. 2017. A blueprint for developing and applying precision livestock farming tools: A key output of the EU-PLF project. Anim. Front. 7:12–17. doi:10.2527/af.2017.0103.

Haasl, R. J., and B. A. Payseur. 2016. Fifteen years of genomewide scans for selection: trends, lessons and unaddressed genetic sources of complication. Mol. Ecol. 25:5–23. doi:10.1111/mec.13339. Available from: http://doi.wiley.com/10.1111/mec.13339

Harding, J. 2010. What We’re about to Receive: Food Insecurity. London Rev. Books. 23. Available from: https://www.lrb.co.uk/the-paper/v32/n09/jeremy-harding/what-we-re-about-to-receive

Hayes, B., and M. Goddard. 2010. Genome-wide association and genomic selection in animal breeding. Genome. 53:876–883. doi:10.1139/G10-076. Available from: https://pubmed.ncbi.nlm.nih.gov/21076503/

Hayes, B. J., P. J. Bowman, A. J. Chamberlain, and M. E. Goddard. 2009a. Invited review: Genomic selection in dairy cattle: Progress and challenges. J. Dairy Sci. 92:433–443. doi:10.3168/jds.2008-1646.

Hayes, B. J., A. J. Chamberlain, S. Maceachern, K. Savin, H. McPartlan, I. MacLeod, L. Sethuraman, and M. E. Goddard. 2009b. A genome map of divergent artificial selection between Bos taurus dairy cattle and Bos taurus beef cattle. Anim. Genet. 40:176–184. doi:10.1111/j.1365-2052.2008.01815.x. Available from: http://doi.wiley.com/10.1111/j.1365-2052.2008.01815.x

Hazel, L. N. 1943. The Genetic Basis for Constructing Selection Indexes. Genetics. 28:476–90. Available from: http://www.ncbi.nlm.nih.gov/pubmed/17247099

Hermisson, J., and P. S. Pennings. 2005. Soft sweeps: Molecular population genetics of adaptation from standing genetic variation. Genetics. 169:2335–2352. doi:10.1534/genetics.104.036947. Available from: https://academic.oup.com/genetics/article/169/4/2335-2352/6059609

Hermisson, J., and P. S. Pennings. 2017. Soft sweeps and beyond: understanding the patterns and probabilities of selection footprints under rapid adaptation. J. Kelley, editor. Methods Ecol. Evol. 8:700–716. doi:10.1111/2041-210X.12808. Available from: https://onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.12808

Hill, W. G. 2013. Genetic Correlation. In: Brenner’s Encyclopedia of Genetics: Second Edition. Elsevier Inc. p. 237–239.

Hirschhorn, J. N., and M. J. Daly. 2005. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 6:95–108. doi:10.1038/nrg1521. Available from: www.nature.com/reviews/genetics

Kim, S., and A. Misra. 2007. SNP genotyping: Technologies and biomedical applications. Annu. Rev. Biomed. Eng. 9:289–320. doi:10.1146/annurev.bioeng.9.060906.152037.

Koltes, J. E., J. B. Cole, R. Clemmens, R. N. Dilger, L. M. Kramer, J. K. Lunney, M. E. McCue, S. D. McKay, R. G. Mateescu, B. M. Murdoch, R. Reuter, C. E. Rexroad, G. J. M. Rosa, N. V. L. Serão, S. N. White, M. J. Woodward-Greene, M. Worku, H. Zhang, and J. M. Reecy. 2019. A Vision for Development and Utilization of High-Throughput Phenotyping and Big Data Analytics in Livestock. Front. Genet. 10:1197. doi:10.3389/fgene.2019.01197. Available from: https://www.frontiersin.org/article/10.3389/fgene.2019.01197/full

König, S., and H. H. Swalve. 2009. Application of selection index calculations to determine selection strategies in genomic breeding programs. J. Dairy Sci. 92:5292–5303. doi:10.3168/jds.2009-2232. Available from: https://pubmed.ncbi.nlm.nih.gov/19762847/

Kramer, L. M., M. S. Mayes, E. Fritz-Waters, J. L. Williams, E. D. Downey, R. G. Tait, A. Woolums, C. Chase, and J. M. Reecy. 2017. Evaluation of responses to vaccination of angus cattle for four viruses that contribute to bovine respiratory disease complex. J. Anim. Sci. 95:4820–4834. doi:10.2527/jas2017.1793. Available from: /pmc/articles/PMC6292290/?report=abstract

Lebbie, S. H. B. 2004. Goats under household conditions. In: Small Ruminant Research. Vol. 51. p. 131–136. Available from: https://ac.els-cdn.com/S0921448803002694/1-s2.0-S0921448803002694-main.pdf?_tid=spdf-3edc665a-b1fa-49ab-83f4-a030f5414e65&acdnat=1519755873_e64412f06f730c799f18fab6c6ea3a51

Lipkin, E., M. O. Mosig, A. Darvasi, E. Ezra, A. Shalom, A. Friedmann, and M. Soller. 1998. Quantitative trait locus mapping in dairy cattle by means of selective milk DNA pooling using dinucleotide microsatellite markers: analysis of milk percentage. Genetics. 149:1557. Available from: /pmc/articles/PMC1460242/?report=abstract

Lush, J. L. 1947. Animal Breeding Plans. Second. (R. B. Limited, editor.). Ames, Iowa.

De Mauro, A., M. Greco, and M. Grimaldi. 2016. A formal definition of Big Data based on its essential features. Libr. Rev. 65:122–135. doi:10.1108/LR-06-2015-0061.

Maynard Smith, J., and J. Haigh. 2008. The hitch-hiking effect of a favourable gene. Genet. Res. (Camb). 89:391–403. doi:10.1017/S0016672308009579. Available from: https://doi.org/10.1017/S0016672308009579

McKune, S. L., E. C. Borresen, A. G. Young, T. D. Auria Ryley, S. L. Russo, A. Diao Camara, M. Coleman, and E. P. Ryan. 2015. Climate change through a gendered lens: Examining livestock holder food security. Glob. Food Sec. 6:1–8. doi:10.1016/j.gfs.2015.05.001.

McLeod, A. 2011. World Livestock 2011: Livestock in food security.

Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 157:1819–1829. doi:10.1046/j.1365-2540.1998.00308.x. Available from: http://www.genetics.org/content/genetics/157/4/1819.full.pdf

Meuwissen, T., B. Hayes, and M. Goddard. 2016. Genomic selection: A paradigm shift in animal breeding. Anim. Front. 6:6–14. doi:10.2527/af.2016-0002. Available from: https://academic.oup.com/af/article-abstract/6/1/6/4638797

Naqvi, A.-U.-N. N. 2007. Application of Molecular Genetic Technologies in Livestock Production: Potentials for Developing Countries. Adv. Biol. Res. (Rennes). 1:3–4. Available from: https://www.researchgate.net/publication/242303915

Olsson, M., J. R. S. Meadows, K. Truvé, G. Rosengren Pielberg, F. Puppo, E. Mauceli, J. Quilez, N. Tonomura, G. Zanna, M. J. Docampo, A. Bassols, A. C. Avery, E. K. Karlsson, A. Thomas, D. L. Kastner, E. Bongcam-Rudloff, M. T. Webster, A. Sanchez, Å. Hedhammar, E. F. Remmers, L. Andersson, L. Ferrer, L. Tintle, and K. Lindblad-Toh. 2011. A Novel Unstable Duplication Upstream of HAS2 Predisposes to a Breed-Defining Skin Phenotype and a Periodic Fever Syndrome in Chinese Shar-Pei Dogs. M. Georges, editor. PLoS Genet. 7:e1001332. doi:10.1371/journal.pgen.1001332. Available from: https://dx.plos.org/10.1371/journal.pgen.1001332

Otte, J., A. Costales, J. Dijkman, U. Pica-Ciamarra, T. Robinson, V. Ahuja, C. Ly, and D. Roland-Holst. 2012. A Living from Livestock Pro-Poor Livestock Policy Initiative Livestock sector development for poverty reduction: an economic and policy perspective Livestock’s many virtues.

Peripolli, E., D. P. Munari, M. V. G. B. Silva, A. L. F. Lima, R. Irgang, and F. Baldi. 2017. Runs of homozygosity: current knowledge and applications in livestock. Anim. Genet. 48:255–271. doi:10.1111/age.12526. Available from: http://doi.wiley.com/10.1111/age.12526

Petersen, J. L., J. R. Mickelson, A. K. Rendahl, S. J. Valberg, L. S. Andersson, J. Axelsson, E. Bailey, D. Bannasch, M. M. Binns, A. S. Borges, P. Brama, A. da Câmara Machado, S. Capomaccio, K. Cappelli, E. G. Cothran, O. Distl, L. Fox-Clipsham, K. T. Graves, G. Guérin, B. Haase, T. Hasegawa, K. Hemmann, E. W. Hill, T. Leeb, G. Lindgren, H. Lohi, M. S. Lopes, B. A. McGivney, S. Mikko, N. Orr, M. C. T. Penedo, R. J. Piercy, M. Raekallio, S. Rieder, K. H. Røed, J. Swinburne, T. Tozaki, M. Vaudin, C. M. Wade, and M. E. McCue. 2013. Genome-Wide Analysis Reveals Selection for Important Traits in Domestic Horse Breeds. J. M. Akey, editor. PLoS Genet. 9:e1003211. doi:10.1371/journal.pgen.1003211. Available from: https://dx.plos.org/10.1371/journal.pgen.1003211

Pinstrup-Andersen, P. 2009. Food security: definition and measurement. Food Secur. 1:5–7. doi:10.1007/s12571-008-0002-y. Available from: https://link.springer.com/article/10.1007/s12571-008-0002-y

Pritchard, J. K., J. K. Pickrell, and G. Coop. 2010. The Genetics of Human Adaptation: Hard Sweeps, Soft Sweeps, and Polygenic Adaptation. Curr. Biol. 20:R208–R215. doi:10.1016/j.cub.2009.11.055. Available from: http://www.cell.com/article/S0960982209020703/fulltext

Psota, E., M. Mittek, L. Pérez, T. Schmidt, and B. Mote. 2019. Multi-Pig Part Detection and Association with a Fully-Convolutional Network. Sensors. 19:852. doi:10.3390/s19040852. Available from: http://www.mdpi.com/1424-8220/19/4/852

Putney, D. J., M. Drost, and W. W. Thatcher. 1989. Influence of summer heat stress on pregnancy rates of lactating dairy cattle following embryo transfer or artificial insemination. Theriogenology. 31:765–778. doi:10.1016/0093-691X(89)90022-8. Available from: http://ac.els-cdn.com/0093691X89900228/1-s2.0-0093691X89900228-main.pdf?_tid=3cdb14a2-3cb1-11e7-b1c6-00000aab0f02&acdnat=1495211938_29801442d9934e161db60bfc6913d1cb

Qanbari, S., and H. Simianer. 2014. Mapping signatures of positive selection in the genome of livestock. Livest. Sci. 166:133–143. doi:10.1016/j.livsci.2014.05.003.

Quisumbing, A. R., L. R. Brown, H. S. Feldstein, L. Haddad, and C. Pena. 1995. Women: The Key to Food Security. Food Policy Rep. 2:26. doi:10.1196/annals.1425.001. Available from: http://www.globalfoodsec.net/static/text/ifpri_womenthekeytofoodsec.pdf

Ramey, H. R., J. E. Decker, S. D. McKay, M. M. Rolf, R. D. Schnabel, and J. F. Taylor. 2013. Detection of selective sweeps in cattle using genome-wide SNP data. BMC Genomics. 14:382. doi:10.1186/1471-2164-14-382. Available from: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-382

Randhawa, I. A. S., M. S. Khatkar, P. C. Thomson, and H. W. Raadsma. 2014. Composite selection signals can localize the trait specific genomic regions in multi-breed populations of cattle and sheep. BMC Genet. 15:34. doi:10.1186/1471-2156-15-34. Available from: http://bmcgenet.biomedcentral.com/articles/10.1186/1471-2156-15-34

Rege, J. E. O., K. Marshall, A. Notenbaert, J. M. K. Ojango, and A. M. Okeyo. 2011. Pro-poor animal improvement and breeding - What can science do? Livest. Sci. 136:15–28. doi:10.1016/j.livsci.2010.09.003. Available from: https://www.sciencedirect.com/science/article/pii/S1871141310004774

Robledo, D., O. Matika, A. Hamilton, and R. D. Houston. 2018. Genome-wide association and genomic selection for resistance to amoebic gill disease in Atlantic salmon. G3 Genes, Genomes, Genet. 8:1195–1203. doi:10.1534/g3.118.200075. Available from: https://academic.oup.com/g3journal/article/8/4/1195-1203/5941683

Rojas-Downing, M. M., A. P. Nejadhashemi, T. Harrigan, and S. A. Woznicki. 2017. Climate change and livestock: Impacts, adaptation, and mitigation. Clim. Risk Manag. 16:145–163. doi:10.1016/j.crm.2017.02.001. Available from: https://www.sciencedirect.com/science/article/pii/S221209631730027X

Rothschild, M. F., and M. Soller. 1997. Candidate gene analysis to detect genes controlling traits of economic importance in domestic livestock. Probe (Lond). 8:13–20.

Rothschild, M. F., C. Jacobson, D. Vaske, C. K. Tuggle, L. Wang, T. Short, G. Eckardt, S. Sasaki, A. Vincent, D. McLaren, O. Southwood, H. van der Steen, A. Mileham, and G. Plastow. 1996. The estrogen receptor locus is associated with a major gene influencing litter size in pigs. Proc. Natl. Acad. Sci. 93:201–205. doi:10.1073/pnas.93.1.201. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC40206/pdf/pnas01505-0212.pdf

Rougemont, Q., A. Carrier, J. Le Luyer, A. L. Ferchaud, J. M. Farrell, D. Hatin, P. Brodeur, and L. Bernatchez. 2019. Combining population genomics and forward simulations to investigate stocking impacts: A case study of Muskellunge (Esox masquinongy) from the St. Lawrence River basin. Evol. Appl. 12:902–922. doi:10.1111/eva.12765.

Rubin, C. J., H. J. Megens, A. M. Barrio, K. Maqbool, S. Sayyab, D. Schwochow, C. Wang, Ö. Carlborg, P. Jern, C. B. Jørgensen, A. L. Archibald, M. Fredholm, M. A. M. Groenen, and L. Andersson. 2012. Strong signatures of selection in the domestic pig genome. Proc. Natl. Acad. Sci. U. S. A. 109:19529–19536. doi:10.1073/pnas.1217149109. Available from: www.pnas.org/cgi/doi/10.1073/pnas.1217149109

Rubin, C. J., M. C. Zody, J. Eriksson, J. R. S. Meadows, E. Sherwood, M. T. Webster, L. Jiang, M. Ingman, T. Sharpe, S. Ka, F. Hallböök, F. Besnier, R. Carlborg, B. Bedhom, M. Tixier-Boichard, P. Jensen, P. Siegel, K. Lindblad-Toh, and L. Andersson. 2010. Whole-genome resequencing reveals loci under selection during chicken domestication. Nature. 464:587–591. doi:10.1038/nature08832. Available from: http://gris.ulb.ac.be/

Saravanan, K. A., M. Panigrahi, H. Kumar, and B. Bhushan. 2019. Advanced software programs for the analysis of genetic diversity in livestock genomics: a mini review. Biol. Rhythm Res. 00:1–11. doi:10.1080/09291016.2019.1642650. Available from: https://doi.org/10.1080/09291016.2019.1642650

Saravanan, K. A., M. Panigrahi, H. Kumar, B. Bhushan, T. Dutt, and B. P. Mishra. 2020. Selection signatures in livestock genome: A review of concepts, approaches and applications. Livest. Sci. 241:104257. doi:10.1016/j.livsci.2020.104257. Available from: https://doi.org/10.1016/j.livsci.2020.104257

Schaeffer, L. R. 2006. Strategy for applying genome-wide selection in dairy cattle. J. Anim. Breed. Genet. 123:218–223. doi:10.1111/j.1439-0388.2006.00595.x. Available from: https://pubmed.ncbi.nlm.nih.gov/16882088/

Schaeffer, L. R. 2010. Linear Models and Animal Breeding. Centre for Genetic Improvement of Livestock, Department of Animal and Poultry Science, University of Guelph, Guelph, ON N1G 2W1. June 2010, Norway. doi:10.1080/02705060.2010.9664387.

Schiavo, G., F. Bertolini, G. Galimberti, S. Bovo, S. Dall’olio, L. Nanni Costa, M. Gallo, and L. Fontanesi. 2020. A machine learning approach for the identification of population-informative markers from high-throughput genotyping data: Application to several pig breeds. Animal. doi:10.1017/S1751731119002167. Available from: https://www.r-project.org/

Schmid, M., and J. Bennewitz. 2017. Invited review: Genome-wide association analysis for quantitative traits in livestock - A selective review of statistical models and experimental designs. Arch. Anim. Breed. 60:335–346. doi:10.5194/aab-60-335-2017. Available from: https://aab.copernicus.org/articles/60/335/2017/

Sellner, E. M., J. W. Kim, M. C. McClure, K. H. Taylor, R. D. Schnabel, and J. F. Taylor. 2007. Board-invited review: Applications of genomic information in livestock. J. Anim. Sci. 85:3148–3158. doi:10.2527/jas.2007-0291. Available from: https://pubmed.ncbi.nlm.nih.gov/17709778/

Sharma, A., J.-E. Park, H.-H. Chai, G.-W. Jang, S.-H. Lee, and D. Lim. 2017. Next generation sequencing in livestock species- A Review. J. Anim. Breed. Genomics. 1. doi:10.12972/jabng.20170003. Available from: https://doi.org/10.12972/jabng.20170003

Sharma, A., J. S. Lee, C. G. Dang, P. Sudrajad, H. C. Kim, S. H. Yeon, H. S. Kang, and S. H. Lee. 2015. Stories and challenges of genome wide association studies in livestock - a review. Asian-Australasian J. Anim. Sci. 28:1371–1379. doi:10.5713/ajas.14.0715. Available from: /pmc/articles/PMC4554843/

Siberski, C. J. 2019. Investigating automated sensor measures as possible indicator traits of feed intake and health traits in dairy cattle. Iowa State University, Ames, Iowa. Available from: https://lib.dr.iastate.edu/etd

de Simoni Gouveia, J. J., M. V. G. B. da Silva, S. R. Paiva, and S. M. P. de Oliveira. 2014. Identification of selection signatures in livestock species. Genet. Mol. Biol. 37:330–342. doi:10.1590/S1415-47572014000300004. Available from: www.sbg.org.br

Smith, C., and S. P. Simpson. 1986. The use of genetic polymorphisms in livestock improvement. J. Anim. Breed. Genet. 103:205–217. doi:10.1111/j.1439-0388.1986.tb00083.x. Available from: http://doi.wiley.com/10.1111/j.1439-0388.1986.tb00083.x

Soller, M. 1978. The use of loci associated with quantitative effects in dairy cattle improvement. Anim. Prod. 27:133–139. doi:10.1017/S0003356100035960. Available from: https://www.cambridge.org/core/journals/animal-science/article/abs/use-of-loci-associated-with-quantitative-effects-in-dairy-cattle-improvement/DE55CDAF7770ACC859497AA8C4A5F352

Steinfeld, H., P. Gerber, T. Wassenaar, V. Castel, M. Rosales, and C. de Haan. 2006. Livestock’s long shadow.

Teng, J., N. Gao, Haibin Zhang, X. Li, J. Li, Hao Zhang, X. Zhang, and Z. Zhang. 2019. Performance of whole genome prediction for growth traits in a crossbred chicken population. Poult. Sci. 98:1968–1975. doi:10.3382/ps/pey604.

Thornton, P. K. 2010. Livestock production: recent trends, future prospects. Philos. Trans. R. Soc. B Biol. Sci. 365:2853–2867. doi:10.1098/rstb.2010.0134. Available from: http://rstb.royalsocietypublishing.org/content/royptb/365/1554/2853.full.pdf

Thornton, P. K., and M. Herrero. 2014. Climate change adaptation in mixed crop-livestock systems in developing countries. Elsevier. Available from: http://dx.doi.org/10.1016/j.gfs.2014.02.002

Thornton, P. K., R. L. Kruska, N. Henninger, P. M. Kristjanson, R. S. Reid, F. Atieno, A. N. Odero, and T. Ndegwa. 2002. Mapping Poverty and Livestock in the Developing World. International Livestock Research Institute. Available from: https://cgspace.cgiar.org/handle/10568/915

Verbyla, K. L., P. J. Bowman, B. J. Hayes, and M. E. Goddard. 2010. Sensitivity of genomic selection to using different prior distributions. BMC Proc. 4:S5. doi:10.1186/1753-6561-4-s1-s5. Available from: https://bmcproc.biomedcentral.com/articles/10.1186/1753-6561-4-S1-S5

Verbyla, K. L., B. J. Hayes, P. J. Bowman, and M. E. Goddard. 2009. Accuracy of genomic selection using stochastic search variable selection in Australian holstein friesian dairy cattle. Genet. Res. (Camb). 91:307–311. doi:10.1017/S0016672309990243. Available from: https://pubmed.ncbi.nlm.nih.gov/19922694/

Walugembe, M., F. Bertolini, C. M. B. Dematawewa, M. P. Reis, A. R. Elbeltagy, C. J. Schmidt, S. J. Lamont, and M. F. Rothschild. 2019. Detection of Selection Signatures Among Brazilian, Sri Lankan, and Egyptian Chicken Populations Under Different Environmental Conditions. Front. Genet. 9:737. doi:10.3389/fgene.2018.00737. Available from: https://www.frontiersin.org/article/10.3389/fgene.2018.00737/full

Watson, J. D., and F. H. C. Crick. 1953. Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid. Nature. 171:737–738. doi:10.1038/171737a0. Available from: https://www.nature.com/articles/171737a0

Weigand, H., and F. Leese. 2018. Detecting signatures of positive selection in non-model species using genomic data. Zool. J. Linn. Soc. 184:528–583. doi:10.1093/zoolinnean/zly007. Available from: https://academic.oup.com/zoolinnean/article/184/2/528/4970495

Wolfert, S., L. Ge, C. Verdouw, and M. J. Bogaardt. 2017. Big Data in Smart Farming – A review. Agric. Syst. 153:69–80. doi:10.1016/j.agsy.2017.01.023.

Wood, A. R., T. Esko, J. Yang, S. Vedantam, T. H. Pers, S. Gustafsson, A. Y. Chu, K. Estrada, J. Luan, Z. Kutalik, N. Amin, M. L. Buchkovich, D. C. Croteau-Chonka, F. R. Day, Y. Duan, T. Fall, R. Fehrmann, T. Ferreira, A. U. Jackson, J. Karjalainen, K. S. Lo, A. E. Locke, R. Mägi, E. Mihailov, E. Porcu, J. C. Randall, A. Scherag, A. A. E. Vinkhuyzen, H. J. Westra, T. W. Winkler, T. Workalemahu, J. H. Zhao, D. Absher, E. Albrecht, D. Anderson, J. Baron, M. Beekman, A. Demirkan, G. B. Ehret, B. Feenstra, M. F. Feitosa, K. Fischer, R. M. Fraser, A. Goel, J. Gong, A. E. Justice, S. Kanoni, M. E. Kleber, K. Kristiansson, U. Lim, V. Lotay, J. C. Lui, M. Mangino, I. M. Leach, C. Medina-Gomez, M. A. Nalls, D. R. Nyholt, C. D. Palmer, D. Pasko, S. Pechlivanis, I. Prokopenko, J. S. Ried, S. Ripke, D. Shungin, A. Stancáková, R. J. Strawbridge, Y. J. Sung, T. Tanaka, A. Teumer, S. Trompet, S. W. Van Der Laan, J. Van Setten, J. V. Van Vliet-Ostaptchouk, Z. Wang, L. Yengo, W. Zhang, U. Afzal, J. Ärnlöv, G. M. Arscott, S. Bandinelli, A. Barrett, C. Bellis, A. J. Bennett, C. Berne, M. Blüher, J. L. Bolton, Y. Böttcher, H. A. Boyd, M. Bruinenberg, B. M. Buckley, S. Buyske, I. H. Caspersen, P. S. Chines, R. Clarke, S. Claudi-Boehm, M. Cooper, E. W. Daw, P. A. De Jong, et al. 2014. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet. 46:1173–1186. doi:10.1038/ng.3097. Available from: https://www.nature.com/articles/ng.3097

Woolliams, J. A., and C. Smith. 1988. The value of indicator traits in the genetic improvement of dairy cattle. Anim. Prod. 46:333–345. doi:10.1017/S0003356100018948. Available from: https://www.cambridge.org/core/product/identifier/S0003356100018948/type/journal_article

Yang, A.-Q., B. Chen, M.-L. Ran, G.-M. Yang, and C. Zeng. 2020. The application of genomic selection in pig cross breeding. Yi chuan = Hered. 42:145–152. doi:10.16288/j.yczz.19-253. Available from: https://europepmc.org/article/med/32102771

Yang, J., N. A. Zaitlen, M. E. Goddard, P. M. Visscher, and A. L. Price. 2014. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 46:100–106. doi:10.1038/ng.2876. Available from: https://www.nature.com/articles/ng.2876

Zhang, H., Z. Wang, S. Wang, and H. Li. 2012. Progress of genome wide association study in domestic animals. J. Anim. Sci. Biotechnol. 3:26. doi:10.1186/2049-1891-3-26. Available from: http://jasbsci.biomedcentral.com/articles/10.1186/2049-1891-3-26

Zhu, M., B. Zhu, Y. H. Wang, Y. Wu, L. Xu, L. P. Guo, Z. R. Yuan, L. P. Zhang, X. Gao, H. J. Gao, S. Z. Xu, and J. Y. Li. 2013. Linkage disequilibrium estimation of Chinese beef simmental cattle using high-density SNP panels. Asian-Australasian J. Anim. Sci. 26:772–779. doi:10.5713/ajas.2012.12721. Available from: /pmc/articles/PMC4093237/

CHAPTER 3. MARKER DISCOVERY AND ASSOCIATIONS WITH BETA-CAROTENE CONTENT IN INDIAN DAIRY CATTLE AND BUFFALO BREEDS

F. Bertolini1,2*, J. Chinchilla-Vargas1*, J.R. Khadse3, A. Juneja3, P.D. Deshpande3, P.M. Kakramjar3, A.R. Karlekar3, A.B. Pande3, Rohan L. Fernando1 & M.F. Rothschild1**

1 Iowa State University, Department of Animal Science, 806 Stange Road, 2255 Kildee Hall, Ames, Iowa 50011, USA

2 National Institute of Aquatic Resources, Technical University of Denmark, Kemitorvet, 2800 Kgs. Lyngby, Denmark

3 BAIF Development Research Foundation, Bhavan, Dr. Manibhai Desai Nagar, Warje, Pune 411058, India

* Shared equally in the production of this work

Modified from a manuscript published in Journal of Dairy Science 102 (11), 10039 – 10055.

Abstract

Vitamin A is essential for human health, but current intake levels in many developing countries such as India are too low due to malnutrition. According to the World Health Organization (WHO), an estimated 250 million preschool children are vitamin A deficient globally. This number excludes pregnant women and nursing mothers, who are particularly vulnerable. Efforts to improve access to vitamin A are key because supplementation can reduce mortality rates in young children in developing countries by around 23%. Three key genes, BCMO1, BCO2 and SCARB1, have been shown to be associated with the amount of beta-carotene in milk. Whole genome sequencing reads from the coordinates of these three genes in 202 non-Indian cattle (141 Bos taurus, 61 B. indicus) and 35 non-Indian buffalo (Bubalus bubalis) animals from several breeds were collected from data repositories. The number of SNPs detected in the coding regions of these three genes varied from 16 to 26 across the three species, with 5 overlapping SNPs between B. taurus and B. indicus. All these SNPs, together with two SNPs in the upstream part of the gene that were already present in dbSNP (https://www.ncbi.nlm.nih.gov/projects/SNP/), were used to build a custom Sequenom array. Blood samples for DNA and milk samples for beta-carotene (BC) were obtained from 2,291 Indian cows of five different breeds (Gir, Holstein cross, Jersey cross, Tharparkar and Sahiwal) and 2,242 Indian buffaloes (Jafarabadi, Murrah, Pandharpuri and Surti breeds). DNA was extracted and genotyped with the Sequenom array. For each individual breed and for the combined breeds, SNPs with an association P-value < 0.3 in the first round of linear analysis were included in a second step of regression analyses to determine the allele substitution effects on the content of beta-carotene in milk. Additionally, an F-test for all SNPs within each gene was performed to determine whether the gene as a whole had a significant effect on the content of beta-carotene in milk. The analyses were repeated using a Bayesian approach to compare and validate the frequentist results. Multiple significant SNPs were found using both methodologies, with allele substitution effects (SE) ranging from -6.21 (3.13) to 9.10 (5.43) micrograms of beta-carotene per 100 ml of milk. Total gene effects exceeded the mean BC value for all breeds with both analysis approaches. The custom panel designed for genes related to beta-carotene production demonstrated applicability in genotyping cattle and buffalo in India and may be used for cattle or buffalo from other developing countries. Moreover, selection for specific significant alleles of some gene markers provides a route to effectively increase the beta-carotene content in milk in the Indian cattle and buffalo populations.

Key words: beta-carotene; SNP; milk; cattle; buffalo

Introduction

Vitamin A plays a key role in human health. Including adequate amounts of vitamin A in the diet is key for the development and maintenance of healthy vision (Bennasir et al., 2010), proper functioning of the immune system (Hussey and Klein, 1990), and improved red blood cell and hemoglobin production (Lynch, 2009), in addition to the prevention of diseases such as Alzheimer's disease and schizophrenia (Davis et al., 1991; Goodman and Pardee, 2003). Moreover, vitamin A is related to successful growth in early childhood and embryonic development (Semba, 2009). According to the World Health Organization, 250 million children are vitamin A deficient worldwide, and improving access to vitamin A can have a large impact, especially in developing countries such as India.

Vitamin A supplementation through naturally or artificially fortified food can reduce mortality rates in young children by about 23% (Beaton et al., 1993). Beta-carotene (BC) could also replace vitamin A, as it can be metabolized to vitamin A after ingestion (Bennasir et al., 2010). Beta-carotene is fat-soluble, and thus it is most efficiently absorbed in the presence of fat components. Therefore, milk is an ideal food for its delivery (Ribaya-Mercado, 2002). Consequently, selection for increased beta-carotene content in milk could be a good approach to improve the nutritional value of milk (Berry et al., 2009).

Genomic technologies have recently facilitated the identification of three key genes in beta-carotene metabolism: beta-carotene oxygenase 1 (BCMO1 or BCO1) and beta-carotene oxygenase 2 (BCMO2 or BCO2), which are involved in the cleavage of beta-carotene (D’Ambrosio et al., 2011), and scavenger receptor class B member 1 (SCARB1), which is involved in cellular transport (Valacchi et al., 2011). A QTL related to milk BC content and linked to the BCO2 gene has been reported, and subsequent research revealed allelic variants that are associated with different amounts of beta-carotene in milk (Berry et al., 2009). These findings suggest that selection for beneficial alleles could improve beta-carotene levels in milk. The aim of this work was to map and identify single nucleotide polymorphisms (SNPs) in the three candidate genes BCMO1, SCARB1 and BCO2 in cattle and buffalo using the available next-generation sequencing resources, to develop a SNP panel, and to use this panel to detect SNPs that are associated with beta-carotene content in several Indian cattle and buffalo breeds.

Materials and methods

Animal care

The sample collection for this project was performed using animal care procedures approved by Iowa State University (IACUC Log # 7-15-8061-B) and by the BAIF veterinarian from the BAIF (Bharatiya Agro Industries Foundation) research foundation, meeting the required standards in India, and with approval from the Bill & Melinda Gates Foundation (BMGF).

High throughput SNP discovery and building of the Sequenom custom panel

Reads from whole genome sequencing of 202 cattle (141 Bos taurus, 61 B. indicus) and 35 buffalo (Bubalus bubalis) non-Indian animals from several breeds were collected from the SRA (Sequence Read Archive; https://www.ncbi.nlm.nih.gov/sra) database (Bos taurus) as part of the 1000 Bull Genomes Project (Daetwyler et al., 2014), from the International Buffalo Consortium (Sonstegard et al., 2012) (Bubalus bubalis) or from other projects (e.g. Stafuzza et al., 2017 or data not shown) (Bos indicus). The list of breeds is reported in Table S1. Because of the high genomic similarity between Bos taurus and Bos indicus, reads from both species were aligned against the same UMD3.1 reference genome (GCA_000003055.3). B. bubalis reads were aligned to the most recent buffalo reference genome available at the time of the analysis: UMD_CASPUR_WB_2.0 (GCA_000471725.1). Coordinates of the BCMO1, BCO2 and SCARB1 genes were retrieved through Ensembl (www.ensembl.org) for Bos taurus and through alignment (https://blast.ncbi.nlm.nih.gov/Blast.cgi) of the coding sequence of the three genes against the UMD3.1 and UMD_CASPUR_WB_2.0 genomes for Bos indicus and Bubalus bubalis, respectively.

For Bos taurus, BAM files corresponding to the genomic coordinates of the three genes were retrieved directly from the SRA database. For Bos indicus and Bubalus bubalis, the Burrows-Wheeler Aligner (BWA-MEM) with standard settings (Li and Durbin, 2010) and Samtools (Li, 2011) were used to align the reads to the respective reference genomes and extract the portions corresponding to the genomic coordinates of the three genes. A filtering step was then applied to the reduced BAM files of all three species to discard all reads with mapping quality < 10, an arbitrarily chosen threshold, to reduce the influence of ambiguous alignments to repetitive regions by excluding reads that had a probability of alternative alignment > 10% (Hwang et al., 2019). After that, the standard pipeline of Samtools (Li, 2011) or GATK (McKenna, 2017) was applied to call the variants in all the samples for the BCMO1, BCO2 and SCARB1 genes, where only SNPs with SNP quality ≥ 30 and a coverage depth of at least 6X in at least one animal (or at least two for B. taurus) were considered. For cattle, the effect of each SNP was evaluated with the Variant Effect Predictor (VEP; McLaren et al., 2016). For buffalo SNPs, because of the lack of variant effect information, the effect was determined by comparing the predicted protein sequence obtained by changing one allele at a time with the reference protein sequence using the BLAST web tool (https://blast.ncbi.nlm.nih.gov/Blast.cgi).
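The link between the mapping-quality threshold and the 10% figure above can be illustrated with a minimal sketch. It assumes the usual Phred-scaled definition of mapping quality (MAPQ = -10 log10 of the probability that the read is misplaced); the function names and the toy reads are hypothetical and are not part of BWA or Samtools.

```python
def mapq_to_error_probability(mapq: float) -> float:
    """Convert a Phred-scaled mapping quality into the probability of a wrong alignment."""
    return 10 ** (-mapq / 10)


def passes_mapq_filter(mapq: int, threshold: int = 10) -> bool:
    """Keep a read only if its mapping quality is at least the threshold."""
    return mapq >= threshold


# Hypothetical (read name, MAPQ) pairs used only for illustration.
reads = [("read1", 0), ("read2", 9), ("read3", 10), ("read4", 60)]
kept = [name for name, mapq in reads if passes_mapq_filter(mapq)]

print(mapq_to_error_probability(10))
print(kept)
```

In practice the same filter is usually applied directly to the BAM file, for example with the `-q 10` option of `samtools view`, which skips alignments with a mapping quality below 10.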

By comparing the amino acid changes, we were able to determine whether a mutation in the coding region was “synonymous” (when no amino acid change was detected) or “non-synonymous” (when the mutation caused an amino acid change in the protein relative to the RefSeq protein).
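The synonymous versus non-synonymous distinction can be sketched as a direct codon lookup. This is only an illustration under stated assumptions: it uses the standard genetic code and an in-frame coding sequence, whereas the study compared predicted and reference protein sequences with BLAST; the function name and the short sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_coding_snp(cds: str, pos: int, alt: str) -> str:
    """Classify a substitution at 0-based position `pos` of an in-frame coding sequence."""
    start = pos - pos % 3                      # first base of the affected codon
    ref_codon = cds[start:start + 3]
    offset = pos % 3                           # position of the SNP within the codon
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    same_residue = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same_residue else "non-synonymous"


cds = "ATGGCTTGG"  # hypothetical coding sequence: Met-Ala-Trp
print(classify_coding_snp(cds, 5, "C"))  # GCT -> GCC, both alanine
print(classify_coding_snp(cds, 3, "A"))  # GCT -> ACT, alanine to threonine
```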

SNPs located in the coding region or in a UTR, or already reported in dbSNP, were considered for a Sequenom panel. For each selected SNP, probes for the panel were designed by Geneseek (Lincoln, Nebraska). The panel (composed of 67 SNPs, as shown in the Results) was then tested by genotyping a subset of Indian samples belonging to five different breeds (Table S2), for which DNA was extracted from blood following standard protocols. Each animal was genotyped in triplicate.

Beta-carotene measurement and SNP genotyping

Beta-carotene concentration was measured by HPLC (high-performance liquid chromatography) in milk samples collected from 2,291 cattle (Holstein cross, Jersey cross, Sahiwal, Tharparkar and Gir) and 2,242 buffalo (Jafarabadi, Murrah, Pandharpuri, Mehsana and Surti), as shown in Table S3. For each animal, information on lactation, milk yield, location and farmer was also collected if available. The same animals were genotyped with the previously developed Sequenom custom array and, for each breed, only SNPs with a call rate ≥ 0.90 and belonging to the same species (cattle or buffalo) were considered. For the combined analyses within species, SNPs had to have a call rate > 90% in all the breeds to be included. The remaining missing genotypes were then imputed breed by breed using FImpute 2.2 (Sargolzaei et al., 2014). Pairwise linkage disequilibrium (LD) among SNPs of the same gene was calculated with the Haploview software (Barrett et al., 2005). Pairs of SNPs with r² > 0.6 were considered to be in strong LD.
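The two quality measures used above, call rate and pairwise r², can be sketched as follows. This is a simplified illustration, not the Haploview procedure: it assumes phased two-locus haplotypes are already known, whereas Haploview estimates haplotype frequencies from unphased genotypes. All function names and data are hypothetical.

```python
def call_rate(genotypes):
    """Fraction of non-missing genotype calls (None marks a missing call)."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def r_squared(haplotypes):
    """Pairwise LD r^2 from phased haplotypes given as (allele at SNP 1, allele at SNP 2), coded 0/1.

    Both SNPs must be polymorphic, otherwise the denominator is zero.
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(h[1] for h in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(h[0] * h[1] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


snp_calls = [0, 1, None, 2, 1, 0, 2, 1, 1, 0]        # one missing call out of ten
complete_ld = [(1, 1)] * 4 + [(0, 0)] * 4            # alleles always travel together
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]             # alleles are independent

print(call_rate(snp_calls) >= 0.90)
print(r_squared(complete_ld) > 0.6, r_squared(no_ld) > 0.6)
```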

Statistical association analyses

For each SNP, the genotypes were coded as 0 for homozygotes for one allele, 1 for heterozygotes, and 2 for homozygotes for the other allele. ANOVAs were then performed with proc GLM in SAS 9.4 (SAS Institute, 2013), initially using three different linear models for each breed of cattle and buffalo and for each species: 1) The first linear model included only non-SNP fixed effects, depending on the availability of information on location, farmer and breed. 2) To examine the contribution of SNPs to the variability of beta-carotene content in milk, all SNPs were added to the first model as covariates. 3) Finally, only SNPs that showed a P-value less than 0.3 were included in the model to estimate their additive effects, using the “solutions” option in SAS 9.4 (SAS Institute, 2013). Further, when available, the number of lactations and milk yield were included in the models as covariates. Additionally, in the case of buffaloes, a fixed effect of batch was included in all the models, because data for buffaloes were collected in two separate periods with marked differences in precipitation that might have affected the concentration of beta-carotene in milk. The option of correcting for multiple testing with adaptive FDR (data not shown) was explored and, as expected, the number of significant SNPs dropped. However, given the limited number of SNPs tested and the previous SNP filtering steps, it was finally decided that correcting for multiple tests was not necessary and would reduce the information desired.
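As a rough sketch of the genotype coding and of what an allele substitution effect means, the snippet below regresses a phenotype on the allele count of a single SNP and then applies the P < 0.3 screening rule. It is deliberately simplified: the fixed effects and covariates of the SAS models are omitted, and the genotypes, phenotypes, SNP names and P-values are all hypothetical.

```python
def code_genotype(genotype: str, counted_allele: str) -> int:
    """Code a genotype such as 'AG' as the number of copies (0, 1 or 2) of the counted allele."""
    return genotype.count(counted_allele)


def substitution_effect(codes, phenotypes):
    """Least-squares slope of phenotype on allele count, i.e. the allele substitution effect."""
    n = len(codes)
    mean_x = sum(codes) / n
    mean_y = sum(phenotypes) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(codes, phenotypes))
    sxx = sum((x - mean_x) ** 2 for x in codes)
    return sxy / sxx


# Hypothetical genotypes and beta-carotene values (micrograms per 100 ml of milk).
genotypes = ["GG", "AG", "AA", "AG", "GG", "AA"]
phenotypes = [10.0, 13.0, 16.0, 13.0, 10.0, 16.0]
codes = [code_genotype(g, "A") for g in genotypes]
effect = substitution_effect(codes, phenotypes)

# Hypothetical nominal P-values from the model with all SNPs; keep those below 0.3.
nominal_p = {"snp_a": 0.02, "snp_b": 0.45, "snp_c": 0.28}
retained = [snp for snp, p in nominal_p.items() if p < 0.3]

print(codes, effect, retained)
```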

There is no known history of selection for beta-carotene content in the population studied. Because of this, the last step of this analysis was to contrast the hypothetical extreme cases of animals that were homozygous for all the favorable alleles (P < 0.30) versus animals that were homozygous for the unfavorable alleles, with the objective of demonstrating the impact that long-term sustained selection could have on the beta-carotene content in milk. These analyses were performed for each individual gene and across all three genes, for each species and for the combined breeds.

Suppose SNP covariates are coded as 0, 1 or 2 (number of copies of the "A" allele). Then, at locus $j$, if the "A" allele is favorable, the substitution effect, $\beta_j$, is positive, and if it is unfavorable, $\beta_j$ is negative. So, if locus $j$ is favorable, the difference between the favorable and unfavorable homozygotes will be $2 \times \beta_j$. On the other hand, if locus $j$ is unfavorable, this difference becomes $2 \times (-\beta_j)$. Thus, at an arbitrary locus, the difference between the favorable and unfavorable homozygotes is $2 \times |\beta_j|$, and across all the loci, the difference between the most favorable and least favorable genotypic values is:

$$D = 2 \sum_j |\beta_j|$$

But, as $\beta_j$ is not observed, $D$ is estimated as:

$$\hat{D} = 2 \sum_j |\hat{\beta}_j|$$

These analyses were performed using proc GLM in SAS 9.4 (SAS Institute, 2013) along with the estimate function to obtain coefficients, standard errors and a nominal p-value for the difference between the genotypes contrasted.
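The point estimate of this contrast is simple arithmetic, as the following minimal sketch shows. The effect estimates are hypothetical, and the sketch covers only the point estimate: the standard error and nominal p-value also depend on the covariances among the estimates, which the SAS estimate function handles and which are not reproduced here.

```python
def homozygote_contrast(effect_estimates):
    """Estimated difference between the all-favorable and all-unfavorable homozygotes: 2 * sum(|beta_j|)."""
    return 2 * sum(abs(beta) for beta in effect_estimates)


# Hypothetical substitution effect estimates for three retained SNPs.
estimates = [1.5, -2.0, 0.5]
contrast = homozygote_contrast(estimates)
print(contrast)
```

The sign of each estimate only indicates which allele is favorable, so it is the absolute values that accumulate in the contrast.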

In addition to the previously described frequentist approach, a Bayesian analysis was also undertaken with a model that considered the non-SNP effects as fixed and all SNP effects as random. It has been recognized that explicit adjustments for multiple tests are not needed when inference is based on posterior probabilities (Stephens and Balding, 2009; Gelman et al., 2012; Chen et al., 2017; Fernando et al., 2017), and thus the Bayesian analysis does not suffer from the multiple-test penalty (Stephens and Balding, 2009). As adjustments for multiple testing were not undertaken in the frequentist approach, the Bayesian approach provides a useful validation of the results from the frequentist approach.

Initially, a BayesA prior was employed, given that BayesA assumes that all the markers included in the model have an effect on the phenotype. Because only the markers that showed a nominal P-value < 0.3 in the previous steps were used, it was assumed that this prior would produce accurate predictions. Additionally, to examine the usefulness of another Bayesian prior for inference of the SNP effects, two sets of simulations were employed. Each simulated data set was composed of 500 observations and 20 markers. In the first set, no markers had an effect on the phenotype, while in the second set 25% of the markers had an effect on the phenotype. These simulated data were analyzed with BayesA (Meuwissen et al., 2001), where SNP substitution effects are a priori assumed to be identically and independently distributed t random variables, and with BayesCπ (Habier et al., 2011), where they are assumed to be identically and independently distributed with a point mass at 0 with probability π or normally distributed with probability 1 − π. Further, in BayesCπ, π is assumed unknown with a uniform prior between 0 and 1. For both simulated data sets, BayesCπ consistently produced more accurate estimates of the contrast between homozygotes for all favorable versus all unfavorable alleles than BayesA (simulated data results not shown).
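A data set with the dimensions described above could be generated along the following lines. This is a sketch under assumptions that the text does not report: the allele frequency (0.5), the distribution of the non-zero effects (standard normal) and the residual variance (1) are all hypothetical choices, and the function name is illustrative. Fitting BayesA or BayesCπ to such data is not shown.

```python
import random


def simulate_marker_data(n_obs=500, n_markers=20, prop_causal=0.25, allele_freq=0.5, seed=1):
    """Simulate 0/1/2 genotypes, marker effects and phenotypes for a small marker panel."""
    rng = random.Random(seed)
    n_causal = round(n_markers * prop_causal)
    causal = set(rng.sample(range(n_markers), n_causal))
    # Non-causal markers have an effect of exactly zero (assumed effect distribution: N(0, 1)).
    effects = [rng.gauss(0, 1) if j in causal else 0.0 for j in range(n_markers)]
    # Each genotype is the sum of two independent allele draws.
    genotypes = [
        [(rng.random() < allele_freq) + (rng.random() < allele_freq) for _ in range(n_markers)]
        for _ in range(n_obs)
    ]
    # Phenotype = sum of allele counts times effects, plus a N(0, 1) residual.
    phenotypes = [
        sum(g * b for g, b in zip(row, effects)) + rng.gauss(0, 1) for row in genotypes
    ]
    return genotypes, effects, phenotypes


# Scenario 1: no marker has an effect. Scenario 2: 25% of the markers have an effect.
null_set = simulate_marker_data(prop_causal=0.0)
causal_set = simulate_marker_data(prop_causal=0.25)
true_contrast = 2 * sum(abs(b) for b in causal_set[1])
print(len(causal_set[0]), len(causal_set[0][0]), true_contrast)
```

Because the true effects are known in a simulation, the true contrast between extreme homozygotes can be computed and compared with the estimate obtained under each prior.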

With this evidence, an additional Bayesian analysis was undertaken using the BayesCπ prior for SNP effects. Inferences on marker effects were then based on their posterior distributions, which were estimated from Markov chain Monte Carlo samples obtained using the JWAS (Cheng et al., 2018) package.

Results and discussion

SNP discovery and panel performance

Milk is a rich source of BC, which is one of the members of naturally occurring carotenoids and BC is also abundantly available in plants (fruits and vegetables), that humans obtain through foods of plant origin (Olson, 1999). BC is present in cow milk but at lower levels in buffalo (Ullah et al., 2017). In this study, an initial breed and species comparison demonstrated differences in BC level in milk among cattle and buffalo but shows that beta- carotene can be detected also in buffalo breeds, with Jafarabadi having a concentration higher than most of the cattle breeds considered (Table 3.2). BC can be acquired from the bloodstream by various tissues within the body, to be stored or be readily metabolized (Shete et al., 2013).

Besides milk, the liver is a major organ that accumulates large quantities of BC, followed by several other tissues (Schmitz et al., 1991). Three key genes were considered in our analyses because they were previously reported to be associated with BC levels and are directly involved in beta-carotene metabolism. The first is the BCMO1 gene, whose product symmetrically cleaves beta-carotene into two molecules of retinal using a dioxygenase mechanism. The role of BCMO1 in BC conversion efficiency has been clarified, and a genetic variant in humans that can affect BC conversion efficiency has been reported (Lindqvist et al., 2007). The BCO2 gene is also involved in the cleavage of beta-carotene, through asymmetrical cleavage (Amengual et al., 2011). A SNP corresponding to a stop codon in this gene has been associated with beta-carotene content in milk in cattle (Berry et al., 2009). The third gene, SCARB1, is involved in cellular uptake of several provitamin A carotenoids, including beta-carotene. Genetic variation in this gene associated with plasma beta-carotene has also been reported (Borel et al., 2013), as well as variation related to carotenoid-based coloration in birds (Toomey et al., 2017).

Overall, a total of 1,576 SNPs were detected across the three genes for B. taurus, 2,225 for B. indicus and 3,824 for B. bubalis (Figure 3.1). The differences in the number of SNPs are probably due to the crossbreds considered and the differences between the two species. Only two B. taurus breeds (Holstein and Jersey) were considered for the analyses because these two are extensively employed around the world and are often crossbred with B. indicus (FAO, 2013). A total of 67 SNPs met the defined parameters ((a) in a coding region, (b) in the gene UTR or (c) reported in dbSNP) and were thus selected to compose the SNP panel (Table 3.1). They were divided as follows: 17 SNPs targeted for B. taurus (6 for BCMO1, 5 for BCO2 and 6 for SCARB1), 27 SNPs for B. indicus (10 for BCMO1, 7 for BCO2 and 10 for SCARB1) and 23 SNPs for B. bubalis (8 for BCMO1, 6 for BCO2 and 9 for SCARB1). For convenience, they were uniquely named with the name of the gene followed by a successive number according to the position in the gene and species identification (first all the cattle SNPs, then the buffalo SNPs). Five SNPs overlapped between B. taurus and B. indicus (SCARB1.1 with SCARB1.2, SCARB1.5 with SCARB1.6, SCARB1.7 with SCARB1.8, SCARB1.13 with SCARB1.14, and BCMO1.10 with BCMO1.11; Table 3.1). The number of SNPs detected in the coding regions varied from 16 to 26 in the three species, with 5 overlapping SNPs between B. taurus and B. indicus. This is due to the high similarity between the two species, which are often crossed to improve production. Therefore, all SNPs designed on cattle were considered for both breed crosses and B. indicus breeds. Genotyping of the Indian test panel cattle (Table S4a) showed that BCMO1.4 was not successfully genotyped. The other SNPs showed a call rate ranging from 75 to 100%, with 29 SNPs (12 for the BCO2 gene, 15 for the SCARB1 gene and 2 for the BCMO1 gene) having a call rate ≥ 90% in all the considered cattle breeds and crosses. For the buffalo breeds (Table S4b), 17 out of the 24 SNPs designed for buffalo were genotyped with a call rate ≥ 90% in all the breeds. These SNPs were five for BCO2, seven for SCARB1 and five for the BCMO1 gene.
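The call-rate criterion used above (a call rate of at least 90% in every breed) can be sketched as follows. The data layout, breed and SNP labels, and function names are hypothetical and serve only to make the filter concrete.

```python
def call_rate(calls):
    """Fraction of successful genotype calls for one SNP (None = failed call)."""
    return sum(c is not None for c in calls) / len(calls)

def snps_passing(genotypes_by_breed, threshold=0.90):
    """SNPs whose call rate is >= threshold in every breed."""
    breeds = list(genotypes_by_breed.values())
    shared = set.intersection(*(set(b) for b in breeds))
    return sorted(s for s in shared
                  if all(call_rate(b[s]) >= threshold for b in breeds))

# Hypothetical toy data: ten animals per breed, two SNPs.
toy = {
    "BreedA": {
        "SNP1": ["AA", "AG", None, "GG", None, "AA", "AA", "AG", "GG", "AA"],
        "SNP2": ["CC", "CT", "TT", "CC", "CT", "CC", "TT", "CT", "CC", "CC"],
    },
    "BreedB": {
        "SNP1": ["AA", "AG", "AG", "GG", "AA", "AA", "AA", "AG", "GG", "AA"],
        "SNP2": ["CC", "CT", "TT", "CC", "CT", "CC", "TT", "CT", "CC", "CC"],
    },
}
passing = snps_passing(toy)
```

In this toy example SNP1 fails because two of ten calls are missing in one breed, so only SNP2 is retained.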

Genotyping of the beta-carotene samples and association analyses

Linkage disequilibrium

Genotyping of the animals was successful in at least one breed for all the SNPs except BCMO1.1 (cattle) and BCMO1.17 (buffalo). The number of SNPs successfully genotyped for cattle breeds ranged from 42 (Gir) to 31 (Tharparkar), with the combined cattle breeds having 28 high-quality SNPs in common. In buffalo, the number of high-quality SNPs ranged from 22 (Jafarabadi, Mehsana, Pandharpuri) to 20 (Murrah, Surti), while the combined buffalo breeds had 20 successfully genotyped SNPs in common. The pairwise analyses performed on cattle and buffalo SNPs revealed a low number of SNPs in high linkage disequilibrium (LD) (Table S5a and b). As expected, most of the duplicated SNPs in cattle and buffalo were in strong LD in most of the breeds. In cattle, the BCO2 gene had BCO2.2 and BCO2.6 in strong LD in the Tharparkar breed, while the SCARB1 gene had several SNPs in strong LD, with SCARB1.13 and SCARB1.14 in strong LD in all the considered breeds. The BCMO1 gene had BCMO1.8 and BCMO1.9 in high LD within the Gir, Sahiwal and Jersey breeds. As for the buffalo breeds, only the SCARB1 gene had four SNPs (SCARB1.5, SCARB1.6, SCARB1.7 and SCARB1.8) in strong LD in all the breeds.
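For readers unfamiliar with pairwise LD, the sketch below computes r2 between two SNPs as the squared correlation of their allele counts. This genotype-based measure is a common approximation and is used here purely for illustration; it is not necessarily the haplotype-based estimate used in this study, and the function name and data are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two 0/1/2 allele-count vectors."""
    # Keep only animals genotyped at both SNPs (None = missing call).
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a monomorphic SNP
    return sxy * sxy / (sxx * syy)

# Hypothetical allele counts for three SNPs in six animals.
snp_a = [0, 1, 2, 1, 0, 2]
snp_b = [0, 1, 2, 1, 0, 2]   # identical to snp_a: complete LD
snp_c = [2, 0, 1, 1, 2, 0]
ld_ab = r_squared(snp_a, snp_b)
ld_ac = r_squared(snp_a, snp_c)
```

Two SNPs detected independently at the same genomic position, such as the duplicated cattle SNPs, would be expected to give values close to 1 under this measure.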

Beta-carotene and association analyses

An initial breed and species comparison demonstrated differences in beta-carotene level in milk among cattle and buffalo, with buffalo showing a lower beta-carotene content than cattle (Table 3.2). This may be because buffalo can convert some portion of beta-carotene directly into vitamin A (Ullah et al., 2017).

The specific linear models used for association analyses for each breed and species are shown in Tables 3.3 and 3.4, respectively. For each model, adding different SNPs to the linear model increased the R² value by only a limited amount in most cases, suggesting that variation in beta-carotene content was mostly affected by environmental effects. The highest R² was seen for the models for Sahiwal cattle and the lowest for the models for Tharparkar cattle. For buffalo, the lowest R² was produced in the combined analyses, while the highest was produced by the Murrah breed.
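The comparison of R² between a model with fixed effects only and a model that also includes a SNP can be illustrated with a small sketch. It assumes a single fixed effect (labelled "farmer" here) and a single SNP fitted as an allele-count covariate, which is far simpler than the models in Tables 3.3 and 3.4; the data and function name are hypothetical.

```python
def r2_fixed_plus_snp(y, group, snp):
    """Return (R2 of y ~ group, R2 of y ~ group + snp)."""
    n = len(y)
    grand_mean = sum(y) / n
    sst = sum((v - grand_mean) ** 2 for v in y)

    def within(values):
        # Deviations from group means absorb the fixed (group) effect.
        sums, counts = {}, {}
        for g, v in zip(group, values):
            sums[g] = sums.get(g, 0.0) + v
            counts[g] = counts.get(g, 0) + 1
        return [v - sums[g] / counts[g] for g, v in zip(group, values)]

    yw, xw = within(y), within(snp)
    sse_base = sum(v * v for v in yw)
    sxx = sum(v * v for v in xw)
    sxy = sum(a * b for a, b in zip(xw, yw))
    # Adding the covariate removes sxy^2 / sxx from the residual sum of squares.
    sse_full = sse_base - (sxy * sxy / sxx if sxx > 0 else 0.0)
    return 1 - sse_base / sst, 1 - sse_full / sst

# Hypothetical data: two farmers, four animals each.
group = ["f1"] * 4 + ["f2"] * 4
snp = [0, 1, 1, 2, 0, 1, 1, 2]
y = [4.1, 4.4, 4.7, 4.8, 6.1, 6.4, 6.7, 6.8]
r2_base, r2_full = r2_fixed_plus_snp(y, group, snp)
```

When the fixed effect already explains most of the variation, as in this toy example, the SNP can only add a small increment to R², which mirrors the pattern described above.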

The allele substitution effects were estimated with the complete linear model for each breed, including the SNPs with P < 0.3, because removing SNPs with P > 0.3 did not change the overall Type I error rate and was a good compromise between acceptance and rejection. Tables 3.5 and 3.6 present the allele substitution effects and the recommended alleles to select for each SNP, as well as the gene-wise F-test performed to determine which genes as a whole had a significant effect on beta-carotene concentration. All breeds of cattle and buffalo showed SNPs with P-value < 0.3 except for the Mehsana buffalo breed. Even though all SNPs with P-value < 0.3 were included in the final analyses, attention should be centered on SNPs that are significant at 0.05. These include 23 SNPs for cattle (13 for the combined breeds, 9 for Holstein, 10 for Jersey, 11 for Gir, 13 for Sahiwal and 4 for Tharparkar; Table 3.5) and 17 for buffalo (10 for the combined breeds, 7 for Jafarabadi, 6 for Murrah, 3 for Pandharpuri and 9 for Surti; Table 3.6). If these markers are employed in selection for BC improvement, efforts should be focused on the recommended alleles for these SNPs.
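The gene effects reported in Table 3.5 appear to be consistent with the contrast between animals homozygous for the favorable allele and animals homozygous for the unfavorable allele at every SNP in the gene, that is, twice the sum of the absolute allele substitution effects. This reading is an inference from the table values and is not stated explicitly in the text; the sketch below checks it against the four BCO2 effects listed for Jersey crosses.

```python
def gene_effect(substitution_effects):
    """Favorable-homozygote minus unfavorable-homozygote contrast for one gene."""
    # Each SNP contributes two allele substitutions in the favorable direction.
    return sum(2.0 * abs(a) for a in substitution_effects)

# Allele substitution effects (µg/100 ml) for BCO2 in Jersey crosses, Table 3.5.
jersey_bco2 = [-1.55, -2.66, -1.09, 1.36]
contrast = round(gene_effect(jersey_bco2), 2)
print(contrast)  # 13.32, matching the BCO2 gene effect listed for Jersey crosses
```

The same calculation gives values close to the reported gene effects in the other breeds, with small differences that would be expected from rounding of the tabulated effects.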

In contrast to another study, which used different methodologies and rejection thresholds and found one marker in the BCO2 gene associated with differences in beta-carotene concentration in cow milk (Berry et al., 2009), this research found multiple markers in the BCO2, BCMO1 and SCARB1 genes for the populations of cattle and buffalo investigated.

In cattle, the significant SNP with the largest allele substitution effect was SCARB1.16 in the Sahiwal breed, with P-value < 0.05 and an allele substitution effect of -6.21 ± 3.13 micrograms of beta-carotene/100 ml of milk; for this SNP, animals should be selected for the C allele. In buffalo, SCARB1.20 in Surti animals was the significant SNP with the largest allele substitution effect, -2.07 ± 0.88 micrograms of beta-carotene/100 ml of milk; in this case, selection should be made for the G allele. It is important to note that there are three SNPs with P < 0.05 in the combined cattle analyses (BCMO1.11, BCMO1.15, SCARB1.8) and only one in the combined buffalo analyses (BCO2.18). Because these markers are significant when the breeds of each species included in this research are analyzed simultaneously, they might represent an important tool for selection in cattle and buffalo when the breed or breed proportions of an animal are unknown, or in the case of crossbreds. As for the single breeds, SNPs with a significant P-value were found in Jersey crosses (SCARB1.10), Gir (BCMO1.8, BCO2.2, BCO2.5, BCO2.9 and BCO2.10) and Sahiwal (SCARB1.16) for cattle, and in Jafarabadi (BCMO1.21), Murrah (SCARB1.24) and Surti (BCO2.16 and SCARB1.20) for buffalo. Finally, there are significant cases, such as the BCMO1 gene in Gir cattle and Jafarabadi buffalo, where several markers within the gene were in linkage disequilibrium, and it might therefore be beneficial to select for a specific haplotype in that gene.

In the Bayesian approach, inferences are based on posterior probabilities. The BayesA prior assumes a t distribution centered at zero for the effects of all loci, so the effect of each locus is a priori equally likely to be positive or negative. On the other hand, the BayesCp prior assumes that the SNP effect is null with probability p or has a normal distribution with probability 1-p for all loci, where the probability p is treated as unknown with a uniform prior. Thus, in a BayesCp analysis, the posterior probability that a locus has a non-null effect can be taken as evidence of an association of the SNP with the trait.
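The two priors described above can be written compactly as follows. The notation is an assumption made for illustration: the effect of SNP j is written as alpha_j, and the degrees of freedom (nu) and scale (S^2) of the BayesA prior are hyperparameters whose values are not given in this chapter.

```latex
% BayesA: locus-specific variances give each effect a marginal t prior
\alpha_j \mid \sigma_j^2 \sim N(0, \sigma_j^2), \qquad
\sigma_j^2 \sim \nu S^2 \chi_\nu^{-2}
\;\Longrightarrow\; \alpha_j \sim t_\nu(0, S^2)

% BayesCp: point mass at zero with probability p, common-variance normal otherwise
\alpha_j \mid p, \sigma_\alpha^2 \sim
\begin{cases}
0 & \text{with probability } p, \\
N(0, \sigma_\alpha^2) & \text{with probability } 1 - p,
\end{cases}
\qquad p \sim \mathrm{Uniform}(0, 1)
```

Because the t prior in BayesA is symmetric about zero, each effect is positive or negative with a prior probability of 0.5.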

In the BayesA analysis, posterior probabilities of an effect being positive or negative that deviate from the prior probability of 0.5 can be taken as evidence of an association of the SNP with the trait. Table 3.7 gives the SNPs that had posterior probabilities greater than 0.8, 0.7, or 0.5 (showing all SNPs) of being positive or negative, together with the posterior mean for the difference between the most favorable and the least favorable genotypes involving these SNPs.

It is often misleading to directly compare results from the different approaches, as they should be expected to differ. However, we include this discussion in part to show that, while some differences did exist, there was overlap. Estimates (posterior means) of the marker effects obtained from the Bayesian approach ranged from 0.01 to 2.02 (data not shown). Means for total gain were generally similar to or smaller than those found with the frequentist approach.

Interestingly, even though the frequentist approach is not directly comparable with the Bayesian methods, the number of SNPs with important effects was similar, and in many cases the sets of SNPs with important effects in the two approaches overlapped. Six SNPs overlapped for the combined cattle breeds when comparing the results of the frequentist approach and the most stringent threshold of both Bayesian approaches. In the case of the combined buffalo breeds, seven SNPs overlapped across the three analyses performed. The biggest difference was found in the Jersey cattle breed, where the Bayesian analyses found five fewer important SNPs than the frequentist approach, and the total gain in the frequentist approach was nearly 45 micrograms/100 ml higher than in the Bayesian approach.

In the combined buffalo breeds, the frequentist approach found 7 significant SNPs, whereas the Bayesian approach with a BayesA prior and a posterior probability (PP) > 0.7 found 13. Results from the BayesCp analyses showed a different trend. SNPs that had non-null effects with a PP greater than 0.0 (showing all SNPs), 0.1 or 0.2 are given in Table 3.8 for all breeds except for the Surti and Pandharpuri breeds, which showed markers with important effects with PP > 0.4 and PP > 0.5 (Table 3.8), along with the posterior means for total gain in beta-carotene content in milk of animals that are homozygous for the favorable alleles at these SNPs. Some buffalo breeds tended to show higher total gains than any of the cattle breeds when the BayesCp prior was used to analyze the data, the opposite of the trend in both the BayesA and frequentist analyses, in which cattle breeds had higher total gains than buffalo breeds. For the analyses with the BayesCp prior, the breeds with the highest overall total gain were the buffalo breeds Pandharpuri and Surti, with 10.42 ± 7.33 and 9.43 ± 5.67 micrograms/100 ml, respectively, for PP > 0.0. For cattle, Holstein-crossed animals showed the highest total gain, with 6.34 ± 7.20 micrograms/100 ml. Another result is that only buffalo breeds (Pandharpuri, Surti and Jafarabadi) showed markers with posterior probabilities higher than 0.2, reaching 0.8 in the case of Jafarabadi. Pandharpuri and Surti, however, had only 213 and 388 observations, respectively, and as a result the posterior mean of p was 0.51 for Pandharpuri and 0.55 for Surti, both very close to the prior mean of 0.5. Thus, posterior means for total gain in the BayesCp analyses for these two breeds were close to but lower than those from the BayesA analyses, which implicitly have a p of 0.0. When the breeds were combined within cattle (2290 observations) and buffalo (2238 observations), the posterior mean of p was 0.88 for cattle and 0.87 for buffalo, which implies that a large proportion of the SNPs had no association with the trait. Thus, posterior means for total gain in cattle and buffalo in the BayesCp analysis were much lower than those from the BayesA analyses, but they were higher for cattle than for buffalo, as in the BayesA analyses.

It is important to note that estimates of total gain in the Bayesian analyses tended to be smaller than those in the frequentist approach because of both the generally smaller number of markers found significant and the expected shrinkage of their effects in the Bayesian analyses (Bhattacharya et al., 2012). Even though there are clear differences in the magnitude of the substitution effects and the number of significant markers for each breed depending on the analysis method used, the results of our analyses confirm the possible applicability of genetic selection for improving the nutritional value of milk with regard to beta-carotene content and demonstrate that there is value in further investigating the genetic potential of cattle and buffalo breeds for its production.

It is also very important to note that the expression of a phenotype depends on the interaction between environment and genotype, and most of the animals sampled for this project were kept under harsh environmental conditions. India is a developing country with rural areas that are often poor, and the animals sampled were under varied and generally suboptimal management and nutritional conditions. Even though a fixed effect for herd was included in our statistical models to account for the different herd conditions found throughout the samples, the generally less-than-optimal nutritional, environmental and management conditions under which these animals were kept might have had an overall negative effect that prevented or decreased the full expression of the phenotypes associated with the concentration of beta-carotene in milk. Therefore, improving these conditions should go hand in hand with the selection program to successfully and significantly increase the concentration of beta-carotene in milk in the Indian cattle and buffalo populations.

Conclusions

The custom panel designed for genes related to beta-carotene production shows applicability in genotyping of cattle and buffalo in India. Among the genotyped SNPs, some were significantly associated with beta-carotene content in several cattle and buffalo breeds, providing markers that may be useful to develop genetic selection strategies that can increase beta-carotene content in milk of those populations and that could be tested in other developing countries. Moreover, the recommendation to select for specific alleles at the significant markers may provide the direction to effectively increase the beta-carotene content in milk in the Indian cattle and buffalo populations. Additional analyses will be required to evaluate haplotype-based selection for SNPs in high linkage disequilibrium. Moreover, future genome-wide association studies may reveal additional genes associated with beta-carotene or vitamin A in cattle and buffalo. The possible discovery of new candidate genes involved in beta-carotene production would help increase the number of informative SNPs in this panel.

Acknowledgements

The authors wish to express their appreciation to Neogen (Geneseek) for development of the Sequenom panel, Imperial Life Sciences (P) Limited for genotyping, James M Reecy and Eline Fritz Waters (Iowa State University) for assistance, the International Buffalo Consortium, the 1000 Bull Genomes Project, Marcos Vinicius Gualberto B Silva (EMBRAPA) and Jose Fernando Garcia (UNESP) for providing the NGS data and Gujarat, Rajasthan, Punjab, Jharkhand and Maharashtra (BAIF) for timely collection of samples. Funding for this project was kindly provided by the Bill and Melinda Gates Foundation, State of Iowa and the Ensminger Fund.

Literature cited

Amengual, J., G.P. Lobo, M. Golczak, H.N.M. Li, T. Klimova, C.L. Hoppel, A. Wyss, K. Palczewski, and J. von Lintig. 2011. A mitochondrial enzyme degrades carotenoids and protects against oxidative stress. FASEB J. 25:948–59. doi:10.1096/fj.10-173906.

Barrett, J.C., B. Fry, J. Maller, and M.J. Daly. 2005. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21:263–265. doi:10.1093/bioinformatics/bth457.

Beaton, G.H., R. Martorell, K.J. Aronson, B. Edmonston, G. McCabe, A.C. Ross, and B. Harvey. 1993. Effectiveness of Vitamin A Supplementation in the Control of Young Child Morbidity and Mortality in Developing Countries. Toronto Canada University of Toronto Faculty of Medicine International Nutrition Program 1993 Dec.

Bennasir, H., S. Sridhar, and T.T. Abdel-Razek. 2010. Vitamin A from physiology to disease prevention. Int. J. Pharm. Sci. Rev. Res. 1:68–73.

Berry, S.D., S.R. Davis, E.M. Beattie, N.L. Thomas, A.K. Burrett, H.E. Ward, A.M. Stanfield, M. Biswas, A.E. Ankersmit-Udy, P.E. Oxley, J.L. Barnett, J.F. Pearson, Y. Van Der Does, A.H.K. MacGibbon, R.J. Spelman, K. Lehnert, and R.G. Snell. 2009. Mutation in bovine β-carotene oxygenase 2 affects milk color. Genetics 182:923–926. doi:10.1534/genetics.109.101741.

Bhattacharya, A., D. Pati, N.S. Pillai, and D.B. Dunson. 2012. Bayesian shrinkage.

Borel, P., G. Lietz, A. Goncalves, F. Szabo de Edelenyi, S. Lecompte, P. Curtis, L. Goumidi, M.J. Caslake, E.A. Miles, C. Packard, P.C. Calder, J.C. Mathers, A.M. Minihane, F. Tourniaire, E. Kesse-Guyot, P. Galan, S. Hercberg, C. Breidenassel, M. González Gross, M. Moussa, A. Meirhaeghe, and E. Reboul. 2013. CD36 and SR-BI Are Involved in Cellular Uptake of Provitamin A Carotenoids by Caco-2 and HEK Cells, and Some of Their Genetic Variants Are Associated with Plasma Concentrations of These Micronutrients in Humans. J. Nutr. 143:448–456. doi:10.3945/jn.112.172734.

Buzanakas, E., T.C.S. Chud, J.Cl. do C. Panetto, M.A. Machado, L.O.C. da Silva, V.G.B. da Silva, and D.P. Munari. 2018. Breeding structure and genetic variability in Nellore and Gyr breeds from Brazil and India. Page in Reuniao Anual da Sociedade Brasileira de Zootecnia.

Chen, C., J.P. Steibel, and R.J. Tempelman. 2017. Genome-Wide Association Analyses Based on Broadly Different Specifications for Prior Distributions, Genomic Windows, and Estimation Methods. Genetics 206:1791–1806. doi:10.1534/genetics.117.202259.

Cheng, H., R. Fernando, and D. Garrick. 2018. JWAS: Julia implementation of Whole-genome Analyses Software. Proc. World Congr. Genet. Appl. to Livest. Prod. 859.

D’Ambrosio, D.N., R.D. Clugston, and W.S. Blaner. 2011. Vitamin A Metabolism: An Update. Nutrients 3:63–103. doi:10.3390/nu3010063.

Daetwyler, H.D., A. Capitan, H. Pausch, P. Stothard, R. Van Binsbergen, R.F. Brøndum, X. Liao, A. Djari, S.C. Rodriguez, C. Grohs, D. Esquerré, O. Bouchez, M.N. Rossignol, C. Klopp, D. Rocha, S. Fritz, A. Eggen, P.J. Bowman, D. Coote, A.J. Chamberlain, C. Anderson, C.P. Vantassell, I. Hulsegge, M.E. Goddard, B. Guldbrandtsen, M.S. Lund, R.F. Veerkamp, D.A. Boichard, R. Fries, and B.J. Hayes. 2014. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nat. Genet. 46:858–865. doi:10.1038/ng.3034.

Davis, K.L., R.S. Kahn, G. Ko, and M. Davidson. 1991. Dopamine in schizophrenia: A review and reconceptualization. Am. J. Psychiatry 148:1474–1486. doi:10.1176/ajp.148.11.1474.

FAO. 2013. Dairy Production and Products. Accessed October 17, 2018. http://www.fao.org/dairy-production-products/production/dairy-animals/cattle/en/.

Fernando, R., A. Toosi, A. Wolc, D. Garrick, and J. Dekkers. 2017. Application of Whole-Genome Prediction Methods for Genome-Wide Association Studies: A Bayesian Approach. J. Agric. Biol. Environ. Stat. 22:172–193. doi:10.1007/s13253-017-0277-6.

Gelman, A., J. Hill, and M. Yajima. 2012. Why We (Usually) Don’t Have to Worry About Multiple Comparisons. J. Res. Educ. Eff. 5:189–211. doi:10.1080/19345747.2011.618213.

Goodman, A.B., and A.B. Pardee. 2003. Evidence for defective retinoid transport and function in late onset Alzheimer’s disease. Proc. Natl. Acad. Sci. U. S. A. 100:2901–2905. doi:10.1073/pnas.0437937100.

Habier, D., R.L. Fernando, K. Kizilkaya, and D.J. Garrick. 2011. Extension of the bayesian alphabet for genomic selection. BMC Bioinformatics 12. doi:10.1186/1471-2105-12-186.

Hussey, G.D., and M. Klein. 1990. A randomized, controlled trial of vitamin A in children with severe measles. New Engl. J. Med. 323:160–164. doi:10.1056/NEJM199007193230304.

Hwang, K.B., I.H. Lee, H. Li, D.G. Won, C. Hernandez-Ferrer, J.A. Negron, and S.W. Kong. 2019. Comparative analysis of whole-genome sequencing pipelines to minimize false negative findings. Sci. Rep. 9:3219. doi:10.1038/s41598-019-39108-2.

Li, H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27:2987–2993. doi:10.1093/bioinformatics/btr509.


Li, H., and R. Durbin. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595. doi:10.1093/bioinformatics/btp698.

Lindqvist, A., J. Sharvill, D.E. Sharvill, and S. Andersson. 2007. Loss-of-Function Mutation in Carotenoid 15,15′-Monooxygenase Identified in a Patient with Hypercarotenemia and Hypovitaminosis A. J. Nutr. 137:2346–2350. doi:10.1093/jn/137.11.2346.

Lynch, S.R. 2009. Interaction of Iron with Other Nutrients. Nutr. Rev. 55:102–110. doi:10.1111/j.1753-4887.1997.tb06461.x.

McKenna, C. 2017. The impact of legislation and industry standards on farm animal welfare.

McLaren, W., L. Gil, S.E. Hunt, H.S. Riat, G.R.S. Ritchie, A. Thormann, P. Flicek, and F. Cunningham. 2016. The Ensembl Variant Effect Predictor. Genome Biol. 17:122. doi:10.1186/s13059-016-0974-4.

Meuwissen, T.H.E., B.J. Hayes, and M.E. Goddard. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829. doi:10.1046/j.1365-2540.1998.00308.x.

Olson, J.A. 1999. Carotenoids and human health. Arch. Latinoam. Nutr. 49:7S-11S.

Ribaya-Mercado, J.D. 2002. Influence of Dietary Fat on β-Carotene Absorption and Bioconversion into Vitamin A. 104–110.

Sargolzaei, M., J.P. Chesnais, and F.S. Schenkel. 2014. A new approach for efficient genotype imputation using information from relatives. BMC Genomics 15:478. doi:10.1186/1471-2164-15-478.

SAS institute. 2013. SAS 9.4 User’s Guide 4. doi:10.1016/B978-0-08-087780-8.00056-5.

Schmitz, H.H., C.L. Poor, R.B. Wellman, and J.W. Erdman. 1991. Concentrations of Selected Carotenoids and Vitamin A in Human Liver, Kidney and Lung Tissue. J. Nutr. 121:1613–1621. doi:10.1093/jn/121.10.1613.

Semba, R.D. 2009. The Role of Vitamin A and Related Retinoids in Immune Function. Nutr. Rev. 56:S38–S48. doi:10.1111/j.1753-4887.1998.tb01643.x.

Shete, V., and L. Quadro. 2013. Mammalian Metabolism of β-Carotene: Gaps in Knowledge. Nutrients 5:4849–4868. doi:10.3390/nu5124849.

Sonstegard, T., J. Williams, S.G. Schroeder, J.F. Garcia, A. Zimin, M. Babar, and D. Iamartino. 2012. W114: SNP Discovery Across Buffalo Breeds. Page in Proceedings of the XX Plant and Animal Genome Conference. Plant and Animal Genome, San Diego.

Stephens, M., and D.J. Balding. 2009. Bayesian statistical methods for genetic association studies. Nat. Rev. Genet. 10:681–690. doi:10.1038/nrg2615.

Toomey, M.B., R.J. Lopes, P.M. Araújo, J.D. Johnson, M.A. Gazda, S. Afonso, P.G. Mota, R.E. Koch, G.E. Hill, J.C. Corbo, and M. Carneiro. 2017. High-density lipoprotein receptor SCARB1 is required for carotenoid coloration in birds. Proc. Natl. Acad. Sci. U. S. A. 114:5219–5224. doi:10.1073/pnas.1700751114.

Ullah, R., S. Khan, H. Ali, M. Bilal, and M. Saleem. 2017. Identification of cow and buffalo milk based on Beta carotene and vitamin-A concentration using fluorescence spectroscopy. PLoS One 12:1–10. doi:10.1371/journal.pone.0178055.

Valacchi, G., C. Sticozzi, Y. Lim, and A. Pecorelli. 2011. Scavenger receptor class B type I: a multifunctional receptor. Ann. N. Y. Acad. Sci. 1229:E1–E7. doi:10.1111/j.1749-6632.2011.06205.x.


Tables and figures

Table 3.1 Sequenom panel. Coordinates for the SNPs that targeted cattle (B. taurus and B. indicus) are based on the UMD3.1 reference genome. SNPs independently detected but overlapping in both B. taurus and B. indicus are reported with the same superscript number (shown as ^n) in the "Sequenom SNP" column.

Genomic coordinates             Sequenom SNP  Species     Effect                 dbSNP
15:g.22841716T>G                BCO2.1        B. indicus  synonymous             -
15:g.22877552G>A                BCO2.2        B. taurus   stop                   rs109226280
15:g.22878937G>A                BCO2.3        B. taurus   non-synonymous         rs445248588
15:g.22886751A>T                BCO2.4        B. taurus   stop                   rs475149853
15:g.22886781G>A                BCO2.5        B. indicus  non-synonymous         -
15:g.22887879G>A                BCO2.6        B. indicus  non-synonymous         -
15:g.22902367G>A                BCO2.7        B. taurus   synonymous             rs468029187
15:g.22903083C>T                BCO2.8        B. taurus   non-synonymous         rs463702615
15:g.22903195A>G                BCO2.9        B. indicus  synonymous             -
15:g.22904160C>T                BCO2.10       B. indicus  non-synonymous         -
15:g.22904161A>T                BCO2.11       B. indicus  non-synonymous         -
15:g.22905346T>C                BCO2.12       B. indicus  non-synonymous         -
17:g.53180716C>G                SCARB1.1^1    B. taurus   Upstream variant       rs211588107
17:g.53180716C>G                SCARB1.2^1    B. indicus  Upstream variant       -
17:g.53181068C>G                SCARB1.3      B. indicus  non-synonymous         -
17:g.53229545A>C                SCARB1.4      B. indicus  non-synonymous         -
17:g.53237728A>G                SCARB1.5^2    B. taurus   synonymous             rs210238050
17:g.53237728A>G                SCARB1.6^2    B. indicus  synonymous             -
17:g.53237845T>C                SCARB1.7^3    B. taurus   synonymous             rs210829935
17:g.53237845T>C                SCARB1.8^3    B. indicus  synonymous             -
17:g.53242822C>A                SCARB1.9      B. taurus   synonymous             rs377844543
17:g.53242843C>T                SCARB1.10     B. indicus  synonymous             -
17:g.53242906C>T                SCARB1.12     B. taurus   synonymous             rs478582082
17:g.53245654G>A                SCARB1.13^4   B. taurus   synonymous             rs211138720
17:g.53245654G>A                SCARB1.14^4   B. indicus  synonymous             -
17:g.53258090A>G                SCARB1.15     B. indicus  synonymous             -
17:g.53263896C>T                SCARB1.16     B. indicus  synonymous             -
18:g.7930930C>T                 BCMO1.1       B. taurus   Upstream gene variant  rs110137311
18:g.7938090G>A                 BCMO1.2       B. taurus   non-synonymous/stop    rs210669227
18:g.7938092G>A                 BCMO1.3       B. indicus  synonymous             -
18:g.7942847T>C                 BCMO1.4       B. indicus  synonymous             -
18:g.7942924G>C                 BCMO1.5       B. indicus  non-synonymous         -
18:g.7944544C>A                 BCMO1.6       B. taurus   synonymous             np
18:g.7944547G>A                 BCMO1.7       B. indicus  synonymous             -
18:g.7944577C>T                 BCMO1.8       B. indicus  synonymous             -
18:g.7947135G>C                 BCMO1.9       B. indicus  non-synonymous         -
18:g.7947229C>T                 BCMO1.10^5    B. taurus   synonymous             rs381162140
18:g.7947229C>T                 BCMO1.11^5    B. indicus  synonymous             -
18:g.7947242G>A                 BCMO1.12      B. taurus   non-synonymous         rs209658446
18:g.7949278A>G                 BCMO1.13      B. indicus  synonymous             -
18:g.7949326T>C                 BCMO1.14      B. indicus  synonymous             -
18:g.7949381A>G                 BCMO1.15      B. taurus   non-synonymous         rs444976967
18:g.7962378C>G                 BCMO1.16      B. indicus  non-synonymous         -
jcf7180021617284:g.785353G>A    BCO2.13       B. bubalis  synonymous             -
jcf7180021617284:g.785395C>T    BCO2.14       B. bubalis  synonymous             -
jcf7180021617284:g.787642T>C    BCO2.15       B. bubalis  non-synonymous         -
jcf7180021617284:g.788450C>T    BCO2.16       B. bubalis  synonymous             -
jcf7180021617284:g.845577G>A    BCO2.17       B. bubalis  synonymous             -
jcf7180021617284:g.847449G>A    BCO2.18       B. bubalis  non-synonymous         -
jcf7180021616390:g.1200693C>T   SCARB1.17     B. bubalis  synonymous             -
jcf7180021616390:g.1226834G>T   SCARB1.18     B. bubalis  non-synonymous         -
jcf7180021616390:g.1226890G>C   SCARB1.19     B. bubalis  synonymous             -
jcf7180021616390:g.1226924A>G   SCARB1.20     B. bubalis  non-synonymous         -
jcf7180021616390:g.1226944G>A   SCARB1.21     B. bubalis  synonymous             -
jcf7180021616390:g.1231985C>T   SCARB1.22     B. bubalis  non-synonymous         -
jcf7180021616390:g.1235067C>T   SCARB1.23     B. bubalis  synonymous             -
jcf7180021616390:g.1235151G>A   SCARB1.24     B. bubalis  synonymous             -
jcf7180021616390:g.1283516A>C   SCARB1.25     B. bubalis  synonymous             -
jcf7180021615735:g.3038221G>T   BCMO1.17      B. bubalis  missense               -
jcf7180021615735:g.3038243C>T   BCMO1.18      B. bubalis  synonymous             -
jcf7180021615735:g.3040767G>T   BCMO1.19      B. bubalis  synonymous             -
jcf7180021615735:g.3051607C>T   BCMO1.20      B. bubalis  synonymous             -
jcf7180021615735:g.3062829G>C   BCMO1.21      B. bubalis  non-synonymous         -
jcf7180021615735:g.3062850G>A   BCMO1.22      B. bubalis  synonymous             -
jcf7180021615735:g.3062908G>A   BCMO1.23      B. bubalis  non-synonymous         -
jcf7180021615735:g.3068901A>G   BCMO1.24      B. bubalis  synonymous             -


Table 3.2 Number of samples and mean beta-carotene (BC) concentration in milk.

Species  Breed           Number of samples  BC mean (µg/100 ml)  Standard error
Cattle   Combined        2291               4.41                 0.11
Cattle   Holstein Cross  492                6.16                 0.38
Cattle   Jersey Cross    512                3.90                 0.16
Cattle   Sahiwal         392                4.34                 0.24
Cattle   Tharparkar      481                4.04                 0.18
Cattle   Gir             414                3.50                 0.17
Buffalo  Combined        2242               4.33                 0.11
Buffalo  Jafarabadi      458                5.50                 0.26
Buffalo  Murrah          470                4.71                 0.25
Buffalo  Pandharpuri     412                3.35                 0.19
Buffalo  Mehsana         489                4.31                 0.28
Buffalo  Surti           413                3.61                 0.23

Table 3.3 Fixed effects included in the linear models used to analyze each cattle breed.

Breed           Model                                            R²
Combined        $ breed + place (breed) + farmer (place*breed)   0.566
Combined        all common SNPs*                                 0.577
Combined        non-synonymous/stop**                            0.57
Combined        P<0.3 SNPs*** (11 SNPs)                          0.573
Holstein Cross  $ farmer + place                                 0.545
Holstein Cross  all SNPs****                                     0.574
Holstein Cross  non-synonymous/stop                              0.551
Holstein Cross  P<0.3 SNPs (10 SNPs)                             0.558
Jersey Cross    $ farmer + place + yield                         0.807
Jersey Cross    all SNPs                                         0.861
Jersey Cross    non-synonymous/stop                              0.815
Jersey Cross    P<0.3 SNPs (12 SNPs)                             0.848
Gir             $ farmer + place + lactation + yield             0.58
Gir             all SNPs                                         0.642
Gir             non-synonymous/stop                              0.608
Gir             P<0.3 SNPs (9 SNPs)                              0.621
Sahiwal         $ farmer + place + lactation + yield             0.871
Sahiwal         all SNPs                                         0.913
Sahiwal         non-synonymous/stop                              0.893
Sahiwal         P<0.3 SNPs (13 SNPs)                             0.898
Tharparkar      $ farmer + place + lactation + yield             0.112
Tharparkar      all SNPs                                         0.155
Tharparkar      non-synonymous/stop                              0.128
Tharparkar      P<0.3 SNPs (7 SNPs)                              0.133

$ models marked with $ (bolded in the original layout) include no SNPs
* model including all SNPs with call rate ≥ 90% for all five breeds in addition to fixed effects
** model including only SNPs that code for non-synonymous and stop codons in addition to fixed effects
*** model including only SNPs with P-value < 0.3 in addition to fixed effects
**** model including all SNPs with call rate ≥ 90% in addition to fixed effects


Table 3.4 Fixed effects included in the linear models used to analyze each buffalo breed.

Breed        Model                                          R²
Combined     $ breed + batch(breed) + place(batch*breed)    0.336
Combined     all common SNPs*                               0.346
Combined     non-synonymous/stop**                          0.341
Combined     P<0.3 SNPs*** (8 SNPs)                         0.344
Jafarabadi   $ batch + farmer + place + lactation + yield   0.475
Jafarabadi   all SNPs****                                   0.516
Jafarabadi   non-synonymous/stop                            0.498
Jafarabadi   P<0.3 SNPs (6 SNPs)                            0.504
Mehsana      $ batch + farmer + place + lactation           0.464
Mehsana      all SNPs                                       0.469
Mehsana      non-synonymous/stop                            0.492
Mehsana      P<0.3 SNPs (6 SNPs)                            0.503
Murrah       $ batch + farmer + place + lactation + yield   0.873
Murrah       all SNPs                                       0.902
Murrah       non-synonymous/stop                            0.872
Murrah       P<0.3 SNPs (4 SNPs)                            0.886
Pandharpuri  $ batch + farmer + place                       0.84
Pandharpuri  all SNPs                                       0.88
Pandharpuri  non-synonymous/stop                            0.885
Pandharpuri  P<0.3 SNPs (9 SNPs)                            0.916
Surti        $ batch + farmer + place + lactation + yield   0.408
Surti        all SNPs                                       0.483
Surti        non-synonymous/stop                            0.431
Surti        P<0.3 SNPs (8 SNPs)                            0.459

$ models marked with $ (bolded in the original layout) include no SNPs
* model including all SNPs with call rate ≥ 90% for all five breeds in addition to fixed effects
** model including only SNPs that code for non-synonymous and stop codons in addition to fixed effects
*** model including only SNPs with P-value < 0.3 in addition to fixed effects
**** model including all SNPs with call rate ≥ 90% in addition to fixed effects

Table 3.5 Cattle SNPs with P-values ≤ 0.30 and F-tests for significant SNPs in each gene.

Total P. breed + place (breed) + farmer Gene Breed SNP mean* (place*breed) effect effect** BCMO1 BCMO1.3 BCMO1.11 BCMO1.15 -0.536 -1.937 allele sub. effect (SE) (µg/100ml) 1.286 (0.47) 7.57 (2.48) (0.38) (1.08) P-value 0.16 0.006 0.007 0.002a recommended allele A T A Freq. of recommended allele 0.07 0.04 0.99 BCO2 BCO2.2 BCO2.4 BCO2.5 BCO2.12 -0.965 -0.399 allele sub. effect (SE) (µg/100ml) 1.212 (0.87) 0.368 (0.32) 5.88 (2.84) 17.38 (0.92) (0.31) All 4.41 (4.31) Breeds (0.11) P-value 0.16 0.3 0.25 0.2 0.04 P < 68 recommended allele G A G C 0.0001

Freq.of recommended allele 0.63 0.99 0.88 0.24 SCARB1 SCARB1.2 SCARB1.3 SCARB1.8 -0.745 allele sub. effect (SE) (µg/100ml) 0.516 (0.32) 0.715 (0.39) 3.98 (1.73) (0.37) P-value 0.11 0.07 0.05 0.02 recommended allele C C T Freq. of recommended allele 0.7 0.54 0.45

Table 3.5 continued

Breed P. mean farmer + place SNP Gene effect Total effect BCMO1.1 BCMO1 BCMO1.5 BCMO1.16 2 allele sub. effect (SE) -3.40 (1.80) 0.80 (0.57) -2.76 (2.02) 14.02 (5.64) (µg/100ml) P-value 0.06 0.16 0.17 0.01 recommended allele G G G Holstein 6.16 Freq. of recommended allele 0.98 0.85 0.01 30.62 (10.04) Cross (0.38) P < 0.002 BCO2 BCO2.1 BCO2.2 BCO2.6 BCO2.11 allele sub. effect (SE) 1.80 (1.45) 2.85 (1.59) -1.82 (1.68) 1.94 (1.81) 16.58 (8.18) (µg/100ml) P-value 0.21 0.07 0.28 0.29 0.04 recommended allele T G A G Freq. of recommended allele 0.93 0.99 0.12 0.97 69 Pop. Gene effect Total effect Breed farmer + place + yield SNP mean (µg/100ml) (µg/100ml) BCO2 BCO2.1 BCO2.3 BCO2.4 BCO2.6 allele sub. effect (SE) -1.55 (1.07) -2.66 (1.71) -1.09 (0.89) 1.36 (0.98) 13.32 (9.30) (µg/100ml) P-value 0.15 0.13 0.21 0.12 0.01 recommended allele G A A G

Jersey 3.90 Freq. of recommended allele 0.28 0.02 0.63 0.73 31.06 (14.02) Cross (0.16) SCARB1 SCARB1.3 SCARB1.5 SCARB1.10 SCARB1.16 P < 0.0001 allele sub. effect (SE) 2.31 (1.29) 1.18 (0.58) 3.18 (0.92) -2.19 (1.56) 17.74 (4.72) (µg/100ml) P-value 0.07 0.03 0.0009 0.17 0.0003 recommended allele C G T C Freq. of recommended allele 0.68 0.58 0.14 0.99

Table 3.5 continued

farmer + place + SNP Gene effect Total effect Breed P. mean lactation + yield (µg/100ml) (µg/100ml) BCMO1 BCMO1.2 BCMO1.3 BCMO1.8 allele sub. effect (SE) -4.72 -1.03 -1.24 14.1 (6.66) (µg/100ml) (3.18) (0.64) (0.54) P-value 0.14 0.11 0.02 0.04 recommended allele A A C Freq. of recommended 0.002 0.1 0.8 allele BCO2 BCO2.5 BCO2.9 BCO2.10 BCO2.12 allele sub. effect (SE) -1.60 -3.13 -1.48 2.39 (1.01) 17.12 (5.40) (µg/100ml) (0.74) (1.20) (0.90) 3.50 33.86 (8.76) Gir P-value 0.02 0.03 0.01 0.1 0.002 (0.17) P < 0.0001 recommended allele G A C C Freq of recommended 70 0.88 0.89 0.09 0.04 allele SCARB SCARB1 1.5 allele sub. effect (SE) 1.28 (0.50) 2.56 (1.00) (µg/100ml) P-value 0.01 0.01 recommended allele G Freq. of recommended 0.97 allele

Table 3.5 continued.

farmer + place + lactation + Gene effect Total effect Breed P. mean SNP yield (µg/100ml) (µg/100ml) BCMO BCMO BCMO BCMO1 1.8 1.12 1.14 allele sub. effect (SE) 0.91 -2.48 0.93 (0.88) 8.64 (3.84) (µg/100ml) (0.78) (1.60) P-value 0.25 0.12 0.29 0.03 recommended allele T A T Freq. of recommended allele 0.32 0.02 0.3 BCO2 BCO2.9 BCO2.11 BCO2.12 allele sub. effect (SE) -1.43 -1.86 -1.45 9.48 (3.98) (µg/100ml) (1.09) (1.03) (0.80) 4.34 39.06 (10.99) P < Sahiwal (0.24) P-value 0.19 0.07 0.07 0.02 0.0006 recommended allele A A C 71

Freq. of recommended allele 0.92 0.06 0.24 SCARB SCARB SCARB SCARB SCARB1 1.5 1.7 1.10 1.16 allele sub. effect (SE) -1.67 -2.00 -0.58 -6.21 (µg/100ml) (0.10) (1.15) (0.48) (3.13) P-value 0.1 0.08 0.23 0.05 0.02 recommended allele A C C C Freq. of recommended allele 0.04 0.52 0.78 0.99

Table 3.5 continued.

Breed P. mean farmer + place + lactation + yield SNP Gene effect (µg/100ml) Total effect (µg/100ml) BCMO BCMO1 1.9 allele sub. effect (SE) (µg/100ml) 0.48 (0.39) 0.96 (0.78) P-value 0.23 0.23 recommended allele C Freq. of recommended allele 0.19 Tharparkar 4.04 (0.18) 1.78 (1.025) P < 0.08 BCO BCO2 2.1 allele sub. effect (SE) (µg/100ml) -0.41 (0.31) 0.82 (0.62) P-value 0.18 0.18 recommended allele G Freq. of recommended allele 0.6 72

* Mean beta carotene content for the breed (SE) (µg/100ml)
** Represents EBV of an animal homozygous for all favorable alleles
SNPs with P ≤ 0.05 are underlined
a P-value for gene-wide effect
BCMO1, SCARB1 and BCO2 are the gene symbols

Table 3.6 Buffalo SNPs with P< 0.30 and F-tests for significant SNPs in each gene.*

P. breed + batch (breed) + place Gene effect Total effect*** Breed SNP mean** (batch*breed) (µg/100ml) (µg/100ml) BCMO BCMO BCMO1 1.21 1.24 0.17 0.37 allele sub. effect (SE) (µg/100ml) 1.08 (0.52) (0.15) (0.23) P-value 0.25 0.1 0.04a recommended allele C A frequency of recommended allele 0.58 0.56 BCO BCO BCO2 2.16 2.18 0.45 0.38 allele sub. effect (SE) (µg/100ml) 1.62 (0.66) 4.72 (1.046) Combin 4.33 (0.28) (0.17) ed (0.11) P-value 0.1 0.03 0.01 P < 0.0001

recommended allele T A 73 frequency of recommended allele 0.07 0.84 SCARB SCARB SCARB SCARB1 1.17 1.19 1.20 0.23 0.35 -0.35 allele sub. effect (SE) (µg/100ml) 1.96 (0.66) (0.16) (0.19) (0.27) P-value 0.16 0.06 0.2 0.003 recommended allele T C G frequency of recommended allele 0.55 0.46 0.72

Table 3.6 continued.

P. mean batch + farmer+ place + lactation + yield SNP Gene effect (µg/100ml) Total effect (µg/100ml) BCMO BCMO BCMO1 7.84 (2.31) 1.21 1.24 allele sub. effect (SE) (µg/100ml) 1.40 (0.53) 1.38 (0.78) 5.56 (1.58) P < 0.0008 P-value 0.008 0.08 0.0005 recommended allele C A frequency of recommended allele 0.62 0.88 BCO BCO2 2.14 allele sub. effect (SE) (µg/100ml) 0.55 (0.49) 1.10 (0.98) Jafarabadi 5.5 (0.26) P-value 0.26 0.26 recommended allele T

frequency of recommended allele 0.43 74

SCARB SCARB1 1.19 allele sub. effect (SE) (µg/100ml) 0.59 (0.51) 1.18 (1.02) P-value 0.24 0.24 recommended allele C frequency of recommended allele 0.54

Table 3.6 continued.

batch + farmer + place + lactation + Gene effect Total effect P.mean SNP yield (µg/100ml) (µg/100ml) BCMO BCMO BCMO1 10.64 (4.68) 1.22 1.24 allele sub. effect (SE) (µg/100ml) 0.44 (0.27) 0.41 (0.36) 1.7 (0.94) P < 0.02 P-value 0.1 0.25 0.07 recommended allele C A frequency of recommended allele 0.47 0.84 4.71 Murrah SCARB SCARB (0.25) SCARB1 1.22 1.24 -2.06 allele sub. effect (SE) (µg/100ml) 2.41 (1.12) 8.94 (4.44) (1.11) P-value 0.07 0.03 0.05 recommended allele C A 75

frequency of recommended allele 0.31 0.7 Gene effect Total effect P.mean batch + farmer+ place + lactation SNP (µg/100ml) (µg/100ml) BCMO BCMO BCMO1 9.86 (5.48) 1.19 1.24 -1.83 -3.10 Pandharpuri allele sub. effect (SE) (µg/100ml) P < 0.08 3.35 (1.31) (2.16) (0.19) P-value 0.17 0.16 recommended allele T G frequency of recommended allele 0.15 0.04

Table 3.6 continued

farmer + place + lactation Gene effect Total effect P. mean SNP + yield (µg/100ml) (µg/100ml) BCMO BCMO BCMO BCMO BCMO1 33.71 (12.30) 1.20 1.22 1.23 1.24 allele sub. effect (SE) 0.85 9.10 -1.38 -1.41 (1.03) 22.67 (11.66) P < 0.007 (µg/100ml) (0.69) (5.43) (0.84) P-value 0.17 0.22 0.1 0.1 0.05 recommended allele C A A G frequency of recommended 0.92 0.27 0.001 0.15 allele BCO2 BCO2.16 allele sub. effect (SE) 2.88 (1.17) 6.76 (2.38) (µg/100ml) 3.61 Surti P-value 0.009 0.005 (0.23)

recommended allele T 76

frequency of recommended 0.07 allele SCARB1.2 SCARB1 0 allele sub. effect (SE) -2.07 (0.88) 4.14 (1.76) (µg/100ml) P-value 0.02 0.02 recommended allele G frequency of recommended 0.7 allele * No SNPs with P-value ≤ 0.3 were found for Mehsana, therefore no SNPs were included in these analyses ** Mean beta carotene content for the breed (SE) (µg/100ml) *** Represents EBV of an animal homozygous for all favorable alleles a P-Value for Gen-wide effect BCMO1, SCARB1 and BCO2 are the gene symbols

Table 3.7 Significant SNPs for cattle and buffalo and Total effect (STD) using a Bayesian approach (BayesA).

Breed* Gene PP > 0.8 PP > 0.7 PP > 0.5*** Total effect (STD)** Total effect (STD) Total effect (STD) SNPs SNPs SNPs (µg/100ml) (µg/100ml) (µg/100ml) BCM 2, 3, 5, 2, 3, 5, 11, 15 2, 3, 5, 6, 7, 10, 11, 15 O1 11, 15 Cattle BCO2 2, 5, 13.09 (3.74) 2, 4, 5, 12 15.40 (4.26) 1, 2, 4, 5, 6, 7, 8, 11, 12 19.13 (5.73) combined SCAR 2, 3, 8, 2, 3, 8, 11, 13 2, 3, 4, 8, 9, 10 , 11, 13, 14, 16 B1 11 BCM 3, 5, 8, 10, 11, 2, 3, 5, 6, 7, 8, 10, 11, 12 , 15, 5, 12 O1 12, 16 16 Holstein BCO2 2 7.70 (7.08) 1, 2 16.64 (6.07) 1, 2, 4, 5, 6, 7, 8, 11 27.78 (11.71) cross SCAR 3 3 1, 2, 3, 5, 7, 8, 10, 13, 14, 16 B1 BCM 2, 3, 5, 7, 8, 9, 10, 11, 12, 14, none 15 O1 15, 16 Jersey 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, BCO2 4 6.85 (3.28) 1, 3, 4 9.80 (4.22) 19.29 (9.55) 77 cross 12 SCAR 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 3, 5, 10 3, 5, 10 B1 13, 14, 15, 16 BCM 2, 3, 5, 7, 8, 9, 10, 11, 12, 13, 3, 16 3, 8, 16 O1 14, 16 Gir BCO2 5, 9, 10 7.10 (3.53) 5, 8, 9, 10, 12 11.64 (4.76) 1, 2, 3, 5, 6, 8, 9, 10, 11, 12 19.27 (8.12) SCAR 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 5 4, 5, 11, 15 B1 13, 14, 15, 16 BCM 2, 3, 5, 7, 8, 9, 10, 11, 12, 14, none 7, 9, 12 O1 16 Sahiwal BCO2 11, 12 3.32 (1.97) 6, 9, 11, 12 9.19 (4.48) 1, 2, 4, 5, 6, 8, 9, 10, 11, 12 17.31 (8.86) SCAR 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 13, 10 5, 10, 13 B1 14, 16 BCM 7 7, 11 3, 5, 6, 7, 8, 9, 10, 11, 15 O1 Tharparkar BCO2 5 4.61 (2.15) 5 5.96 (2.78) 1, 4, 5, 6, 8, 11, 12 12.71 (7.20) SCAR 2, 11 2, 11 2, 3, 4, 6, 8, 10, 11, 13, 14, 16 B1

Table 3.7 continued.

Breed* Gene PP > 0.8 PP > 0.7 PP > 0.5*** Total effect (STD)** Total effect (STD) Total effect (STD) SNPs SNPs SNPs (µg/100ml) (µg/100ml) (µg/100ml) BCM 21, 24 18, 21, 24 18, 20, 21, 22, 23, 24 O1 Buffalo 16, 17, BCO2 5.99 (1.67) 16, 17, 18 7.79 (2.37) 14, 15, 16, 17, 18 9.49 (3.78) combined 18 SCAR 17, 19, 17, 19, 20, 21, 17, 18, 19, 20, 21, 22, B1 20, 21 22, 23, 24 23, 24, 25 BCM 21, 22, 21, 22, 24 18, 19, 20, 21, 22, 24 O1 24 Jaffarbadi BCO2 none 5.50 (1.73) 14 6.86 (2.03) 13, 14, 16, 17, 18 10.60 (5.96) SCAR 17, 18, 19, 20, 21, 22, 19 16, 19 B1 23, 24, 25 BCM 21, 24 18, 21, 24 18, 20, 21, 22, 24 O1 Murrah BCO2 16, 18 5.98 (1.48) 16, 18 8.45 (2.38) 14, 16, 17, 18 11.44 (4.81) 78 SCAR 17, 18, 19, 20, 21, 22, none 19, 24 B1 23, 24, 25 BCM 18, 19, 20, 21, 22, 23, 24 21, 24 O1 24 Pandharpuri BCO2 18 5.49 (2.93) 14, 18 11.62 (4.38) 13, 14, 15, 16, 17, 18 15.45 (8.28) SCAR 17, 18, 19, 20, 21, 22, 25 9 19, 21, 22, 23 B1 23, 24, 25 BCM 18, 24 18, 20, 22, 24 18, 20, 21, 22, 23, 24 O1 Surti BCO2 16 7.60 (2.92) 16 12.25 (4.28) 14, 16, 17, 18 16.16 (5.93) SCAR 17, 18, 19, 20, 21, 22, 19 17, 19, 20, 25 B1 23, 24, 25 * No SNPs with P-value ≤ 0.3 were found for Mehsana, therefore no SNPs were included in these analyses **Represents EBV of an animal homozygous for all favorable alleles ***Includes all SNPs with P-value ≤ 0.3

Table 3.8 Significant SNPs for cattle and buffalo and Total effect (STD) using a Bayesian approach (Bayes Cpi).

Breed* Gene Posterior Prob. > 0.0** Posterior Prob. > 0.1 Posterior Prob. > 0.2 Total effect*** Total effect** Total effect SNPs SNPs SNPs (µg/100ml) (µg/100ml) (µg/100ml) BCM 2, 3, 5, 6, 7, 10, 11, 15 2, 3, 5, 6, 10, 11, 15 10, 11, 15 O1 Cattle BCO 1, 2, 4, 5, 6, 7, 8, 11, 12 3.13 (3.52) 2, 4 2.89 (3.21) none 1.48 (1.75) combined 2 SCA 2, 3, 4, 8, 9, 10 , 11, 13, 14, 4, 8, 9, 13, 14 none RB1 16 BCM 2, 3, 5, 6, 7, 8, 10, 11, 12 , 2, 3, 5, 6, 7, 8, 10, 11 12, 3, 5, 7, 8, 10, O1 15, 16 15 16 11, 16 Holstein BCO 1, 2, 4, 5, 6, 7, 8, 11 6.34 (7.20) 1, 2, 3, 4, 5, 6, 7, 8, 10, 11 6.32 (7.18) 2 3.90 (4.63) cross 2 SCA 2, 3, 4, 5, 7, 8, 10, 11, 13, 1, 2, 3, 5, 7, 8, 10, 13, 14, 16 4 RB1 14, 16 BCM 2, 3, 5, 7, 8, 9, 10, 11, 12, 14, 2, 3, 5, 8, 9, 10, 11, 12, 14 none O1 15, 16 , 15, 16 79 Jersey BCO 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 3.66 (4.81) 1, 2, 3, 4, 5, 8, 9, 10, 11 3.59 (4.68) none 1.39 (1.78) cross 2 12 SCA 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 1, 3, 5, 6, 7. 8. 10, 11, 12 , 10 RB1 12, 13, 14, 15, 16 13 , 14, 15, 16 BCM 2, 3, 5, 7, 8, 9, 10, 11, 12, 13, 2, 16 none O1 14, 16 0 BCO Gir 1, 2, 3, 5, 6, 8, 9, 10, 11, 12 1.86 (3.08) 12 0.82 (1.84) none 2 SCA 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 4, 5, 7, 12, 16 none RB1 12, 13, 14, 15, 16 BCM 2, 3, 5, 7, 8, 9, 10, 11, 12, 14, 2, 3, 5, 8, 10, 11, 12, 14, none O1 16 16 BCO Sahiwal 1, 2, 4, 5, 6, 8, 9, 10, 11, 12 2.42 (3.45) 1, 2, 4, 6, 9, 9, 10, 11, 12 2.25 (3.32) 11 0.44 (1.03) 2 SCA 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 2, 3, 5, 8, 9, 10, 11, 13, 16 none RB1 13, 14, 16

Table 3.8 continued.

Breed* Gene Posterior Prob > 0.0** Posterior Prob > 0.1 Posterior Prob > 0.2 Total effect *** Total effect** Total effect SNPs SNPs SNPs (µg/100ml) (µg/100ml) (µg/100ml) 3, 5, 6, 7, 8, 9, 10, BCMO1 3, 6, 7, 10, 11, 15 7 11, 15 Tharparkar BCO2 1, 4, 5, 6, 8, 11, 12 1.69 (2.98) 4, 8 0.67 (1.21) none 0.32 (0.71) 2, 3, 4, 6, 8, 10, 11, SCARB1 4, 11, 14 none 13, 14, 16 18, 20, 21, 22, 23, BCMO1 24 none 24 Buffalo BCO2 14, 15, 16, 17, 18 1.20 (1.80) 14, 15, 16 0.91 (1.45) none 0.22 (0.40) combined 17, 18, 19, 20, 21, SCARB1 19, 20, 21 19 22, 23, 24, 25 18, 19, 20, 21, 22, 18, 19, 20, 21, 22, 18, 20, BCMO1 24 24 21, 22, 24

Jaffarbadi BCO2 13, 14, 16, 17, 18 4.01 (3.10) 13, 14, 16, 17, 18 4.01 (3,10) 13, 16 3.36 (2.51) 80

17, 18, 19, 20, 21, 17, 18, 19, 20, 21, SCARB1 18, 23 22, 23, 24, 25 22, 23, 24, 25 BCMO1 18, 20, 21, 22, 24 18, 24 none Murrah BCO2 14, 16, 17, 18 4.73 (1.99) 16, 17 4.46 (1.91) 16 4.01 (1.21) 17, 18, 19, 20, 21, SCARB1 18, 22, 24 none 22, 23, 24, 25

Table 3.8 continued.

Breed* Gene Posterior Prob. > 0.0** Posterior Prob. > 0.1 Posterior Prob. > 0.2 Total effect *** Total effect ** Total effect SNPs SNPs SNPs (µg/100ml) (µg/100ml) (µg/100ml) PP > 0.4 PP > 0.5 BCMO 18, 19, 20, 21, 22, 23, 18, 19, 20, 21, 22, 23, 24 24 Pandhar- 1 24 10.42 (7.33) 10.20 (7.23) 6.87 (3.67) puri BCO2 13, 14, 15, 16, 17, 18 13, 14, 15, 16, 17, 18 18 SCAR 17, 18, 19, 20, 21, 22, 23, 18, 19, 20, 21, 22, 23, 21, 22, 23, B1 24, 25 24, 25 25 PP > 0.4 PP > 0.5 BCMO 18, 20, 21, 22, 23, 24 18, 20, 23, 24 24 1 Surti 9.43 (5.67) 8.332 (4.89) 4.22 (2.99) BCO2 14, 6, 17, 18 16, 17 16 SCAR 17, 18, 19, 20, 21, 22, 23, 18, 19, 20, 21, 22, 23, 19 B1 24, 25 24 81

* No SNPs with P-value ≤ 0.3 were found for Mehsana, therefore no SNPs were included in these analyses
** Includes all SNPs with P-value ≤ 0.3
*** Represents EBV of an animal homozygous for all favorable alleles


Figure 3.1 Numbers of SNPs and their distribution derived by the next generation sequencing data across the three species.


CHAPTER 4. GENETIC BASIS OF BLOOD-BASED TRAITS AND THEIR RELATIONSHIP WITH PERFORMANCE AND ENVIRONMENT IN BEEF CATTLE AT WEANING

Josue Chinchilla-Vargas1†, L. M. Kramer1†, J. D. Tucker2, D. S. Hubbell III2, J. G. Powell2, T. D. Lester2, E. A. Backes2, K. Anschutz2, J. E. Decker3, K. J. Stalder1, M. F. Rothschild1, J. E. Koltes1*

1 Iowa State University, Department of Animal Science, Ames, Iowa, 50011.
2 University of Arkansas, Department of Animal Science, Fayetteville, Arkansas, 72701.
3 University of Missouri, Division of Animal Science, Columbia, Missouri, 65211.
† Shared equally in the production of this work

Modified from a manuscript published in Frontiers in Genetics: 11: 717.

Abstract

The objectives of this study were to explore the usefulness of blood-based traits as indicators of health and performance in beef cattle at weaning and to identify the genetic basis underlying the different blood parameters obtained from complete blood counts (CBCs). Disease costs represent one of the main factors determining profitability in animal production. Previous research has observed associations between blood cell counts and an animal’s health status in some species. CBCs were recorded at weaning from approximately 570 Angus-based, crossbred beef calves born in 2015 and 2016 and raised on toxic or novel tall fescue. The calves (N ≈ 600) were genotyped at a density of 50k SNPs, and the genotypes (N = 1160) were imputed to a density of 270k SNPs. Genetic parameters were estimated for 15 blood-based and 4 production traits. Finally, with the objective of identifying the genetic basis underlying the different blood-based traits, genome-wide association studies (GWAS) were performed for all traits.

Heritability estimates ranged from 0.11 to 0.60, and generally weak phenotypic correlations and strong genetic correlations were observed among blood-based traits only. The GWAS identified ninety-one 1-Mb windows that accounted for 0.5% or more of the estimated genetic variance for at least one trait, with 21 windows overlapping across two or more traits (explaining more than 0.5% of the estimated genetic variance for two or more traits). Five candidate genes were identified in the most interesting overlapping regions related to blood-based traits. Overall, this study represents one of the first efforts in the scientific literature to identify the genetic basis of blood cell traits in beef cattle. The results presented in this study allow us to conclude that: (1) blood-based traits have weak phenotypic correlations but strong genetic correlations among themselves; (2) blood-based traits have moderate to high heritability; and (3) there is evidence of an important overlap of genetic control among similar blood-based traits, which will allow for their use in improvement programs in beef cattle.

Introduction

Expenses associated with disease and feed are two of the main drivers of the cost of production in livestock operations, and the two are directly related because disease impacts feed intake (Irsik et al., 2006; Leach et al., 2013). There is limited information in the scientific literature regarding genetic parameters for blood-based traits in livestock, and most of the literature on this topic in beef cattle dates back to the second half of the 20th century.

With the greater use of molecular genetics by seedstock and commercial beef producers, animal breeding has experienced a paradigm shift (Meuwissen et al., 2016), with a larger number of traits and more information being used to increase the accuracy of identifying the best animals across a range of environments and production settings. Therefore, blood-based traits and other “forgotten” traits should be evaluated again using the methods currently available to better understand their usefulness in modern animal production.

In beef cattle, bovine respiratory disease (BRD) is among the most economically important traits in production (Snowder et al., 2007; Schneider et al., 2009). Infection can result in morbidity, mortality, and reduced average daily gain (ADG), which ultimately translates into reduced product quality and reduced overall system productivity (Griffin, 1997; Irsik et al., 2006; Fulton, 2009; Leach et al., 2013). Because blood samples are relatively easy to obtain when handling animals for other procedures, and because white blood cells are intrinsically present in peripheral blood, blood counts are an objective representation of the innate and adaptive immunity of the animals (Leach et al., 2013). In this regard, Leach et al. (2013) examined the genetic correlations between immune response to the BRD vaccine, the incidence of the disease and ADG. They reported that blood-based traits related to immunity, such as neutrophils (NE), lymphocytes (LY), eosinophils (EO) and basophils (BA), change significantly over time depending on the vaccination status of the animal (before or after a vaccination booster is applied). The research also showed significant correlations between the blood-based traits and ADG.

Previous efforts to identify genetic parameters for blood-based traits in beef cattle include those by Rowlands et al. (1977, 1983) and Richardson et al. (1996). Rowlands et al. (1983) estimated a heritability of 0.55 ± 0.18 for hemoglobin concentration in blood and a genetic correlation of -0.46 between hemoglobin concentration and growth rate, while Richardson et al. (1996) calculated repeatabilities ranging from 0.43 to 0.95 for various blood-based traits. More recently, Leach et al. (2013) found genetic correlations ranging from -0.48 to 0.86 between blood-based traits related to immunity.

Blood-based traits and their genetic basis have been given more attention in swine, perhaps because of the translational potential to humans that swine possess. Clapperton et al. (2008) compared heritability and genetic and phenotypic correlations of blood-based traits between herds with high and low health status at 30 kg and 90 kg of live weight. They found that heritability for white blood cell traits can vary greatly between herds exposed to different environments, with the heritability for number of white blood cells changing from 0.06 ± 0.11 in high health herds to 0.37 ± 0.16 in low health herds. Additionally, they found mostly negative, and in some cases strong, correlations between traits related to white blood cells and ADG, ranging from 0.03 to -0.62. Evidence for heritability and for moderate to strong genetic correlations of blood-based traits with growth traits could be helpful in identifying and selecting animals with more robust growth under stressful environments. Robust growth is defined as the ability, in the face of environmental constraints, to carry on doing the various things that the animal needs to do to express its full genetic potential through rapid growth and weight gain (Friggens et al., 2017).

More recently, Flori et al. (2011) estimated heritabilities of 0.73 ± 0.20 for total white blood cells (WBC) and 0.80 ± 0.21 for EO in swine. Mpetile et al. (2015) compared peripheral blood profiles from complete blood counts (CBCs) between lines of pigs selected for high and low residual feed intake (RFI). They found no significant correlations between RFI and the blood-based traits studied. Heritability estimates ranged from 0.04 for mean corpuscular hemoglobin concentration (MCHC) to 0.62 for red blood cell count (RBC).

In a study very similar to the one presented in this report, although in swine, Bovo et al. (2019) performed genome-wide association studies (GWAS) for 15 hematological traits and 15 clinical-biochemical traits, finding 52 quantitative trait loci (QTL) associated with 29 of the 30 traits investigated. They also estimated genomic variance parameters (SE) for blood-based traits, with estimates ranging from 0.14 (0.06) for EO to 0.40 (0.06) for mean corpuscular hemoglobin (MCH).

Overall, previous studies have indicated that blood-based traits may be useful as indicators of performance and health in intensive production settings. The objectives of this study were to explore the usefulness of blood-based traits as indicators of performance and health in beef cattle at weaning and to identify the genetic basis underlying the different blood parameters obtained from complete blood counts (CBCs).

Materials and methods

Data description

Complete blood counts (CBCs) were recorded from 570 crossbred cattle (Angus background crossed with Hereford, Charolais, Sim-Angus and Brangus) using blood samples collected at weaning during 2015 and 2016 at three research farms with similar management techniques at the University of Arkansas in Fayetteville and Batesville, AR. Animals were handled in accordance with the regulations of the University of Arkansas Institutional Animal Care and Use Committee (IACUC), under protocol number 16037. Blood samples were collected at weaning via jugular vein puncture into an EDTA blood tube and analyzed in a Hemavet HV 950 multispecies hematology system (HEMAVET, 2011). In addition, birth weight, weaning weight and age at weaning were recorded for all animals, and average daily gain and adjusted weaning weight at 205 days were calculated. The number of records from each farm is shown in Table 4.1, along with their respective year and calving season. Table 4.2 shows the traits included in the analyses along with their respective abbreviations and units. Animals at Savoy farm were raised on toxic fescue until weaning, while the majority of animals raised at Batesville and North farms were moved to novel fescue upon calving and were kept there until weaning.
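The text does not state the formulas used for average daily gain and adjusted weaning weight at 205 days. As a minimal sketch, assuming the common convention of pre-weaning gain per day and a projection of that gain to 205 days of age (without any age-of-dam adjustment), the two derived traits could be computed as follows. The function names and the example calf are hypothetical.

```python
def average_daily_gain(birth_weight, weaning_weight, age_at_weaning_days):
    """Pre-weaning average daily gain: weight gained per day from birth to weaning."""
    if age_at_weaning_days <= 0:
        raise ValueError("age at weaning must be positive")
    return (weaning_weight - birth_weight) / age_at_weaning_days


def adjusted_205_day_weight(birth_weight, weaning_weight, age_at_weaning_days):
    """Weaning weight projected to 205 days of age.

    Assumption: no age-of-dam adjustment is applied in this sketch.
    """
    adg = average_daily_gain(birth_weight, weaning_weight, age_at_weaning_days)
    return adg * 205 + birth_weight


# Hypothetical calf: 38 kg at birth, 240 kg at weaning, weaned at 210 days.
adg = average_daily_gain(38.0, 240.0, 210)
adj_ww = adjusted_205_day_weight(38.0, 240.0, 210)
print(round(adg, 3), round(adj_ww, 1))
```

Under this convention both derived traits are functions of the same birth and weaning weights, which is why they are expected to be very closely related to each other and to weaning weight.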

Blood samples were also collected for DNA isolation, following previously described methods (Sambrook, 2001), and for subsequent genotyping. Animals were genotyped using the GeneSeek Bovine GGP50 SNP chip or the GGP F250 SNP chip from GeneSeek (Neogen, 2016). Approximately 1100 animals related to the CBC-phenotyped individuals, including their parents, were genotyped with the GGP F250 chip, and their genotypes were used for imputation purposes. Genotype positions for all SNPs were updated to coordinates of the ARS1.2 bovine reference genome (https://www.ncbi.nlm.nih.gov/genome/gdv/browser/genome/?id=GCF_002263795.1).

Imputation

A total of 501 animals with CBC records were genotyped at a density of approximately 50k markers, while 1160 animals from the same population, including the parental generations, were genotyped at a density of approximately 250k markers. FImpute version 2.2 (Sargolzaei et al., 2014) was used to impute all genotypes to an approximate density of 270k markers. The resulting genotypes were used for further analyses. Imputation accuracy was not measured for the present project, but previous experience with similar projects has shown accuracies ranging between 90% and 95%.

Population structure

The population analyzed was divided into six contemporary groups defined by the combination of farm of origin, year of calving and calving season, as shown in Table 4.1. To visualize population structure, a principal component analysis (PCA) was performed (results not shown). For this purpose, genotypes of registered purebred Hereford, Black Angus, Red Angus, Gelbvieh, Limousin, Simmental and Shorthorn animals were used as references to quantify the genomic similarity between the individuals used in this study.
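The software used for the PCA is not named in the text. As an illustration only, a PCA of a genotype matrix can be sketched with a singular value decomposition of the marker-centered allele counts; the function name, the 0/1/2 coding and the simulated genotypes below are assumptions, not the study's data or pipeline.

```python
import numpy as np


def pca_variance_explained(genotypes, n_components=3):
    """Return PC scores and the proportion of variance explained per component.

    genotypes: (animals x markers) array of allele counts coded 0, 1, 2.
    """
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)  # center each marker on its mean allele count
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2  # proportional to the variance captured by each PC
    explained = variance / variance.sum()
    scores = u[:, :n_components] * s[:n_components]
    return scores, explained[:n_components]


# Simulated toy genotypes (20 animals x 50 markers), not real data.
rng = np.random.default_rng(0)
toy_genotypes = rng.integers(0, 3, size=(20, 50))
scores, explained = pca_variance_explained(toy_genotypes)
print(explained)
```

Plotting the first two columns of `scores`, with reference purebred animals marked, is one way to visualize how crossbred animals sit relative to the reference breeds.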

Statistical analyses

Frequentist approach to estimate phenotypic, genetic and genomic parameters

Phenotypic correlations between traits were estimated using the method of moments after adjusting the data for fixed effects that included contemporary group and sex. Genetic correlations between traits and narrow sense heritability (h2) for each trait were estimated in ASReml 3.0 (Mary et al., 2009) using an animal model with a genomic relationship matrix that included fixed effects for contemporary group (Ci) and sex of the animal (Sj), the random genetic effect of the animal (Dk) and a covariate for weaning weight (W):

$$\begin{bmatrix} P1_{ijk} \\ P2_{ijk} \end{bmatrix} = \mu + C_i + S_j + D_k + W + \varepsilon_{ijk}$$

For genetic correlations, a bivariate model was used while for heritability a univariate model with the same effects was implemented (P1 = phenotype 1, P2 = phenotype 2).
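For reference, the parameters obtained from these models follow the standard definitions below. This is a sketch in notation that does not appear in the original: $\sigma^2_{a}$ is the additive genetic variance of the animal effect, $\sigma^2_{e}$ the residual variance, and $\sigma_{a_1 a_2}$ the additive genetic covariance between phenotype 1 and phenotype 2 estimated by the bivariate model.

```latex
\[
h^2 = \frac{\sigma^2_{a}}{\sigma^2_{a} + \sigma^2_{e}},
\qquad
r_g = \frac{\sigma_{a_1 a_2}}{\sqrt{\sigma^2_{a_1}\,\sigma^2_{a_2}}}
\]
```

Because the fixed effects and the covariate are fitted in the model, the denominator of $h^2$ is the phenotypic variance remaining after accounting for them.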

Bayesian approach to estimate genomic parameters and identify genome-wide marker associations

Narrow sense heritability (h2) for each trait was estimated through a Bayesian analysis in GenSel4.0 (Fernando and Garrick, 2008) utilizing Bayes C (Habier et al., 2011). For these analyses, only markers with a minor allele frequency (MAF) greater than or equal to 0.02 were used. This filter guaranteed that at least 10 animals carrying the minor allele were included in the analyses. A large proportion of markers had a MAF lower than 0.02; after filtering, approximately 100,000 markers were retained. Additionally, the Pi value was set to 0.9877 in order to fit the random effects of approximately 250 markers in the model per iteration. Each chain consisted of 75,000 iterations with a burn-in of 5,000 (i.e., the first 5,000 samples were discarded). The same parameters were used to perform a genome-wide association study (GWAS) with the objective of exploring the genetic basis of blood-based traits. For this purpose, Bayes B (Stephens and Balding, 2009) was used, as it shrinks the genetic effects, with a larger shrinking factor on windows that show smaller genetic effects (Fernando and Garrick, 2013). For GWAS purposes, variance was examined for fixed genomic windows of 1 megabase (Mb). The model used for heritability estimation and for the GWAS was composed of fixed effects of contemporary group (CGi) and sex of the animal (Sj), a covariate for weaning weight (WW) and the random effects of the markers (Ml) fitted during each iteration:

$$P_{ijkl} = \mu + CG_i + S_j + M_l + WW + \varepsilon$$
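The statement that the MAF filter guarantees at least 10 carriers can be checked with simple arithmetic. The sketch below assumes, for illustration, that the 501 animals with CBC records and 50k genotypes are the animals entering the analysis; the function names are hypothetical.

```python
def minor_allele_copies(n_animals, maf):
    """Number of minor-allele copies among the 2 * n_animals alleles at a marker."""
    return 2 * n_animals * maf


def carrier_bounds(n_animals, maf):
    """Fewest and most animals that can carry the minor allele.

    Fewest: every carrier is homozygous (2 copies each).
    Most: every carrier is heterozygous (1 copy each).
    """
    copies = minor_allele_copies(n_animals, maf)
    return copies / 2, copies


# Assumption: 501 genotyped animals with phenotypes, MAF threshold of 0.02.
copies = minor_allele_copies(501, 0.02)
fewest, most = carrier_bounds(501, 0.02)
print(copies, fewest, most)
```

At the threshold there are about 20 copies of the minor allele, so even in the extreme case where every carrier is homozygous, roughly 10 animals carry it.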

Gene ontology and identification of candidate genes

To be considered significant, windows had to explain at least 0.5% of the estimated genetic variance. This threshold was chosen because there is little literature on blood-based traits and it is not known whether these traits are affected by a few genes with large effects or by multiple genes with small effects. Genes located in each significant genomic window were identified with Ensembl Biomart (http://useast.ensembl.org/biomart/martview/) by choosing the “cow genome” option with the ARS1.2 bovine genome version. Given the small number of annotated genes in most of the significant windows, traits were grouped into four broad categories to increase the probability of detecting enrichment for specific categories: red blood cell traits, white blood cell traits, platelet traits and growth traits (e.g., MCHC and RBC were grouped under red blood cell traits). The rationale was that traits related to the same type of cells should share a molecular basis (Iwasaki and Akashi, 2007). Once the list of genes located in all significant windows for a category of traits was obtained, an ontology enrichment test was performed with GO::TermFinder from Princeton University’s Lewis-Sigler Institute for Integrative Genomics (Boyle et al., 2004). For a term enrichment to be considered significant, the false discovery rate (FDR) had to be less than 5%. Given that the annotation of the human genome is far more complete than that of the cow genome, all gene ontology analyses were performed using the human genome as a reference. This approach, as discussed by Band et al. (2000), is a very useful tool to discover genes of agricultural importance.
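The window threshold can be illustrated with a simplified sketch. It assigns markers to fixed 1-Mb windows by position, sums a per-marker variance contribution within each window, and keeps windows whose share of the total is at least 0.5%. This additive summation is an assumption made for illustration; it is not a description of how GenSel computes window variances, and the marker values are invented toy numbers.

```python
from collections import defaultdict

WINDOW_SIZE_BP = 1_000_000  # fixed 1-Mb windows, as in the text
THRESHOLD = 0.005           # 0.5% of the estimated genetic variance


def window_variance_shares(marker_variances):
    """Sum per-marker variance into 1-Mb windows and return each window's share.

    marker_variances: iterable of (chromosome, position_bp, variance) tuples.
    Windows are keyed as (chromosome, window index), where index = position // 1 Mb.
    """
    totals = defaultdict(float)
    for chrom, pos, var in marker_variances:
        totals[(chrom, pos // WINDOW_SIZE_BP)] += var
    grand_total = sum(totals.values())
    return {window: value / grand_total for window, value in totals.items()}


def significant_windows(shares, threshold=THRESHOLD):
    """Windows whose share of the genetic variance meets the threshold."""
    return sorted(w for w, share in shares.items() if share >= threshold)


# Invented toy markers: (chromosome, position in bp, variance contribution).
toy_markers = [
    (1, 150_000, 0.4),
    (1, 900_000, 0.2),    # same window as the marker above
    (1, 1_200_000, 0.3),  # next window on chromosome 1
    (2, 50_000, 99.1),
]
shares = window_variance_shares(toy_markers)
print(significant_windows(shares))
```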

Finally, the genes located in windows that explained ≥ 0.5% of the estimated genetic variance in two or more traits were investigated with the objective of identifying possible candidate genes associated with the estimated genetic variance explained by each window. Genes were considered candidates when scientific literature was found linking them to physiological processes related to blood-based traits.
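Finding windows shared by two or more traits amounts to counting, for each window, the traits in which it passed the threshold. A minimal sketch follows; the trait-to-window mapping is hypothetical and does not reproduce the study's results.

```python
from collections import defaultdict


def overlapping_windows(significant_by_trait, min_traits=2):
    """Map each window to the traits it is significant for, keeping shared windows.

    significant_by_trait: dict of trait name -> list of (chromosome, Mb index) windows.
    """
    traits_per_window = defaultdict(set)
    for trait, windows in significant_by_trait.items():
        for window in windows:
            traits_per_window[window].add(trait)
    return {
        window: sorted(traits)
        for window, traits in traits_per_window.items()
        if len(traits) >= min_traits
    }


# Hypothetical significant windows per trait, keyed as (chromosome, Mb index).
toy_windows = {
    "RBC": [(5, 12), (7, 3)],
    "HCT": [(5, 12)],
    "WBC": [(7, 3), (9, 40)],
    "NE": [(20, 1)],
}
shared = overlapping_windows(toy_windows)
print(shared)
```

The genes annotated inside each shared window would then be the starting list for the literature search described above.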

Results

Population structure

There was a wide range of crossbred animals genotyped for this study. However, it should be noted that there is a small set of animals with a heavy Black Angus background. Given the high level of heterogeneity shown by PCA (proportion of variance explained: PC1 = 0.047, PC2 = 0.025, PC3 = 0.014), it was decided to ignore the animals’ breed for the purpose of statistical analyses.

Phenotypic correlations

Phenotypic correlations are presented below the diagonal in Figure 4.1. Stronger phenotypic correlations were observed between similar traits, such as WW and average daily gain (ADG), which had a phenotypic correlation of 0.86. In a similar manner, strong correlations were found between similar blood-based traits. Within red blood cell traits, hematocrit percentage (HCT) showed strong correlations of 0.81 and 0.77 with RBC and hemoglobin content (HB), respectively. Likewise, white blood cell traits tended to show stronger correlations among themselves, as in the case of WBC, which had correlations of 0.80 and 0.78 with LY and NE, respectively.

Phenotypic correlations between blood-based and production traits tended to be weak, with some exceptions. The strongest negative phenotypic correlation was found between MCHC and WW (–0.09). ADG and adjusted weaning weight to 205 days (adjWW) showed the strongest positive phenotypic correlations between production traits and blood-based traits; their correlations with red blood cell distribution width (RDW) were 0.29 in both cases. It should be highlighted that the correlation between ADG and adjWW was 1. Further, WW, ADG and adjWW had highly positive correlations. Both of these findings were expected given that these three traits are directly related, being calculated from the same weights.
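The correlation of 1 between ADG and adjWW follows directly from the footnote to Table 4.2, where adjWW is the gain per day multiplied by 205. The sketch below illustrates this with hypothetical calves. It assumes ADG was computed as (weaning weight − birth weight) / days at weaning, which is not stated explicitly in the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical calves: (birth weight, weaning weight, days at weaning).
calves = [(75, 520, 210), (82, 480, 195), (68, 555, 230), (90, 610, 220)]

# Assumed ADG definition: weight gained per day between birth and weaning.
adg = [(ww - bw) / days for bw, ww, days in calves]

# Table 4.2 footnote: the same daily gain scaled to 205 days.
adj_ww = [gain * 205 for gain in adg]

r = pearson(adg, adj_ww)
```

Because adjWW is ADG multiplied by a positive constant under this assumption, `r` equals 1 up to floating-point error regardless of the weights used.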

Genetic correlations

Genetic correlations between traits above the diagonal are shown in Figure 4.1. Genetic correlations were found to be markedly stronger than phenotypic correlations but followed the same trend of being stronger within trait groups. Birth weight (BW) was the production trait that showed the strongest genetic correlations with blood-based traits, having correlations of -0.68 and 0.70 with mean platelet volume (MPV) and EO, respectively. It should be noted that RDW and EO showed moderate to strong genetic correlations with all four production traits included in the present analyses as shown above the diagonal in Figure 4.1. Genetic correlations between production traits were strong, where the weakest correlation was 0.80 between BW and ADG.

When evaluating the genetic correlations between blood-based traits, MPV interestingly showed moderate to strong negative correlations with all white blood cell traits included in the analyses (BAlog = -0.43, EO = -0.68, LY = -0.31, NE = -0.77, WBC = -0.65). MPV also showed moderate to strong correlations with red blood cell traits, but these were predominantly positive. Platelets (PLT) showed strong genetic correlations with RBC, HB and HCT while, intriguingly, showing a weak genetic correlation of -0.1 with MPV.

As noted before with phenotypic correlations, red blood cell traits tended to show strong genetic correlations among themselves. Several red blood cell traits had relatively strong correlations with white blood cell traits. It is worth highlighting the strong positive correlations between HCT and EO with basophils (BAlog) and RBC, which were 0.98 and 0.99 respectively. Strong negative correlations were also observed between MCHC and HB with LY of -0.69 and -0.66, respectively.

Strong genetic correlations were found among all white blood cell traits and no negative genetic correlations were found amongst them. WBC had genetic correlations of 0.86 with LY and 0.93 with NE, while BAlog had correlations of 0.72 and 0.79 with NE and EO. Finally, genetic correlations between production traits and blood-based traits ranged from weak to strong with the weakest correlation being between MCV and BW (0.01) and the strongest one (0.70) between BW and EO.

Estimation of narrow sense heritability (h2)

Estimates produced by both Bayesian and frequentist analyses are shown in Table 4.3. Narrow sense heritability estimates were similar between estimation techniques, but estimates from the Bayesian approach tended to be lower. The greatest difference between methods was for monocytes (MO). Estimates from the frequentist approach ranged from 0.01 ± 0.05 for MO to 0.60 ± 0.10 for WW, while estimates from Bayesian analyses ranged from 0.11 ± 0.04 to 0.55 ± 0.07 for MO and WW, respectively.

It is important to note that while heritability estimates from the Bayesian analyses tended to be lower than those from frequentist analyses, there were cases where Bayesian estimates were larger than frequentist estimates, as in the cases of EO, HCT, MO, PLT and RDW. However, only MO shows heritability estimates that are different when standard errors and sampling errors are taken into consideration.
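One simple way to read "different when standard errors are taken into consideration" is to check whether the estimate ± 1 standard error intervals from the two methods overlap. The sketch below applies that reading to the MO and WW estimates quoted above. The criterion itself is an assumption, since the text does not state which comparison was used.

```python
def intervals_overlap(est_a, se_a, est_b, se_b, k=1.0):
    """True if est_a +/- k*se_a and est_b +/- k*se_b share any point."""
    lo_a, hi_a = est_a - k * se_a, est_a + k * se_a
    lo_b, hi_b = est_b - k * se_b, est_b + k * se_b
    return lo_a <= hi_b and lo_b <= hi_a


# (frequentist estimate, SE, Bayesian estimate, SE) as quoted in the text.
reported = {
    "MO": (0.01, 0.05, 0.11, 0.04),
    "WW": (0.60, 0.10, 0.55, 0.07),
}

# Assumed criterion: estimates "differ" when the +/- 1 SE intervals are disjoint.
overlap = {trait: intervals_overlap(*values) for trait, values in reported.items()}
```

Under this reading the MO intervals do not overlap while the WW intervals do, which is consistent with MO being singled out. A wider interval (for example `k=2`) would be a stricter test and could change the outcome.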

Genome-wide association study (GWAS)

A genome-wide association study was performed for each of the nineteen traits included in this study. Overall, 91 one-megabase windows explained more than 0.5% of the estimated genetic variance for at least one trait. All the windows identified are presented in Figure 4.2. Of these windows, only 15 showed a posterior probability of inclusion (PPI) of at least 60%.
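The two thresholds described above (share of genetic variance explained and PPI) amount to a two-stage filter over the window-level output. The sketch below shows that filter on a handful of records. The `Window` structure and field names are hypothetical, and the PPI values are invented for illustration; only the variance percentages of the first three records echo values reported in this section for MCH.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    trait: str
    chromosome: int
    start_mb: int
    pct_variance: float  # % of estimated genetic variance explained
    ppi: float           # posterior probability of inclusion, 0 to 1


def select_windows(windows, min_pct=0.5, min_ppi=None):
    """Keep windows above the variance threshold and, optionally, the PPI threshold."""
    kept = [w for w in windows if w.pct_variance >= min_pct]
    if min_ppi is not None:
        kept = [w for w in kept if w.ppi >= min_ppi]
    return kept


# Illustrative records; PPI values are hypothetical.
windows = [
    Window("MCH", 3, 119, 4.5, 0.85),
    Window("MCH", 30, 128, 4.75, 0.90),
    Window("MCH", 11, 70, 0.7, 0.20),
    Window("MCH", 7, 12, 0.1, 0.02),  # fully hypothetical, below both thresholds
]

important = select_windows(windows)                  # variance threshold only
well_supported = select_windows(windows, min_ppi=0.60)
```

Here `important` keeps three windows and `well_supported` keeps the two with PPI of at least 0.60.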

Figure 4.3 shows the GWAS results for MCH. For this trait, seven 1-Mb windows were found to be responsible for at least 0.5% of the estimated genetic variance. Windows starting at megabase 119 on chromosome 3 and megabase 128 on chromosome 30 had posterior probabilities larger than 60%, with the window on chromosome 3 explaining 4.5% of the estimated genetic variance and the window on chromosome 30 explaining approximately 4.75%. Several QTL related to average daily gain, meat quality and body conformation were found in these windows, along with multiple QTL (Orrù et al., 2012) associated with the trait in close proximity to the window on chromosome 4 identified in this GWAS.

Genome-wide association study results for MO are presented in Figure 4.4, which shows five windows explaining greater than 0.5% of the estimated genetic variance located on chromosomes 3, 8, 14, 19 and 22. The window starting at megabase 54 on chromosome 22 had a PPI larger than 60%. There are no previously reported QTL for this trait. Several QTL related to average daily gain, average daily feed intake, body conformation and meat quality were found in the windows described by the GWAS or in close proximity.

Two windows explaining more than 0.5% of the estimated genetic variance, on chromosomes 2 and 27, were found for MPV, as shown in Figure 4.5. The window on chromosome 2 showed a PPI > 60%. There are no QTL related to blood-based traits or growth traits reported in the windows found to be important for this trait and, in a similar fashion to other blood-based traits, there were no QTL previously described to have an effect on MPV values. Given the large number of traits examined in this study, GWAS results for all other traits are shown in supplemental figures.

Several genomic windows accounted for more than 0.5% of estimated genetic variance for multiple traits. As shown in Table 4.4, three windows explained greater than 0.5% of the estimated genetic variance for more than two traits. The window that was identified as important for the most traits was at megabase 70 on chromosome 11, which explained approximately 1.25%, 0.6%, 0.7%, 0.7% and 0.5% of the estimated genetic variance for HB, HCT, MCH, mean corpuscular volume (MCV) and red blood cell distribution width (RDW), respectively. There are no previous reports of QTL associated with HB. All reported QTL for MCV, MCH and RDW are found on chromosomes 4, 5, 15 and 25.
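Finding windows shared by several traits is a grouping step: collect the per-trait hits by window position and keep the windows that appear for enough traits. The sketch below uses window-trait pairs mentioned in this chapter (chromosome 11 at 70 Mb, chromosome 29 at 5 Mb and chromosome 22 at 54 Mb). The function name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def shared_windows(hits, min_traits=2):
    """Group (trait, chromosome, start_mb) hits by window.

    Returns only the windows associated with at least `min_traits` traits.
    """
    by_window = defaultdict(set)
    for trait, chromosome, start_mb in hits:
        by_window[(chromosome, start_mb)].add(trait)
    return {
        window: sorted(traits)
        for window, traits in by_window.items()
        if len(traits) >= min_traits
    }


# Window-trait pairs taken from the text of this chapter.
hits = [
    ("HB", 11, 70), ("HCT", 11, 70), ("MCH", 11, 70), ("MCV", 11, 70), ("RDW", 11, 70),
    ("MCV", 29, 5), ("RDW", 29, 5), ("BW", 29, 5),
    ("MO", 22, 54),
]

# "More than two traits" corresponds to min_traits=3.
multi_trait = shared_windows(hits, min_traits=3)
```

With this input, `multi_trait` contains the chromosome 11 window (five traits) and the chromosome 29 window (three traits), while the MO-only window on chromosome 22 is dropped.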

Gene ontology enrichment analysis

Once the significant windows were identified for each trait, gene ontology term enrichment was performed. Overall, the term “unannotated” was significantly enriched in six traits each for function and process. It is worth highlighting that in the individual trait ontology term enrichment analyses, RDW and BW were both significantly enriched for folic acid receptor activity and binding due to genes FOLR1, FOLR2 and FOLR3, spanning the 51 and 52 Mb windows on chromosome 15. Additionally, BW was significantly enriched for biological process terms such as response to oxygen-containing compound and response to endogenous stimulus.
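Term enrichment of this kind is usually assessed with a hypergeometric test: given a background of N genes of which K carry a term, how likely is it to see k or more annotated genes in a list of n? The sketch below computes that upper-tail probability with the standard library. All counts in the example are hypothetical and are not the values used in the study.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N genes, K annotated, n drawn)."""
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical counts: a background of 20,000 genes, 3 of which carry a
# given term, and a list of 50 genes containing all 3 of them.
p_value = hypergeom_enrichment_p(k=3, n=50, K=3, N=20000)
```

A small gene family that falls entirely inside the candidate windows yields a very small p-value, which is why three clustered genes can be enough to drive a significant term.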

Given the very limited scientific literature on the specific blood-based traits in cattle examined in this project, the main focus was directed to the broad categories of the traits to increase the chance of finding significant enrichment. Therefore, all the genes identified for traits that fell in the same category (e.g., MCHC and RBC were grouped under red blood cell traits) were grouped into one of four categories before analysis. The categories included red blood cell traits, white blood cell traits, platelet traits and growth traits. In total, 615, 365, 324 and 91 genes were found in windows significant for red blood cell traits, white blood cell traits, growth traits and platelet traits, respectively. The 10 most significantly enriched terms for each trait category are shown in Table 4.5. Interestingly, red and white blood cell traits shared significant enrichment for six terms, including nitrogen compound metabolic process (FDR ≤ 5%). On the other hand, growth traits shared only one significantly enriched term with platelet traits and two with white blood cell traits, while not sharing any with red blood cell traits. Finally, the ten most significant terms that were enriched for biological function for each of the trait categories, along with the respective FDR, are presented in Table 4.6. Overall, there was notably less enrichment for biological function, to the point where “unannotated” (FDR < 0.01%) was the only term significantly enriched for traits related to platelets, while protein binding (FDR < 0.001% for all categories) was shared by all other categories. As seen with BW, folic acid receptor activity was significantly enriched for growth traits. White blood cell traits showed enrichment for transcription regulatory region DNA binding and regulatory region nucleic acid binding.
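The comparison of enriched terms across trait categories is a pairwise set intersection. The sketch below uses only the biological function terms named in the paragraph above, so the sets are partial and do not reproduce Table 4.6. The function name and category keys are hypothetical.

```python
from itertools import combinations


def shared_terms(enriched):
    """Return the enriched terms shared by each pair of trait categories."""
    return {
        (a, b): enriched[a] & enriched[b]
        for a, b in combinations(sorted(enriched), 2)
    }


# Partial term sets, limited to the function terms named in the text.
enriched_function = {
    "red": {"protein binding"},
    "white": {
        "protein binding",
        "transcription regulatory region DNA binding",
        "regulatory region nucleic acid binding",
    },
    "growth": {"protein binding", "folic acid receptor activity"},
    "platelet": {"unannotated"},
}

shared = shared_terms(enriched_function)
```

With these partial sets, protein binding is the only term shared among the red blood cell, white blood cell and growth categories, and the platelet category shares nothing with the others.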

Discussion

Most blood-based traits and growth traits are weakly correlated

In the present study, genetic correlations among growth and blood-based traits followed the same pattern as phenotypic correlations, with the exceptions of the correlations between MPV and BW, and of RDW and EO with all growth traits. Clapperton et al. (2008) reported similar results in swine, describing weak and mostly negative phenotypic correlations between several subsets of white blood cells and ADG, ranging from -0.11 to 0.16. The same study found strong genetic correlations between white blood cell related traits and average daily gain that ranged from -0.58 to 0.23. Those results differ from the results in the present study, perhaps because of differences in species. Leach et al. (2013) found weak and mostly negative genetic correlations between white blood cell traits and ADG, supporting the present findings.

Overall, the findings of this study indicate that phenotypic and genetic correlations, with a few exceptions, tended to be weak between blood cell traits and growth traits. The weak correlations between blood-based traits and growth traits limit the potential for blood-based traits to be used as indicators of performance under varying environments. Strong genetic correlations between blood-based traits indicate the existence of an important overlap in genetic control and can be considered as evidence of pleiotropic effects playing a role in regulating multiple blood-based traits, as found previously by Lukowski et al. (2017).

Blood-based traits tend to have moderate to high heritability

Heritability (h2) estimates in the present study are in line with what is generally reported in the literature (Kitchenham and Rowlands, 1976; Bourdon and Brinks, 1982; Rowlands et al., 1983; Arnold et al., 1991; Bullock et al., 1993; Bennett and Gregory, 1996; Phocas and Laloë, 2004; Wright et al., 2014; Mpetile et al., 2015; Snelling et al., 2019). Heritability estimates for white blood cell traits reported by Leach et al. (2013) ranged from 0.28 to 0.50, confirming the findings of the present study. The only exception was MO, which had heritability estimates of 0.11 and 0.01 for the Bayesian and frequentist approaches, respectively; both are lower than the estimates previously reported by Leach et al. (2013), which ranged from 0.21 to 0.39. After an extensive literature review, we believe the present findings provide some of the few genetic parameter estimates for several blood-based traits in beef cattle at weaning age.

Bayesian analyses tended to produce lower estimates of heritability than frequentist analyses, indicating the possibility of missing heritability. Missing rare variants is possible in this case given that the population used to produce the estimates is relatively small (~570 animals) and thus, a larger study population might be needed to capture rare variants. However, the GGP F250 SNPchip is the chip with the highest number of rare variants included and therefore should capture the rare variants present in our population. To differentiate between missing heritability and the possibility of missing genetic variance because the pi value used in the Bayesian analyses was too large, and therefore did not take into consideration all markers that explained genetic variance, heritability estimates (data not shown) were produced using Bayes C priors with a pi value of zero. The difference in estimates of heritability when using the different pi values was minimal. Possible reasons for missing heritability may be the small sample size, epistatic interactions between markers, other structural variations in the genome like copy number variants (CNV), linkage disequilibrium (LD) or rare alleles not present in the studied population (Clarke and Cooper, 2010; Vineis and Pearce, 2010; Makowsky et al., 2011; Zuk et al., 2012). Other possible explanations for the missing genetic variance could be that the SNP chips used to genotype the animals did not include markers that explain variance for the trait, or perhaps some of the rare variants that explained genetic variance for the traits were lost through filtering SNPs that had MAF lower than 0.02. Another possible cause could be the different marker information content given the different allele frequencies in the breeds that are admixed in the population used for this research. However, the Bovine GGP F250 SNPchip contains a very large number of rare variants and includes data from most of the breeds that compose the population used for this study (Neogen, 2016).
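The MAF filter mentioned above is one place where rare variants can be lost before analysis. The sketch below computes minor allele frequency from 0/1/2 allele counts and drops markers below 0.02. The genotype coding, the use of `None` for missing calls and the SNP names are assumptions for illustration and do not describe the study's actual quality-control pipeline.

```python
def minor_allele_frequency(genotypes):
    """MAF from allele counts coded 0/1/2 per animal; None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(freq, 1 - freq)


def filter_by_maf(snps, threshold=0.02):
    """Keep SNPs whose MAF is at least `threshold`."""
    return {
        name: calls
        for name, calls in snps.items()
        if minor_allele_frequency(calls) >= threshold
    }


# Hypothetical markers.
snps = {
    "snp_common": [0, 1, 2, 1, 0, 1, 2, 0, 1, 1],        # MAF = 9/20 = 0.45
    "snp_rare": [0] * 99 + [1],                           # MAF = 1/200 = 0.005
    "snp_major_coded": [2] * 45 + [1] * 4 + [None],       # MAF = 4/98, about 0.041
}

kept = filter_by_maf(snps)
```

In this example `snp_rare` is removed even though it is polymorphic, which illustrates how a variant carried by a single animal in a small population would not contribute to the estimated genetic variance.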

Maternal genetic effects do not impact genetic and genomic correlations or heritability

Another concern was that heritability estimates could be inflated due to maternal genetic effects. Given that blood-based and productivity traits were measured at weaning and that the original analyses did not take into consideration maternal effects that might be significant, a model accounting for maternal effects was implemented using ASReml (Mary et al., 2009) (data not shown). The difference in estimates from models with and without maternal genetic effects was very small, indicating that maternal effects have limited influence on blood-based traits and productivity at weaning. However, estimation of maternal effects is complex, and given the amount of data available these results should be taken as preliminary.

Genome wide association study results identify few genomic regions with large effects

Although the GWAS performed for each trait revealed 91 windows of 1 megabase in length that explained at least 0.5% of the estimated genetic variance, the most interesting results were the numerous overlaps of windows that were important for different traits. Windows of importance were identified on chromosome 23 for only three traits: ADG, adjWW and MCHC. No GWAS windows for any white blood cell trait were identified on chromosome 23, where the MHC complex is found in bovines (Steiner et al., 2014). These results provide evidence of possible pleiotropic genes influencing multiple blood-based traits as well as influencing both blood-based traits and productivity traits.

Several candidate genes were identified for windows associated with blood-based traits and overlapping with growth traits

The only overlap among white blood cell traits was found on chromosome 9, megabase 10. This window contains genes like SMAP1, a protein-coding gene that has been linked to erythropoietic and overall hematopoietic activity in mice (Behl et al., 2012) and receptor endocytosis in mammals (Sato et al., 1998), which makes it a good candidate gene for further studies.

In the case of red blood cell traits, the most important window was found on chromosome 11, starting at position 70 megabases. This window was significant for five traits. A total of 20 annotated genes were identified in this window. Promising candidate genes included CAPN13, a gene previously associated with hypertension in humans (Kobayashi et al., 2014), and LCLAT1, previously identified to control development of hematopoietic and endothelial lineages in mouse embryos (Taylor et al., 2010). Other interesting candidate genes for further research were YPEL5 and SPDYA, which have been linked to cell cycle progression (Wang et al., 2007) and regulation of the CD133+ cell population in humans (Hosono et al., 2010).

Another genomic region with overlapping associations across multiple traits was found at 5-6 Mb on chromosome 29. This window overlapped traits associated with blood cells (MCV and RDW) and performance (BW). A total of ten annotated genes and 18 unannotated genes were found in this region, with no good candidates identified. A set of windows located at 51-53 Mb on chromosome 15 overlapped these same traits. As shown in the results section, these windows contain genes associated with folic acid binding (FOLR1, FOLR2 and FOLR3). In cattle, it has been shown that folate supplementation can reduce the occurrence of dystocia by up to 50% (Duplessis et al., 2014).

It has also been shown that folate can increase milk production and can modify the concentration of amino acids in blood plasma (Graulet et al., 2007). In humans, it has been shown that folate plays a crucial role in nucleic acid synthesis, cell division, regulation of gene expression, amino acid metabolism and neurotransmitter synthesis during fetal development (Djukic, 2007). More importantly, during pregnancy folate intake is crucial for rapid cell proliferation and tissue growth in the uterus and placenta, growth of the fetus and expansion of maternal blood flow (Rondo and Tomkins, 2000), to the point that in humans, folate requirements are 5- to 10-fold greater during pregnancy (Antony, 2007). Although in humans significantly increased birth weight has been observed when women take folic acid supplementation during gestation (Fekete et al., 2012), no effect has been found in cattle (Girard et al., 1995). In cattle, there is evidence of folic acid supplementation leading to increased milk production over a complete lactation from cows on their second lactation or greater (Girard et al., 1995), which could be translated into greater weaning weights in calves from cows with optimum folate intake. There is also a link between folate and blood-based traits. Previous research has shown that folate is required for proliferation of erythroblasts during differentiation (Koury and Ponka, 2004). Moreover, folate and iron deficiency cause erythroblast apoptosis through the impairment of protein and DNA (Blount and Ames, 1995). Although the candidate genes identified in the different windows have been linked to biological processes related to blood-based traits, it is important to keep in mind that these genes have not yet been studied in cattle, and therefore further research in beef cattle is needed to elucidate their roles and their potential use as a tool for breeders to accelerate genetic improvement.

Conclusions

The present study represents one of the first efforts to identify the genetic basis of blood-based traits in beef cattle. The results presented in this study allow us to conclude that: (1) blood-based traits have weak phenotypic correlations, but strong genetic correlations among themselves compared to growth traits. (2) Blood-based traits have moderate to high heritability. (3) There is evidence of an important overlap in genetic control among similar blood-based traits and between some blood-based traits and growth traits. Additionally, multiple windows overlapping blood-based traits and growth traits were identified, along with candidate genes that show a biological function that ties these traits together.

The present study also provides evidence that most blood-based traits are heritable, with some exhibiting correlations with growth traits.

Further studies are warranted to determine if CBCs may act as indicators of growth performance under different environments as a means of capturing relationships with immune status, nutrition and environment under different production settings.

Availability of data and materials

Genotype data is available online at Open Science Framework:

DOI 10.17605/OSF.IO/E4QMU

All GWAS results will be deposited at the animal QTLdb database.

Author contributions

Experimental design: JEK.

Livestock management and data collection: DH, JED, JP and EB.

Data analysis and management: JCV, LMK, MFR, KA and JEK.

Manuscript writing: JCV, LMK, MFR, KJS and JEK.

Manuscript review and editing: MFR, KJS, LK and JEK.

Conflict of interest

There are no conflicts of interest (financial, personal or other relationships with people and/or organizations) that could inappropriately have influenced our work.

Funding

Funding for this study was provided by the University of Arkansas Division of Agriculture Batesville Matching Funds Program and startup funds provided in part from USDA-NIFA HATCH project number ARK02523. In addition, research reported in this paper is partially supported by the HPC@ISU equipment at Iowa State University, some of which has been purchased through funding provided by NSF under MRI grant number CNS 1229081 and CRI grant number 1205413.

Literature cited

Antony, A. C. (2007). In utero physiology: Role of folic acid in nutrient delivery and fetal development. Am. J. Clin. Nutr. 85, 598–603. doi:10.1093/ajcn/85.2.598s.

Arnold, J. W., Bertrand, J. K., Benyshek, L. L., and Ludwig, C. (1991). Estimates of genetic parameters for live animal ultrasound, actual carcass data, and growth traits in beef cattle. J. Anim. Sci. 69, 985–992. doi:10.2527/1991.693985x.

Band, M. R., Larson, J. H., Rebeiz, M., Green, C. A., Heyen, D. W., Donovan, J., et al. (2000). An ordered comparative map of the cattle and human genomes. Genome Res. 10, 1359–1368. doi:10.1101/gr.145900.

Behl, J. D., Verma, N. K., Tyagi, N., Mishra, P., Behl, R., and Joshi, B. K. (2012). The Major Histocompatibility Complex in Bovines: A Review. ISRN Vet. Sci. 2012, 1–12. doi:10.5402/2012/872710.

Bennett, G. L., and Gregory, K. E. (1996). Genetic (Co)variances among Birth Weight, 200-Day Weight, and Postweaning Gain in Composites and Parental Breeds of Beef Cattle. J. Anim. Sci. 74, 2598–2611. doi:10.2527/1996.74112598x.

Blount, B. C., and Ames, B. N. (1995). DNA damage in folate deficiency. Baillieres. Clin. Haematol. 8, 461–478. doi:10.1016/S0950-3536(05)80216-1.

Bourdon, R. M., and Brinks, J. S. (1982). Genetic, environmental and phenotypic relationships among gestation length, birth weight, growth traits and age at first calving in beef cattle. J. Anim. Sci. 55, 543–553. doi:10.2527/jas1982.553543x.

Bovo, S., Mazzoni, G., Bertolini, F., Schiavo, G., Galimberti, G., Gallo, M., et al. (2019). Genome-wide association studies for 30 haematological and blood clinical-biochemical traits in Large White pigs reveal genomic regions affecting intermediate phenotypes. Sci. Rep. 9, 1–17. doi:10.1038/s41598-019-43297-1.

Boyle, E. I., Weng, S., Gollub, J., Jin, H., Botstein, D., Cherry, J. M., et al. (2004). GO::TermFinder - Open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20, 3710–3715. doi:10.1093/bioinformatics/bth456.

Bullock, K. D., Bertrand, J. K., and Benyshek, L. L. (1993). Genetic and environmental parameters for mature weight and other growth measures in Polled Hereford cattle. J. Anim. Sci. 71, 1737–1741. doi:10.2527/1993.7171737x.

Clapperton, M., Glass, E. J., and Bishop, S. C. (2008). Pig peripheral blood mononuclear leucocyte subsets are heritable and genetically correlated with performance. Animal 2, 1575–1584. doi:10.1017/S1751731108002929.

Clarke, A. J., and Cooper, D. N. (2010). GWAS: Heritability missing in action. Eur. J. Hum. Genet. 18, 859–861. doi:10.1038/ejhg.2010.35.

Djukic, A. (2007). Folate-Responsive Neurologic Diseases. Pediatr. Neurol. 37, 387–397. doi:10.1016/j.pediatrneurol.2007.09.001.

Duplessis, M., Girard, C., Santschi, D., Laforest, J., Durocher, J., and Pellerin, D. (2014). Effects of folic acid and vitamin B12 supplementation on culling rate, diseases, and reproduction in commercial dairy herds. doi:10.3168/jds.2013-7369.

Fekete, K., Berti, C., Trovato, M., Lohner, S., Dullemeijer, C., Souverein, O. W., et al. (2012). Effect of folate intake on health outcomes in pregnancy: A systematic review and meta-analysis on birth weight, placental weight and length of gestation. doi:10.1186/1475-2891-11-75.

Fernando, R. L., and Garrick, D. J. (2008). GenSel-User manual for a portfolio of genomic selection related analyses. Anim. Breed. Genet. Iowa State Univ. Ames, 0–24.

Fernando, R. L., and Garrick, D. J. (2013). “Bayesian Methods Applied to GWAS,” in Genome-Wide Association Studies and Genomic Prediction, Methods in Molecular Biology, ed. C. G. et Al. (Springer Science+Business Media), 237–274. doi:10.1007/978-1-62703-447-0.

Flori, L., Gao, Y., Laloë, D., Lemonnier, G., Leplat, J. J., Teillaud, A., et al. (2011). Immunity traits in pigs: Substantial genetic variation and limited covariation. PLoS One 6. doi:10.1371/journal.pone.0022717.

Friggens, N. C., Blanc, F., Berry, D. P., and Puillet, L. (2017). Review: Deciphering animal robustness. A synthesis to facilitate its use in livestock breeding and management. Animal 11, 2237–2251. doi:10.1017/S175173111700088X.

Fulton, R. W. (2009). Bovine respiratory disease research (1983-2009). Anim. Health Res. Rev. 10, 131–139. doi:10.1017/S146625230999017X.

Girard, C. L., Matte, J. J., and Tremblay, G. F. (1995). Gestation and Lactation of Dairy Cows: A Role for Folic Acid? J. Dairy Sci. 78, 404–411. doi:10.3168/jds.S0022-0302(95)76649-8.

Graulet, B., Matte, J. J., Desrochers, A., Doepel, L., Palin, M. F., and Girard, C. L. (2007). Effects of dietary supplements of folic acid and vitamin B12 on metabolism of dairy cows in early lactation. J. Dairy Sci. 90, 3442–3455. doi:10.3168/jds.2006-718.

Griffin, D. (1997). Economic impact associated with respiratory disease in beef cattle. Vet. Clin. North Am. Food Anim. Pract. 13, 367–377. doi:10.1016/S0749-0720(15)30302-9.

Habier, D., Fernando, R. L., Kizilkaya, K., and Garrick, D. J. (2011). Extension of the bayesian alphabet for genomic selection. BMC Bioinformatics 12. doi:10.1186/1471-2105-12-186.

HEMAVET (2011). Product reference manual for HEMAVET® HV950 multispecies hematology instruments Part Number M-950HV.

Hosono, K., Noda, S., Shimizu, A., Nakanishi, N., Ohtsubo, M., Shimizu, N., et al. (2010). YPEL5 protein of the YPEL gene family is involved in the cell cycle progression by interacting with two distinct RanBPM and RanBP10. Genomics 96, 102–111. doi:10.1016/j.ygeno.2010.05.003.

Irsik, M., Langemeier, M., Schroeder, T., Spire, M., and Roder, J. D. (2006). Estimating the Effects of Animal Health on the Performance of Feedlot Cattle. Bov. Pract. 40, 65–74.

Iwasaki, H., and Akashi, K. (2007). Hematopoietic developmental pathways: On cellular basis. Oncogene 26, 6687–6696. doi:10.1038/sj.onc.1210754.

Kitchenham, B. A., and Rowlands, G. J. (1976). Differences in the concentrations of certain blood constituents among cows in a dairy herd. J. Agric. Sci. 86, 171–179. doi:10.1017/S0021859600065126.

Kobayashi, N., Kon, S., Henmi, Y., Funaki, T., Satake, M., and Tanabe, K. (2014). The Arf GTPase-activating protein SMAP1 promotes transferrin receptor endocytosis and interacts with SMAP2. Biochem. Biophys. Res. Commun. 453, 473–479. doi:10.1016/j.bbrc.2014.09.108.

Koury, M. J., and Ponka, P. (2004). New insights into erythropoiesis: The roles of folate, vitamin B12, and iron. Annu. Rev. Nutr. 24, 105–131. doi:10.1146/annurev.nutr.24.012003.132306.

Leach, R. J., Chitko-McKown, C. G., Bennett, G. L., Jones, S. A., Kachman, S. D., Keele, J. W., et al. (2013). The change in differing leukocyte populations during vaccination to bovine respiratory disease and their correlations with lung scores, health records, and average daily gain. J. Anim. Sci. 91, 3564–3573. doi:10.2527/jas.2012-5911.

Lukowski, S. W., Lloyd-Jones, L. R., Holloway, A., Kirsten, H., Hemani, G., Yang, J., et al. (2017). Genetic correlations reveal the shared genetic architecture of transcription in human peripheral blood. Nat. Commun. 8. doi:10.1038/s41467-017-00473-z.

Makowsky, R., Pajewski, N. M., Klimentidis, Y. C., Vazquez, A. I., Duarte, C. W., Allison, D. B., et al. (2011). Beyond missing heritability: Prediction of complex traits. PLoS Genet. 7, 1002051. doi:10.1371/journal.pgen.1002051.

Mary, Q., End, M., and Biology, C. (2009). ASReml User Guide. Available at: www.vsni.co.uk [Accessed November 1, 2019].

Meuwissen, T., Hayes, B., and Goddard, M. (2016). Genomic selection: A paradigm shift in animal breeding. Anim. Front. 6, 6–14. doi:10.2527/af.2016-0002.

Mpetile, Z., Young, J. M., Gabler, N. K., Dekkers, J. C. M., and Tuggle, C. K. (2015). Assessing peripheral blood cell profile of Yorkshire pigs divergently selected for residual feed intake. J. Anim. Sci. 93, 892–899. doi:10.2527/jas.2014-8132.

Neogen (2016). GeneSeek Genomic Profiler F250. Available at: www.neogen.com/Genomics [Accessed October 31, 2019].

Orrù, L., Abeni, F., Catillo, G., Grandoni, F., Crisà, A., de Matteis, G., et al. (2012). Leptin gene haplotypes are associated with change in immunological and hematological variables in dairy cow during the peripartum period. J. Anim. Sci. 90, 16–26. doi:10.2527/jas.2010-3706.

Phocas, F., and Laloë, D. (2004). Genetic parameters for birth and weaning traits in French specialized beef cattle breeds. Livest. Prod. Sci. 89, 121–128. doi:10.1016/j.livprodsci.2004.02.007.

Richardson, E. C., Herd, R. M., Arthur, P. F., Wright, J., Xu, G., Dibley, K., et al. (1996). Possible Physiological Indicators for Net Feed Conversion Efficiency in Beef Cattle. Aust. Soc. Anim. Prod. 21, 103–106. Available at: http://livestocklibrary.com.au/handle/1234/8783 [Accessed November 18, 2019].

Rondo, P. H. C., and Tomkins, A. M. (2000). Folate and intrauterine growth retardation. Ann. Trop. Paediatr. 20, 253–258. doi:10.1080/02724936.2000.11748144.

Rowlands, G. J., Manston, R., Bunch, K. J., and Brookes, P. A. (1983). A genetic analysis of the concentrations of blood metabolites and their relationships with age and liveweight gain in young British Friesian bulls. Livest. Prod. Sci. 10, 1–16. doi:10.1016/0301-6226(83)90002-7.

Rowlands, G. J., Stark, A. J., Manston, R., Lewis, W. H., and Saunders, R. W. (1977). The blood composition of different breeds of bulls undergoing beef performance tests. Res. Vet. Sci. 23, 348–350. doi:10.1016/s0034-5288(18)33130-8.

Sambrook, J. (2001). Molecular Cloning: A Laboratory Manual. Third Edit. Cold Spring Harbor, N.Y.: Cold Spring Harbor Laboratory Press.

Sargolzaei, M., Chesnais, J. P., and Schenkel, F. S. (2014). A new approach for efficient genotype imputation using information from relatives. BMC Genomics 15, 478. doi:10.1186/1471-2164-15-478.

Sato, Y., Hong, H. N., Yanai, N., and Obinata, M. (1998). Involvement of stromal membrane-associated protein (SMAP-1) in erythropoietic microenvironment. J. Biochem. 124, 209–216. doi:10.1093/oxfordjournals.jbchem.a022082.

Schneider, M. J., Tait, R. G., Busby, W. D., and Reecy, J. M. (2009). An evaluation of bovine respiratory disease complex in feedlot cattle: Impact on performance and carcass traits using treatment records and lung lesion scores. J. Anim. Sci. 87, 1821–1827. doi:10.2527/jas.2008-1283.

Snelling, W. M., Kuehn, L. A., Thallman, R. M., Bennett, G. L., and Golden, B. L. (2019). Genetic correlations among weight and cumulative productivity of crossbred beef cows. J. Anim. Sci. 97, 63–77. doi:10.1093/jas/sky420.

Snowder, G. D., Van Vleck, L. D., Cundiff, L. V., Bennett, G. L., Koohmaraie, M., and Dikeman, M. E. (2007). Bovine respiratory disease in feedlot cattle: Phenotypic, environmental, and genetic correlations with growth, carcass, and longissimus muscle palatability traits. J. Anim. Sci. 85, 1885–1892. doi:10.2527/jas.2007-0008.

Steiner, W., Leisch, F., and Hackländer, K. (2014). A review on the temporal pattern of deer-vehicle accidents: Impact of seasonal, diurnal and lunar effects in cervids. Accid. Anal. Prev. 66, 168–181. doi:10.1016/j.aap.2014.01.020.

Stephens, M., and Balding, D. J. (2009). Bayesian statistical methods for genetic association studies. Nat. Rev. Genet. 10, 681–690. doi:10.1038/nrg2615.

Taylor, J. Y., Sun, Y. V., Hunt, S. C., and Kardia, S. L. R. (2010). Gene-environment interaction for hypertension among African American women across generations. Biol. Res. Nurs. 12, 149–155. doi:10.1177/1099800410371225.

Vineis, P., and Pearce, N. (2010). Missing heritability in genome-wide association study research. Nat. Rev. Genet. 11, 589. doi:10.1038/nrg2809-c2.

Wang, C., Faloon, P. W., Tan, Z., Lv, Y., Zhang, P., Ge, Y., et al. (2007). Mouse lysocardiolipin acyltransferase controls the development of hematopoietic and endothelial lineages during in vitro embryonic stem-cell differentiation. Blood 110, 3601–3609. doi:10.1182/blood-2007-04-086827.

Wright, F. A., Sullivan, P. F., Brooks, A. I., Zou, F., Sun, W., Xia, K., et al. (2014). Heritability and genomics of gene expression in peripheral blood. Nat. Genet. 46, 430–437. doi:10.1038/ng.2951.

Zuk, O., Hechter, E., Sunyaev, S. R., and Lander, E. S. (2012). The mystery of missing heritability: Genetic interactions create phantom heritability. Proc. Natl. Acad. Sci. 109, 1193–1198. doi:10.1073/pnas.1119675109.


Tables and figures

Table 4.1 Distribution of animals by farm, year and calving season.

Farm        Year  Calving season  Number of records
Savoy       2015  Spring          -
Savoy       2015  Fall            205
Savoy       2016  Spring          76
Savoy       2016  Fall            -
Batesville  2015  Spring          -
Batesville  2015  Fall            72
Batesville  2016  Spring          38
Batesville  2016  Fall            157
North       2015  Spring          -
North       2015  Fall            -
North       2016  Spring          22
North       2016  Fall            -


Table 4.2 Description of traits analyzed.

Trait                                      Abbreviation  Unit
Hemoglobin content                         HB            g/dL
Hematocrit percentage                      HCT           %
Mean corpuscular hemoglobin                MCH           pg
Mean corpuscular volume                    MCV           fL
Mean corpuscular hemoglobin concentration  MCHC          g/dL
Red blood cells                            RBC           M/uL
Red blood cell distribution width          RDW           %
Basophils                                  BA            K/uL
Basophils (logarithm)                      BAlog         Log
Eosinophils                                EO            K/uL
Lymphocytes                                LY            K/uL
Monocytes                                  MO            K/uL
Neutrophils                                NE            K/uL
White blood cells                          WBC           K/uL
Mean platelet volume                       MPV           fL
Platelets                                  PLT           K/uL
Birth weight                               BW            lbs
Weaning weight                             WW            lbs
Adjusted weaning weight¹                   adjWW         lbs
Average daily gain                         ADG           lbs

¹ Calculated as: ((weaning weight − birth weight) / days at weaning) × 205
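As an illustration of the footnote formula for adjusted weaning weight, the short Python sketch below computes the trait from birth weight, weaning weight and age at weaning. The function name and the example calf are hypothetical and are not taken from the data set.

```python
def adjusted_weaning_weight(weaning_weight, birth_weight, days_at_weaning, standard_age=205):
    """Return ((weaning weight - birth weight) / days at weaning) * 205, in lbs."""
    if days_at_weaning <= 0:
        raise ValueError("days_at_weaning must be positive")
    # Daily gain between birth and weaning, scaled to a 205-day standard age.
    daily_gain = (weaning_weight - birth_weight) / days_at_weaning
    return daily_gain * standard_age


# Hypothetical calf: 80 lbs at birth, 500 lbs at weaning, weaned at 210 days.
example_adjww = adjusted_weaning_weight(500.0, 80.0, 210)
print(round(example_adjww, 1))  # 410.0
```

Written this way, the adjusted value is the pre-weaning daily gain multiplied by a constant.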


Table 4.3 Narrow sense heritability (h2) estimates for blood and growth traits.

                                                  Approach
Trait                                             Bayesian¹    Frequentist²
Average daily gain (ADG)                          0.48 (0.07)  0.54 (0.11)
Adjusted weaning weight (adjWW)                   0.48 (0.07)  0.54 (0.11)
Birth weight (BW)                                 0.38 (0.08)  0.41 (0.10)
Weaning weight (WW)                               0.55 (0.07)  0.60 (0.10)
Basophils (Balog)                                 0.15 (0.06)  0.23 (0.10)
Eosinophils (EO)                                  0.15 (0.05)  0.11 (0.08)
Hemoglobin (HB)                                   0.25 (0.08)  0.27 (0.10)
Hematocrits (HCT)                                 0.17 (0.06)  0.11 (0.08)
Lymphocytes (LY)                                  0.22 (0.07)  0.26 (0.10)
Mean corpuscular hemoglobin (MCH)                 0.48 (0.09)  0.52 (0.11)
Mean corpuscular hemoglobin concentration (MCHC)  0.42 (0.09)  0.46 (0.11)
Mean corpuscular volume (MCV)                     0.40 (0.09)  0.44 (0.11)
Monocytes (MO)                                    0.11 (0.04)  0.01 (0.05)
Mean platelet volume (MPV)                        0.23 (0.07)  0.24 (0.10)
Neutrophils (NE)                                  0.28 (0.07)  0.30 (0.10)
Platelets (PLT)                                   0.18 (0.06)  0.16 (0.09)
Red blood cells number (RBC)                      0.32 (0.08)  0.35 (0.10)
Red blood cell distribution width (RDW)           0.24 (0.08)  0.18 (0.10)
White blood cells (WBC)                           0.31 (0.08)  0.32 (0.10)

¹ Sampling error shown in parentheses.
² Standard error shown in parentheses.

Table 4.4 Significant windows overlapping over different traits.

Chr BA(log) EO HB HCT MCH MCHC MCV MPV NE RBC RDW WBC ADG WW BW
3 1011 " 119 2 119 101 101 4 116 116 7 1 3 0 0 9 10 10 11 88 70 70 70 70 88 70 12 71 71 14 1 62 1 62 15 51 78 51 77 78 52 22 55 24 59 59 111 27 5 6 16 6 16 6 29 5 5 5 X 55 27 55 27

1 Windows highlighted in blue explained more than 0.5% of estimated genetic variance for three or more traits.
" Numbers in the cells represent the megabase at which the 1-megabase window starts.
2 Orange highlighted windows explained more than 0.5% of estimated genetic variance for two traits.
3 Green highlighted windows explained more than 0.5% of estimated genetic variance for one trait and are immediately next to a significant window for a different trait.

Table 4.5 The ten most significantly enriched terms for biological process for each trait category.

Red Blood Cell Traits                          FDR¹
cellular process²                              0.00%
organic substance metabolic process            0.00%
metabolic process                              0.00%
cellular metabolic process                     0.00%
primary metabolic process                      0.00%
macromolecule metabolic process                0.00%
nitrogen compound metabolic process            0.00%
cellular component organization or biogenesis  0.00%
cellular component organization                0.00%
macromolecule modification                     0.00%

Platelet Traits                                                         FDR
response to stimulus                                                    4.00%
detection of chemical stimulus involved in sensory perception of smell  3.00%
detection of chemical stimulus involved in sensory perception           2.00%
sensory perception of smell                                             2.50%
detection of chemical stimulus                                          2.00%
detection of stimulus involved in sensory perception                    2.00%
smooth muscle cell migration                                            2.00%
sensory perception of chemical stimulus                                 1.75%
response to chemical                                                    1.56%
detection of stimulus                                                   1.80%

White Blood Cell Traits                 FDR
organic substance metabolic process     0.00%
primary metabolic process               0.00%
nitrogen compound metabolic process     0.00%
metabolic process                       0.00%
macromolecule metabolic process         0.00%
cellular process                        0.00%
cellular metabolic process              0.00%
localization                            0.00%
biological regulation                   0.00%
organic substance biosynthetic process  0.00%

Growth traits                              FDR
cellular process                           0.00%
regulation of biological process           0.00%
cell communication                         0.00%
biological regulation                      0.00%
regulation of cellular process             0.00%
signaling                                  0.00%
response to stimulus                       0.00%
regulation of cell communication           0.00%
regulation of signaling                    0.00%
positive regulation of biological process  0.00%

¹ False discovery rate.

Table 4.6 Ten most significantly enriched terms for function for each trait category.

Red Blood Cell Traits FDR¹ Platelet Traits FDR
binding 0.00% unannotated 0.00% protein binding 0.00% ion binding 0.00% catalytic activity 0.00% hydrolase activity 0.00% heterocyclic compound binding 0.00% organic cyclic compound binding 0.00% modified amino acid binding 0.00% anion binding 0.00% - -

White Blood Cell Traits FDR Growth traits FDR
serine-type endopeptidase activity 0.00% binding 0.00% serine-type peptidase activity 0.00% folic acid binding 0.00% serine hydrolase activity 0.00% protein binding 0.00% binding 0.00% insulin receptor binding 0.50% protein binding 0.00% amide binding 2.00% catalytic activity 0.00% modified amino acid binding 1.67% hydrolase activity 0.00% folic acid receptor activity 2.00% endopeptidase activity 0.00% voltage-gated sodium channel activity 1.75% regulatory region nucleic acid binding 0.22% protein-containing complex binding 1.56% transcription regulatory region DNA binding 0.20%

¹ False discovery rate.


Figure 4.1. Genetic (above diagonal) and phenotypic (below diagonal) correlations between traits. Traits included are: mean platelet volume (MPV), platelets (PLT), red blood cell distribution width (RDW), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular hemoglobin (MCH), mean corpuscular volume (MCV), hematocrits (HCT), hemoglobin content (HB), red blood cells (RBC), basophils (log) (Balog), eosinophils (EO), lymphocytes (LY), neutrophils (NE), white blood cells (WBC), average daily gain (ADG), adjusted weaning weight (adjWW), weaning weight (WW) and birth weight (BW). Average daily gain and adjusted weaning weight show a phenotypic correlation of 1 because these traits are a function of each other. The color gradient from blue to red represents negative to positive correlations and their strength.


Figure 4.2 Manhattan plot displaying one-megabase windows and the percentage of estimated genetic variance they account for along the genome. Labeled windows explain ≥ 0.5% of estimated genetic variance. Labels give the abbreviation of the trait for which the variance is explained at each window. Traits included are: mean platelet volume (MPV), platelets (PLT), red blood cell distribution width (RDW), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular hemoglobin (MCH), mean corpuscular volume (MCV), hematocrits (HCT), hemoglobin content (HB), red blood cells (RBC), basophils (log) (Balog), eosinophils (EO), lymphocytes (LY), neutrophils (NE), white blood cells (WBC), average daily gain (ADG), adjusted weaning weight (adjWW), weaning weight (WW) and birth weight (BW). Chromosome X is identified as chromosome 30.


Figure 4.3 Manhattan plot showing the percentage of estimated genetic variance explained by each 1-megabase (Mb) window for mean corpuscular hemoglobin (MCH). Labeled points explain ≥ 0.5% of the estimated genetic variance. Points highlighted in green have a posterior probability of inclusion (PPI) > 60%. The first number of each window label represents the chromosome where the window is located, and the number after the underscore represents the megabase at which the window starts; e.g., “6_25” represents a QTL on chromosome 6 encompassing the window from 25 to 26 Mb. Chromosome X is identified as chromosome 30.


Figure 4.4 Manhattan plot showing the percentage of estimated genetic variance explained by each 1-Mb window for monocytes (MO). Labeled points explain ≥ 0.5% of the estimated genetic variance. Points highlighted in green have a posterior probability of inclusion (PPI) > 60%. The first number of each window label represents the chromosome where the window is located, and the number after the underscore represents the megabase at which the window starts; e.g., “3_15” represents a QTL on chromosome 3 encompassing the window from 15 to 16 Mb. Chromosome X is identified as chromosome 30.


Figure 4.5 Manhattan plot showing the percentage of estimated genetic variance explained by each 1-Mb window for mean platelet volume (MPV). Labeled points explain ≥ 0.5% of the estimated genetic variance. Points highlighted in green have a posterior probability of inclusion (PPI) > 60%. The first number of each window label represents the chromosome where the window is located, and the number after the underscore represents the megabase at which the window starts; e.g., “27_16” represents a QTL on chromosome 27 encompassing the window from 16 to 17 Mb. Chromosome X is identified as chromosome 30.


CHAPTER 5. ESTIMATING BREED COMPOSITION FOR PIGS: A CASE STUDY FOCUSED ON MANGALITSA PIGS AND TWO METHODS

Josue Chinchilla-Vargas¹*, Francesca Bertolini², K. J. Stalder¹, J. P. Steibel³, M. F. Rothschild¹

¹ Iowa State University, Department of Animal Science, Ames, Iowa, 50011.
² National Institute of Aquatic Resources, Technical University of Denmark, 2800, Kgs. Lyngby, Denmark.
³ Department of Animal Science, Michigan State University, East Lansing, Michigan 48824.

Modified from a manuscript published in Livestock Science: 244, 104398

Abstract

Breed associations and registries maintain breed purity by enforcing certain conformational characteristics that define the breed and by cataloguing the pedigree of every animal in the registry. Furthermore, niche markets are often based on specialized products from heritage breeds, for which breed purity must be guaranteed. Genomic technology and the progressively lower cost of genotyping can help assess breed purity by estimating breed composition. In this research, genotypes from 648 pigs and 11 breeds were used to develop marker panels to estimate breed composition, with special emphasis on Mangalitsa pigs as a heritage breed. Two sets of panels were created. The first set was based on Fst scores calculated individually for the ~31,000 available markers across the pig genome; panels composed of the 10, 50, 100, 500 and 1000 markers with the highest Fst scores were generated. The second set was composed of randomly selected markers and had the same numbers of markers as the Fst-derived panels. Two statistical methods, linear regression and random forest, were then applied to the marker panels to estimate the breed composition of 107 pigs, including 47 individuals known to have Mangalitsa background. Fst-selected markers appeared to be better at identifying Mangalitsa individuals than random markers, regardless of the method used to estimate breed composition. However, random markers were more accurate at estimating breed composition for non-Mangalitsa individuals.

When the results were compared across methods, linear regression produced more accurate estimates of breed composition than random forest. However, both methods lacked accuracy when estimating breed composition for crossbred individuals. It must also be noted that these methods were focused on estimating the breed composition of Mangalitsa pigs; different markers should be selected if other breeds are the focus, and the accuracy of prediction will depend on the breeds available as references for the Fst calculations.

The results presented in this study allow us to conclude that: 1) Random forest was effective at classifying individuals into breeds, but not at estimating breed composition when compared to the linear regression method. 2) Markers filtered using Fst scores are more effective at identifying Mangalitsa breed composition but not as effective at identifying other breeds. 3) If Fst-filtered markers that are effective at distinguishing Mangalitsa from other breeds are used to estimate breed composition for individuals of other breeds, a greater number of markers is needed.

Keywords: Mangalitsa; Mangalica; Swine; Breed Composition; Random Forest; Linear Regression


Introduction

Livestock breeds have been developed through continuous natural and artificial selection over long periods of time, often with specific traits of interest as targets of selection, which have therefore become more prevalent in the population. Conserving the diversity of breeds with different traits and adaptations can play an important role in developing livestock adapted to specific climates and production systems (Hall and Bradley, 1995) and in meeting the increased demand for animal source foods expected in the coming decades (Nardone et al., 2010). However, in order to maintain between-breed diversity, it is important to maintain within-breed purity. Breed associations and registries maintain breed purity by enforcing certain conformational and performance characteristics and by cataloguing the pedigree of every animal that is approved for registry within that breed (Funkhouser et al., 2017).

Before the genomic era, a common method to screen for breed purity in white pigs, in addition to pedigree, was to perform test matings to determine whether white boars would sire only white progeny (Giuffra et al., 1999; Marklund et al., 1998). However, this procedure was time and resource demanding (Funkhouser et al., 2017); therefore, other methods using molecular data have been developed for multiple species (Bertolini et al., 2018, 2015; Funkhouser et al., 2017; Huang et al., 2014; Jacobs et al., 2018; Munoz et al., 2020; Schiavo et al., 2020). As a consequence of breed formation, population bottlenecks and within-breed selection for specific productive or adaptive traits, allele frequencies have changed and in some cases genetic variants have become fixed (Qanbari and Simianer, 2014). Therefore, the genetic heterogeneity present among populations and breeds makes genotypes at loci that have been under strong selection pressure more useful for estimating the breed composition of an individual (Gorbach et al., 2010; Kuehn et al., 2011).

In the present study, the focus was on the Mangalitsa breed of pigs, which originated in the Carpathian Basin of Hungary and Romania as a lard breed. Mangalitsa pigs are characterized by a hairy fleece similar to that of a sheep (Oroian and Petrescu-Mag, 2014). While hardy and producing meat and fat of desirable quality, animals of this breed tend to be slow growing (Nistor et al., 2012; Petrovic et al., 2010). Today, Mangalitsa breeders wish to maintain breed purity in order to develop specialized niche markets. In recent years, genomic tools, including several medium- and high-density commercial SNP chip panels, have been developed for several livestock species, including pigs (Nicolazzi et al., 2015). Normally, SNP chips include thousands of markers across the genome, and smaller panels can be generated by reducing the number of markers to address specific questions, such as evaluating the breed purity of individual animals. To identify the most discriminating markers among the thousands available in commercial SNP chips, several statistical approaches have been applied. Among these approaches, Fst analysis measures the standardized variance in allele frequencies among populations (Weir and Cockerham, 1984). This approach has been shown to be a simple and effective tool to identify informative genetic markers and population structure in humans and livestock species, including pigs (Bennasir et al., 2010; Bertolini et al., 2018, 2015; Bowcock et al., 1994; Hulsegge et al., 2013; Schiavo et al., 2020; Wilkinson et al., 2011).

These informative marker panels can be coupled with other techniques to classify or assign individuals to groups or breeds. Among those allocation tests, random forest (RF) is an algorithm used for classification and regression that is based on a large number of weakly correlated decision trees (Breiman, 2001; Chen and Ishwaran, 2012; Hastie et al., 2009). In this method, each decision tree is built using a bootstrap sample of the data set, and a random subset of all predictors is considered to determine the best split at each node. Therefore, all trees in a forest are different. For each tree, approximately one third of all observations are not included in the bootstrap sample; these observations are called out-of-bag (OOB) data. The OOB data are then used to estimate prediction accuracy. For a particular tree, each OOB observation is given an outcome prediction. The overall prediction for each individual is then obtained by counting the predictions over all trees for which the individual was out-of-bag, and the outcome with the most predictions is the individual's predicted outcome (Meng et al., 2009). Previous research has shown that RF can effectively assign breeds to individuals based on genotypes (Bertolini et al., 2018, 2015; Jacobs et al., 2018; Schiavo et al., 2020). Additionally, random forest can produce an estimate of the probability that an observation belongs to a specific class, which we argue can be interpreted as an estimate of breed composition. With this rationale, this study also evaluates the effectiveness of random forest at estimating breed composition.
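The out-of-bag voting described above can be sketched in a few lines of Python. This is a minimal illustration and not the randomForest implementation: the function name, the input layout (one list of predicted breeds per tree plus an in-bag indicator) and the toy votes are assumptions made for the example. The vote fractions it returns are one simple way to obtain class probabilities of the kind interpreted here as breed composition.

```python
from collections import Counter


def oob_summary(tree_votes, in_bag):
    """Summarize out-of-bag (OOB) votes for each individual.

    tree_votes[t][i]: breed predicted by tree t for individual i.
    in_bag[t][i]: True if individual i was in tree t's bootstrap sample.
    Returns one (predicted breed, vote fractions) tuple per individual.
    """
    n_trees = len(tree_votes)
    n_individuals = len(tree_votes[0])
    results = []
    for i in range(n_individuals):
        # Only trees that did not see individual i during training may vote.
        votes = Counter(
            tree_votes[t][i] for t in range(n_trees) if not in_bag[t][i]
        )
        total = sum(votes.values())
        if total == 0:
            results.append((None, {}))  # never out-of-bag: no OOB prediction
            continue
        predicted = votes.most_common(1)[0][0]
        fractions = {breed: count / total for breed, count in votes.items()}
        results.append((predicted, fractions))
    return results


# Toy example (hypothetical): 4 trees, 2 individuals.
tree_votes = [
    ["Mangalitsa", "Duroc"],
    ["Mangalitsa", "Duroc"],
    ["Duroc", "Duroc"],
    ["Mangalitsa", "Mangalitsa"],
]
in_bag = [
    [False, True],
    [False, False],
    [False, False],
    [True, False],
]
summary = oob_summary(tree_votes, in_bag)
print(summary[0][0], summary[1][0])  # Mangalitsa Duroc
```

In the toy data each individual is out-of-bag for three trees, so the vote fractions are two thirds for the winning breed and one third for the other.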

A second method, which has been successfully used to estimate breed composition in pigs (Funkhouser et al., 2017; Huang et al., 2014), was also evaluated. In this linear regression method, a test animal’s genotypes are regressed onto allele frequencies derived from reference animals (Funkhouser et al., 2017). Additionally, quadratic programming is used to impose linear constraints on the solution of the regression equation so that the estimate of each breed’s proportion is between 0 and 1.

The objectives of this research were to i) develop and compare approaches to identify a marker subset that would effectively identify pigs with sufficient Mangalitsa influence to be included in the herd registry; ii) evaluate the potential and accuracy of using the probability that random forest assigns to a pig of belonging to a given breed as a proxy for breed composition in purebred and crossbred pigs; and iii) compare the performance of the random forest and linear regression (Funkhouser et al., 2017) methods for estimating breed composition.

Materials and methods

Animal care and welfare

Animal care and use approval was not needed for this study because all data were sourced from existing databases and no live animals were used.

Animal genotype data sets

Genotypes of Duroc (n=111), Hampshire (n=102), Landrace (n=96) and Yorkshire (n=114) individuals genotyped with the PorcineSNP60 SNP chip were provided by the National Swine Registry (NSR). Genotypes from the Berkshire (n=44), Hereford (n=22), Large Black (n=3), Meishan (n=52) and Spotted (n=10) breeds were obtained from the USDA Meat Animal Research Center (USMARC) through the National Animal Germplasm Program genomic data request tool (https://agrin.ars.usda.gov/genomic_data_decision_tool_page_dev?language=EN&record_source=US); these genotypes were produced with the GGP PorcineHD array containing approximately 80,000 markers. Additionally, 23 Pietrain genotypes from a commercial genetics company were used. Mangalitsa genotypes were provided by the US Mangalitsa Breed Organization and Registry (MBOAR). Pietrain and Mangalitsa genotypes were produced with the GGP Porcine v1 array containing approximately 50,000 markers. The Mangalitsa data set included 96 individuals: 48 pure Mangalitsa animals having no grandparents in common, except for 2 individuals with one common grandparent (Group 1), and 48 individuals related to those in Group 1 (Group 2). Group 2 also included individuals with unknown ancestry that appeared to be pure (n=5) and 4 crossbred individuals ranging from 50% to 87.5% Mangalitsa based on pedigree information. According to the pedigree, these pigs were crosses of Mangalitsa with Red Wattle, Mulefoot and Large Black.

Genotypes were processed and formatted with SNPipeline (https://github.com/cbkmephisto/SNPipeline) and SNPware (https://github.com/josuechinchilla/SNPware). Because the genotypes were produced using three different marker panels, once all genotypes had been transformed to genotype matrices, the first step was to retain only the markers common to all panels, which reduced the number to ~32,000 markers distributed across all 18 autosomes plus chromosome X. Quality control (QC) was then performed using plink 1.7 (Purcell et al., 2007) to filter out individuals with a genotype call rate of less than 85% and markers with a call rate of less than 90%. After QC, 648 individuals and 31,089 markers were retained. Finally, before the downstream analyses, marker positions were updated to the Sus scrofa genome assembly version 11.1 (https://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.6/) using in-house scripts and the new marker coordinates provided by Neogen Genomics (Lincoln, Nebraska).
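The two call-rate filters applied during QC can be illustrated with a small Python sketch. It is not the plink procedure itself: the function names, the coding of missing calls as None, the order of the two filters (individuals first, then markers) and the toy matrix are assumptions made for the example.

```python
def call_rate(values):
    """Fraction of non-missing genotype calls (missing calls are None)."""
    return sum(v is not None for v in values) / len(values)


def quality_control(genotypes, min_individual_rate=0.85, min_marker_rate=0.90):
    """Filter a genotype matrix by call rate.

    genotypes: list of rows, one per individual, each a list of 0/1/2/None.
    Returns (indices of kept individuals, indices of kept markers).
    """
    kept_individuals = [
        i for i, row in enumerate(genotypes)
        if call_rate(row) >= min_individual_rate
    ]
    if not kept_individuals:
        return [], []
    n_markers = len(genotypes[0])
    # Marker call rates are computed on the individuals that passed the filter.
    kept_markers = [
        j for j in range(n_markers)
        if call_rate([genotypes[i][j] for i in kept_individuals]) >= min_marker_rate
    ]
    return kept_individuals, kept_markers


# Toy matrix (hypothetical): 21 individuals x 20 markers, all calls present...
genotypes = [[0] * 20 for _ in range(21)]
genotypes[0][:5] = [None] * 5      # individual 0: call rate 0.75, removed
for i in (1, 2, 3):
    genotypes[i][3] = None         # marker 3: call rate 17/20 = 0.85, removed
genotypes[4][7] = None             # marker 7: call rate 19/20 = 0.95, kept

kept_individuals, kept_markers = quality_control(genotypes)
print(len(kept_individuals), len(kept_markers))  # 20 19
```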

After QC, the dataset was divided into a training population and a validation population. The training population was used for SNP selection (Fst analysis), to train and cross-validate models (in the case of random forest) and to calculate allele frequencies for each breed (in the case of linear regression), while breed composition was estimated on the validation population. For Mangalitsa, individuals in Group 2 and a random set of non-Mangalitsa pigs were selected as the validation population. For the Duroc, Hampshire, Landrace and Yorkshire breeds, 10 pigs of each breed were randomly chosen in order to use approximately 10% of the available individuals as validation. Because only a small number of Hereford and Pietrain genotypes were available, only 5 pigs of each of these breeds were randomly chosen for validation in order to keep enough pigs of these breeds in the training population. In a similar manner, 10 Berkshire individuals were randomly assigned to the validation set. Additionally, Large Black and Spotted samples were not used for downstream analyses because of the limited sample sizes. Likewise, because a number of markers with high Fst scores were not successfully genotyped in Meishan individuals, the Meishan breed was used only for Fst calculation. Table 5.1 presents the number of individuals from each breed used to calculate Fst and those used for training and validation.

Determining marker subsets for analyses

All purebred Mangalitsa (Group 1) and all purebred individuals from the other breeds in the training group were used to calculate Fst scores. The Fst for each marker was calculated between two populations with Plink 1.9 (Purcell et al., 2007): the allele frequencies of Mangalitsa pigs from Group 1 were compared against the allele frequencies of all the other breeds combined, as done by Zsolnai et al. (2013).
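To illustrate what a per-marker Fst between two populations measures, the Python sketch below uses the simple heterozygosity-based definition, Fst = (HT − HS) / HT, with equal weight given to the two populations. This is a simplification chosen for clarity and is not the Weir and Cockerham (1984) estimator, which also corrects for sample size. The function names and genotypes are hypothetical.

```python
def allele_frequency(genotypes):
    """Frequency of the counted allele from genotypes coded 0, 1 or 2."""
    return sum(genotypes) / (2 * len(genotypes))


def fst_two_populations(p1, p2):
    """Heterozygosity-based Fst for one biallelic marker and two populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)            # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                               # marker fixed in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# Hypothetical genotypes at one marker (copies of the counted allele).
target_breed = [2, 2, 2, 1, 2]      # allele frequency 0.9
other_breeds = [0, 0, 1, 0, 0]      # allele frequency 0.1
fst = fst_two_populations(
    allele_frequency(target_breed), allele_frequency(other_breeds)
)
print(round(fst, 2))  # 0.64
```

A marker fixed for different alleles in the two groups gives Fst = 1, and a marker with identical frequencies gives Fst = 0.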

Once Fst was calculated for each marker, 5 panels were created using the 10, 50, 100, 500, and 1000 markers with the highest Fst scores. In order to objectively compare the accuracy for the panels selected with Fst, a second set of panels obtained using the same training group and with the same number of markers as the Fst panels was created by randomly selecting markers across the genome.
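The construction of the two sets of panels can be sketched as follows. The marker names and Fst scores are randomly generated placeholders, and the function names and random seeds are assumptions for the example; only the panel sizes come from the text.

```python
import random


def top_fst_panel(fst_by_marker, size):
    """Return the `size` marker names with the highest Fst scores."""
    ranked = sorted(fst_by_marker, key=fst_by_marker.get, reverse=True)
    return ranked[:size]


def random_panel(marker_names, size, seed=2021):
    """Return `size` marker names drawn at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(sorted(marker_names), size)


# Hypothetical Fst scores for 2,000 made-up marker names.
score_rng = random.Random(1)
fst_by_marker = {f"marker_{i}": score_rng.random() for i in range(2000)}

panel_sizes = (10, 50, 100, 500, 1000)
fst_panels = {n: top_fst_panel(fst_by_marker, n) for n in panel_sizes}
random_panels = {n: random_panel(fst_by_marker, n) for n in panel_sizes}
print([len(fst_panels[n]) for n in panel_sizes])  # [10, 50, 100, 500, 1000]
```

Because the Fst panels are nested (the top 10 markers are also in the top 50, and so on), differences in accuracy between Fst panels reflect only the added markers.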

Linkage disequilibrium (LD) between selected SNPs was moderate (r² > 0.25) in only a few pairs of markers, indicating that most of the selected SNPs captured different fractions of the variance. Although marker pruning based on LD is often performed before further analyses when designing custom marker panels, this was deemed unnecessary because decision trees in random forest are built with markers chosen at random from the available set to minimize correlations between trees, which limits the effect of LD between markers on prediction accuracy. In the case of the linear regression method, we followed the methods used by Funkhouser et al. (2017) and no LD filter was applied.
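A pairwise LD screen of this kind can be sketched in Python by taking r² as the squared Pearson correlation between the allele counts of two markers. This is a common approximation for unphased genotypes, but how r² was computed in this study is not stated, so the approach, the function names and the toy genotypes are all assumptions.

```python
from itertools import combinations


def r_squared(g1, g2):
    """Squared Pearson correlation between two vectors of allele counts."""
    n = len(g1)
    mean1, mean2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        return 0.0  # a monomorphic marker carries no LD information
    return cov * cov / (var1 * var2)


def pairs_above_threshold(genotypes_by_marker, threshold=0.25):
    """Return marker pairs whose r-squared exceeds the threshold."""
    return [
        (m1, m2)
        for m1, m2 in combinations(sorted(genotypes_by_marker), 2)
        if r_squared(genotypes_by_marker[m1], genotypes_by_marker[m2]) > threshold
    ]


# Hypothetical allele counts for three markers in four animals.
genotypes_by_marker = {
    "snp_a": [0, 0, 2, 2],
    "snp_b": [0, 0, 2, 2],   # identical to snp_a: r-squared of 1
    "snp_c": [0, 2, 0, 2],   # uncorrelated with snp_a: r-squared of 0
}
print(pairs_above_threshold(genotypes_by_marker))  # [('snp_a', 'snp_b')]
```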

Additionally, the location of the 10 markers with the highest Fst scores was examined to identify quantitative trait loci (QTL) and genes located within 0.5 megabases (Mb) upstream or downstream of each marker, using the NCBI genome browser (https://www.ncbi.nlm.nih.gov/genome/gdv/) and release 41 of QTLdatabase (https://www.animalgenome.org/cgi-bin/QTLdb/index).
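The positional search amounts to an interval overlap test, sketched below in Python. The search in this study was done through the genome browser and the QTL database, so the function is only an illustration of the logic. The function name, the tuple layout and the QTL coordinates are hypothetical; the marker position is the chromosome 4 marker at 97,376,691 reported in the Results.

```python
def qtl_near_marker(marker_chr, marker_pos, qtl_list, window_bp=500_000):
    """Return names of QTL overlapping marker_pos +/- window_bp on marker_chr.

    qtl_list: iterable of (name, chromosome, start, end) tuples.
    """
    lower = marker_pos - window_bp
    upper = marker_pos + window_bp
    return [
        name
        for name, chrom, start, end in qtl_list
        if chrom == marker_chr and start <= upper and end >= lower
    ]


# Hypothetical QTL records; the marker position is taken from the text.
qtl_list = [
    ("qtl_a", "4", 97_000_000, 97_100_000),   # inside the 0.5 Mb window
    ("qtl_b", "4", 98_000_000, 98_500_000),   # same chromosome, too far away
    ("qtl_c", "7", 97_300_000, 97_400_000),   # different chromosome
]
print(qtl_near_marker("4", 97_376_691, qtl_list))  # ['qtl_a']
```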

Breed composition analyses

Two methods were implemented to determine breed composition: a machine learning approach using the random forest algorithm and a linear regression method.

Random forest

Random forest was implemented using the R package randomForest (Liaw and Wiener, 2002). Each marker panel was used to predict breed composition using 500 trees, and the number of predictors tried at each split was set to the square root of the number of markers in the SNP panel; both are the default settings of the algorithm. Table 5.2 shows the number of predictors used for each panel. Additionally, as part of the random forest algorithm, the approximately one third of training individuals left out of each tree's bootstrap sample served as an internal cross-validation set to calculate breed prediction accuracy in terms of OOB error. OOB error is computed by taking, as the predicted value for the ith observation, the most frequent predicted class among the trees that were not fit using that observation, and it is a valuable tool to estimate accuracy of prediction (Bertolini et al., 2015; Hastie et al., 2009). Random forest produced two results: 1) it assigned a breed to each individual and 2) it generated probabilities for each pig to be of each of the breeds present in the reference data.
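The two default quantities mentioned above can be worked out with a short Python sketch. It assumes that the square root is rounded down, as in the classification default of the randomForest package, and uses a hypothetical training set of 1,000 individuals to show why roughly one third of the individuals are out-of-bag for any given tree. The function names are illustrative.

```python
import math


def predictors_per_split(n_markers):
    """Square root of the number of markers, rounded down (assumed rounding)."""
    return max(1, math.isqrt(n_markers))


def expected_oob_fraction(n_individuals):
    """Probability that an individual is absent from a bootstrap sample
    of size n drawn with replacement from n individuals."""
    return (1.0 - 1.0 / n_individuals) ** n_individuals


panel_sizes = (10, 50, 100, 500, 1000)
mtry_by_panel = {size: predictors_per_split(size) for size in panel_sizes}
print(mtry_by_panel)  # {10: 3, 50: 7, 100: 10, 500: 22, 1000: 31}

# Hypothetical training set size; the fraction approaches 1/e (about 0.368).
print(round(expected_oob_fraction(1000), 3))  # 0.368
```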

Linear regression

The linear regression method was implemented through the R package breedTools (https://github.com/funkhou9/breedTools). With this method, a test animal’s genotypes are regressed onto allele frequencies derived from the reference set of animals (Funkhouser et al., 2017). Additionally, quadratic programming is used to place a set of linear constraints on the solution of the regression equation so that the estimate for each breed is between 0 and 1. Exact details are explained in “Estimation of genome-wide and locus specific breed composition in pigs” (Funkhouser et al., 2017).
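The idea of a constrained regression of genotypes on reference allele frequencies can be illustrated with a small, dependency-free Python sketch. It is not the breedTools implementation: projected gradient descent is swapped in for quadratic programming, genotypes are assumed to be coded as allele counts divided by 2, the breed proportions are assumed to sum to 1 in addition to lying between 0 and 1, and the allele frequencies are made up.

```python
def project_to_simplex(v):
    """Euclidean projection onto {b : b_k >= 0, sum(b) = 1}."""
    cumulative, theta = 0.0, 0.0
    for j, value in enumerate(sorted(v, reverse=True), start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def estimate_breed_composition(y, X, n_iter=2000):
    """Minimize ||y - Xb||^2 subject to b >= 0 and sum(b) = 1.

    y: genotypes of one animal (allele count / 2), one value per marker.
    X: reference allele frequencies, one row per marker, one column per breed.
    """
    n_markers, n_breeds = len(X), len(X[0])
    # Conservative step size: 1 / (2 * squared Frobenius norm of X).
    step = 1.0 / (2.0 * sum(x * x for row in X for x in row))
    b = [1.0 / n_breeds] * n_breeds
    for _ in range(n_iter):
        residual = [
            sum(X[i][k] * b[k] for k in range(n_breeds)) - y[i]
            for i in range(n_markers)
        ]
        gradient = [
            2.0 * sum(X[i][k] * residual[i] for i in range(n_markers))
            for k in range(n_breeds)
        ]
        b = project_to_simplex([b[k] - step * gradient[k] for k in range(n_breeds)])
    return b


# Hypothetical allele frequencies for 6 markers in 3 reference breeds.
X = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.5, 0.5, 0.5],
]
# Expected genotypes of a cross that is half breed 1 and half breed 2.
y = [0.5, 0.5, 0.0, 1.0, 0.5, 0.5]
composition = estimate_breed_composition(y, X)
print([round(value, 3) for value in composition])  # [0.5, 0.5, 0.0]
```

In the toy example the test genotypes match a 50:50 mixture exactly, so the constrained solution recovers those proportions; with real genotypes the residual is not zero and the estimates are approximate.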

Results and discussion

Marker subsets for analyses

Table 5.3 shows the number of markers per chromosome for all marker panels used in this study, for both Fst-filtered and randomly chosen markers. When 10 Fst-selected markers were used, three markers were located on chromosome 1, two on chromosome 2, and one on each of chromosomes 4, 7, 10, 16 and 17. When 10 random markers were used, two markers were located on each of chromosomes 1, 4 and 14, while one marker was located on each of chromosomes 2, 3, 7 and X. When 50 markers were used, markers were located on all chromosomes except 12 and X for Fst-filtered markers and except 5, 6, 10 and 12 for random markers. All chromosomes except 12 and X were represented in the panel composed of 100 Fst-filtered markers, while in the panel composed of random markers only chromosome 17 was not represented. In the panels composed of 500 and 1000 markers, all chromosomes were represented regardless of the marker selection strategy. Table 5.4 shows the chromosome, position and gene in which each of the 10 markers with the highest Fst scores is located, along with the number of QTL located within 0.5 Mb of each marker. Fst scores for these markers ranged from 0.74 to 0.82. Only two of the ten markers were not located within genes. Even though the marker selection strategy used in the present study was focused on the Mangalitsa breed, one of the selected markers was in a gene previously detected by Schiavo et al. (2019) as one of the most discriminating regions across commercial European pig breeds. This gene is PDE7B, which belongs to a gene family that has been linked to meiotic resumption of mammalian oocytes (Gupta et al., 2017).

It must be noted that all 10 markers were within 0.5 Mb upstream or downstream of at least one QTL. The marker located on chromosome 4 at position 97,376,691 had 38 QTL within 0.5 Mb upstream or downstream. The QTL near this marker were associated with adipocyte diameter, average daily gain (ADG), last-rib backfat, harvest body weight and carcass length, among other traits. In total, 67 QTL were near the 10 markers. The full list of QTL in proximity to the 10 markers with the highest Fst scores and their details are shown in Table 5.6.

Previous research followed a similar marker selection strategy to identify Mangalitsa pigs and test parentage (Zsolnai et al., 2013) and identified 24 markers that were accurate at differentiating among Mangalitsa coat colors and between Mangalitsa and commercial white pigs. None of the 24 markers reported were represented in our Fst-filtered panels. This may be explained by the difference in the breeds used to calculate the Fst scores and the difference in the objectives of the research.

Random forest

The OOB errors for each breed, along with the average OOB error, when 10 random and 10 Fst-filtered markers were used are shown in panel A of Figure 5.1. When 10 random markers were used, OOB error was distributed among all breeds, with the Hereford breed having the highest error (55%), followed by Mangalitsa (41%). The lowest OOB error was produced for Duroc (9%). When Fst-filtered markers were used, Berkshire had the highest OOB error (52%), followed by Duroc (51%). As expected, when Fst-filtered markers were used, Mangalitsa showed an OOB error of 0%, as Fst-filtered markers were selected to be the most discriminating for the Mangalitsa breed.
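Per-breed and overall OOB error rates of the kind reported here can be computed from the true breeds and the OOB-predicted breeds, as in the Python sketch below. The function name and the ten toy animals are hypothetical, and the overall value is taken as misclassified individuals divided by all individuals, which is an assumption about how the average was obtained.

```python
from collections import defaultdict


def oob_error_by_breed(true_breeds, oob_predictions):
    """Return (error rate per breed, overall error rate)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for truth, predicted in zip(true_breeds, oob_predictions):
        totals[truth] += 1
        if predicted != truth:
            errors[truth] += 1
    per_breed = {breed: errors[breed] / totals[breed] for breed in totals}
    overall = sum(errors.values()) / len(true_breeds)
    return per_breed, overall


# Hypothetical training animals and their OOB-predicted breeds.
true_breeds = ["Mangalitsa"] * 4 + ["Duroc"] * 4 + ["Hereford"] * 2
oob_predictions = (
    ["Mangalitsa"] * 4
    + ["Duroc", "Duroc", "Duroc", "Hereford"]
    + ["Duroc", "Hereford"]
)
per_breed, overall = oob_error_by_breed(true_breeds, oob_predictions)
print(per_breed)  # {'Mangalitsa': 0.0, 'Duroc': 0.25, 'Hereford': 0.5}
print(overall)    # 0.2
```

The toy data also show why a small breed can have a high error rate from a single misclassification while contributing little to the overall error.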

When 50 random markers were used, the overall OOB error dropped to 2% for the

Berkshire. Duroc and Hampshire breeds had an OOB error of 0% and Hereford having the highest OOB error of 22%, while the Mangalitsa breed had an OOB error of 4.5%. When 50 Fst- filtered markers were used the only breeds that had an OOB error greater than 0% were

Berkshire, Landrace, Yorkshire and Hampshire with 16.0%, 7.1%, 4.9% and 3.1%, respectively.

Interestingly, all breeds that tend to show red pigmentation showed an OOB error of 0.0%, likely due to the recessive nature of red from MC1R. Given that three of the 50 markers with the highest Fst score were located on chromosome 6, five markers on chromosome 8 and two markers on chromosome 17, this might be explained by markers in linkage disequilibrium (LD) with KIT, MC1R or ASIP, which are known to control coat color in pigs (Drögemüller et al., 2006; Kijas et al., 1998; Marklund et al., 1998). However, none of the markers were located within 0.5 Mb upstream or downstream of the three previously mentioned genes. Markers located on chromosome 6 included DIAS0004358, MARC0094560 and ASGA0029651, which were located 47.5, 49.7 and 144.5 Mb downstream of MC1R, respectively. Markers H3GA0024473, H3GA0024862, ALGA0124437, DRGA0008818 and H3GA0025673 were located 22 Mb upstream and 6.5, 35.5, 80.6 and 92.6 Mb downstream of KIT, respectively. Markers located on chromosome 17 were ALGA0092829 and ASGA0093216, which were located 32.6 and 10 Mb upstream of ASIP, respectively. Random forest OOB error percentages for all breeds and the overall average when 50, 100 and 500 markers were used across selection strategies are shown in Supplementary Figure 5.6.
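The 0.5 Mb proximity check described above can be sketched as follows. The gene and marker coordinates are placeholders chosen for illustration (they are not the real positions of MC1R, KIT or ASIP), and the function names are hypothetical.

```python
WINDOW_BP = 500_000  # 0.5 Mb window, as used in the text


def distance_to_gene(marker_bp, gene_start, gene_end):
    """Distance in bp from a marker to the nearest gene boundary (0 if inside)."""
    if gene_start <= marker_bp <= gene_end:
        return 0
    return min(abs(marker_bp - gene_start), abs(marker_bp - gene_end))


def is_within_window(marker_bp, gene_start, gene_end, window=WINDOW_BP):
    """True when the marker lies inside the gene or within `window` bp of it."""
    return distance_to_gene(marker_bp, gene_start, gene_end) <= window


# Placeholder coordinates on one chromosome (assumption, not real positions).
gene_start, gene_end = 1_000_000, 1_050_000
for marker_bp in (1_020_000, 1_400_000, 48_550_000):
    dist_mb = distance_to_gene(marker_bp, gene_start, gene_end) / 1_000_000
    print(marker_bp, f"{dist_mb:.2f} Mb", is_within_window(marker_bp, gene_start, gene_end))
```

A marker tens of megabases away fails this check even though it sits on the same chromosome, which is the situation reported for all markers near the three coat color genes.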

As expected, as the number of markers increased, the OOB error decreased, reaching less than 1% for both marker selection strategies when 100 markers were used and 0.0% when 500 and 1000 markers were used to estimate breed composition. When 100 Fst-filtered markers were used, the Duroc, Hampshire, Hereford, Mangalitsa and Pietrain breeds had an OOB error of 0.0%, and the Berkshire, Landrace and Yorkshire breeds had an OOB error of ~5%. In the case of random markers, all breeds had an OOB error of 0.0% except for Hampshire and Mangalitsa, which had an OOB error of ~1.8% and ~9.0%, respectively. These results are in line with the findings from Schiavo et al. (2020), who observed an OOB error rate of 0.79% when using a subset of 96 markers to estimate the breed composition for pigs from 7 Italian breeds.

Panel B of Figure 5.1 shows the OOB error for all marker panels when random forest was used to calculate breed composition. Results for breed composition estimation using random forest showed the same trend independent of the method used to select markers. When 10 markers were used, the OOB error was 33% and 25% for markers with the highest Fst score and random markers, respectively.
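One way a classifier such as random forest can yield a breed composition percentage is through the share of trees that vote for each breed. The sketch below assumes that reading; the vote counts are invented for illustration and the function name is hypothetical.

```python
from collections import Counter


def vote_fractions(tree_votes):
    """Fraction of trees voting for each breed for a single animal."""
    counts = Counter(tree_votes)
    n_trees = len(tree_votes)
    return {breed: count / n_trees for breed, count in counts.items()}


# Hypothetical votes from a 100-tree forest for one animal (assumption).
votes = ["Mangalitsa"] * 63 + ["Duroc"] * 25 + ["Berkshire"] * 12

fractions = vote_fractions(votes)
assigned_breed = max(fractions, key=fractions.get)  # hard classification
print(assigned_breed, fractions)
```

The hard assignment only needs the largest fraction to be correct, whereas using the fractions as a composition estimate requires every fraction to be well calibrated, which is a stricter demand on the model.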

As shown in panel C of Figure 5.1, when evaluating the known purebred Mangalitsa individuals from the validation data set using random forest, panels that contained Fst-filtered markers estimated greater Mangalitsa breed composition independently of the number of markers. Once again, this is probably because of the bias in how the Fst scores were calculated. Across panels, random markers estimated an average Mangalitsa breed composition of 63% while Fst-filtered markers estimated 85%. Intriguingly, the greatest difference in estimated Mangalitsa breed composition between marker selection strategies was observed when only 10 markers were used, with a difference of 35% favoring the Fst-filtered markers. The average Mangalitsa breed composition estimated across panels was 51% with random markers and 89% with Fst-filtered markers. It is also worth noting that the average Mangalitsa breed composition trends differed depending on whether markers were selected at random or based on the Fst scores. Panels that contained markers selected at random produced greater average Mangalitsa breed composition estimates as the number of markers in the panel increased from 10 to 5000, ranging from 54% to 69% (data not shown), respectively. As the number of random markers increases, the chance of including markers with high predictive power increases and, therefore, more accurate estimates can be produced. On the other hand, panels composed of markers chosen based on Fst scores produced lower average Mangalitsa breed composition estimates as the marker number increased, going from 89% to 82% when the panel contained 10 and 5000 markers, respectively. Adding markers with a lower Fst score likely adds noise to the data and lowers its predictive power.

Even though the greatest Mangalitsa percentages were obtained when 10 Fst-filtered markers were used, 3 individuals that are known to be pure had an estimated Mangalitsa breed composition ranging from 36% to 48%. This shows that such low marker numbers should not be used because, as the number of markers decreases, the possibility that a few animals are not assigned correctly due to atypical genotypes increases (Hulsegge et al., 2013). For the other breeds included in the validation set, breed percentages increased as the marker number increased independently of the marker selection strategy used (data not shown), although panels with fewer than 100 markers performed better when composed of random markers.

Figure 5.2 shows the average Mangalitsa breed composition for crossbred individuals and individuals with unknown Mangalitsa breed composition in the validation data set. In the case of individuals with known Mangalitsa breed composition percentage, the most accurate estimates were produced when panels of 100 Fst-filtered markers were used and, overall, predictions were more accurate when Fst-filtered markers were used. In the case of the individual that is known to have a composition of 50% Mangalitsa, the estimate was 65.4%, while for the three individuals known to have 75% Mangalitsa breed composition, the average composition estimate was 59.6% when 100 Fst-filtered markers were used. When 100 random markers were used, the estimated Mangalitsa breed composition was 63.0% for the individual known to be 50% Mangalitsa and 28.2% for the individuals that were known to be 75% Mangalitsa. Another interesting finding was that the breed composition estimates for individuals known to be 75% Mangalitsa were similar when produced with panels of 500 markers or more, independently of the marker selection strategy used, whereas for the individual known to be 50% Mangalitsa, the estimates stabilized once 1000 or more markers were used. It is important to note that even though the overall OOB error was 0%, the Mangalitsa breed composition estimates were not as accurate for crossbred pigs, likely because these pigs were produced by crossing Mangalitsa to very rare breeds that were not included (Red Wattle and Mulefoot) or were represented by a very small number of individuals (Large Black, n = 3) in the reference data set, which reduces estimation accuracy.

For the five individuals in the validation data set that looked pure but had unknown pedigree, when 10 markers were used, their average estimated Mangalitsa breed composition was 99% and 68% with Fst-filtered markers and random markers, respectively. When 100 Fst-filtered markers were used, the average estimated Mangalitsa breed composition was 96%, with the lowest being 92%, while random markers estimated an average Mangalitsa breed composition of 59%, ranging from 51% to 72%. Based on the results of random forest, these individuals were assumed to be pure Mangalitsa.

Breed composition estimates for non-Mangalitsa animals in the validation set showed that, in smaller panels, random markers performed better for non-Mangalitsa individuals (data not shown). The average estimated composition for the known non-Mangalitsa individuals was 56% when 10 random markers were used; when Fst-filtered markers were used, the estimated composition dropped to 44%. Intriguingly, random markers produced greater breed composition estimates for all breeds except for the Hereford breed. Hereford individuals were estimated to have a Hereford composition of 52% with random markers, while Fst-filtered markers estimated the Hereford composition to be 91%. As the number of markers used increased, the average breed composition estimate increased for both marker selection strategies. However, as the number of markers increased, the difference in percentage became smaller. When 500 markers were used, the average breed composition estimates were 78% with random markers and 75% with Fst-filtered markers, and when 1000 markers were used, Fst-filtered markers estimated an average breed composition of 77%, the same as the 77% estimate obtained when random markers were used.

Linear regression

As shown in Figure 5.3, when the linear regression method was implemented, the coefficient of determination (R²) averaged 61% for panels with markers selected using Fst score and 56% when markers were chosen at random. Interestingly, the average R² for pigs in the validation data set decreased as the marker number increased from 10 to 50, independently of the marker selection strategy, and then stabilized at ~60% for markers selected using Fst scores and ~53% when markers were chosen at random. The highest R² values for both marker selection strategies were obtained when only 10 markers were used to calculate breed composition: 78% for Fst-selected markers and 66% for randomly selected markers. The increase in the error rate as the number of markers increases indicates that adding more terms to the regression model reduces model fit. However, the relatively high error values may be a consequence of regressing the animal genotype coded as discrete values onto allele frequencies coded as continuous values (Funkhouser et al., 2017).
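The idea of regressing a discrete genotype onto continuous breed allele frequencies can be sketched with ordinary least squares. This is a simplified illustration under stated assumptions, not necessarily the exact procedure used in the study: there is no intercept, the coefficients are not constrained to be non-negative or to sum to one, R² is computed against the mean of the response, and all allele frequencies and genotypes are invented.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def breed_composition(genotypes, allele_freqs):
    """Regress one animal's genotypes on breed allele frequencies.

    genotypes: allele counts (0, 1 or 2), one per marker.
    allele_freqs: one row per marker, one allele frequency per reference breed.
    Returns (coefficients, R^2); coefficients are read as breed proportions.
    """
    y = [g / 2 for g in genotypes]  # rescale counts to the 0-1 frequency scale
    k = len(allele_freqs[0])
    xtx = [[sum(row[i] * row[j] for row in allele_freqs) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(allele_freqs, y)) for i in range(k)]
    coefs = solve_linear_system(xtx, xty)
    fitted = [sum(c * f for c, f in zip(coefs, row)) for row in allele_freqs]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coefs, 1 - ss_res / ss_tot


# Hypothetical frequencies for two reference breeds at six markers (assumption).
allele_freqs = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7],
                [0.95, 0.5], [0.1, 0.6], [0.7, 0.2]]
genotypes = [2, 2, 0, 2, 0, 1]  # hypothetical animal

coefs, r2 = breed_composition(genotypes, allele_freqs)
print([round(c, 3) for c in coefs], round(r2, 3))
```

Because the response can only take the values 0, 0.5 and 1 while the predictors vary continuously, some residual error remains even for a purebred animal, which is consistent with the explanation offered for the moderate R² values.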

Previous research used the same linear regression method to estimate breed composition for purebred Yorkshire pigs using approximately 8000 markers and reported that 95% of the pigs had a Yorkshire composition of at least 82% (Funkhouser et al., 2017). In our case, 95% of the known purebred Mangalitsa individuals were estimated to have a Mangalitsa breed composition of more than 80% when using as few as 50 Fst-filtered markers. For the known purebred animals from other breeds, the percentage of individuals that were estimated to have a breed composition of more than 80% for their known breed ranged from 20% to 70%, with the only exception being the Berkshire breed, where all individuals were estimated at more than 80%. Such a large difference in predictive power among breeds highlights the usefulness of filtering markers with methods such as the Fst when the goal is to accurately predict one specific breed. However, increasing the marker number reduces this difference in accuracy between the target breed and the other breeds. For example, when using 5000 Fst-filtered markers (data not shown), only 4 of the 105 animals in the validation data set were estimated to be composed of less than 80% of their known breed. Table 5.5 shows the breed composition percentages for the different breeds in the validation data set along with the number and percentage of animals that were estimated to have a breed composition of less than 80% for their known breed.
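The 80% threshold summary used in this comparison can be written as a short helper. The estimates below are invented for illustration, and the function name is hypothetical.

```python
def below_threshold(estimates, threshold=0.80):
    """Count and percentage of animals whose estimated composition for
    their known breed falls below `threshold`."""
    n_below = sum(1 for e in estimates if e < threshold)
    return n_below, 100 * n_below / len(estimates)


# Hypothetical estimated proportions for ten purebred animals (assumption).
estimates = [0.95, 0.91, 0.78, 0.88, 0.62, 0.99, 0.85, 0.93, 0.81, 0.97]

n_below, pct_below = below_threshold(estimates)
print(n_below, pct_below)
```

Reporting both the count and the percentage matters for small validation groups, where one or two animals shift the percentage substantially.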

When focusing on estimating Mangalitsa breed composition for known purebred Mangalitsa pigs from the validation data set, the average Mangalitsa percentage estimated was greater than that obtained when using random forest, independently of the marker selection strategy and the marker number. Fst-selected markers estimated purebred animals to have 93% average Mangalitsa breed composition, while random markers produced an 85% estimate. Figure 5.4 shows the average Mangalitsa breed composition for purebred animals across marker selection strategies and marker numbers when breed composition was predicted using the linear regression method.

The average Mangalitsa breed composition for crossbred animals across marker selection strategies and marker numbers when breed composition was predicted using the linear regression method is shown in Figure 5.5. The regression model tended to overestimate the Mangalitsa breed composition for a crossbred individual that was known to be ½ Mangalitsa, especially when Fst-filtered markers were used. This is in line with what Funkhouser et al. (2017) found when using the same method. When estimating breed composition for crossbred animals with breeds that were not included in the reference data, they found that prediction was considerably biased toward the breed that was included in the reference panel. Surprisingly, for the individual that was known to be ¾ Mangalitsa, Mangalitsa breed composition estimates were accurate when using 50 and 100 Fst-filtered markers, at 75% and 74%, respectively. These results were unexpected given that the individual is a known crossbred of Mangalitsa and other breeds not included in the reference population. Nonetheless, the Mangalitsa percentage was slightly overestimated when all other Fst-filtered marker sets were used. When random markers were used to estimate the breed composition for this individual, the Mangalitsa proportion was underestimated for all marker sets, again contradicting what previous research has reported and what was observed for the ½ Mangalitsa individual.

The five individuals in the validation data set that appeared pure but had unknown pedigree had an average Mangalitsa breed composition estimate of 99% and 75% with Fst-filtered markers and random markers, respectively, when 10 markers were used. When 100 Fst-filtered markers were used, the average estimated Mangalitsa breed composition was 99%, with the lowest being 96%, while random markers estimated an average Mangalitsa breed composition of 95%, ranging from 89% to 100%. As was the case when random forest was used, these individuals were assumed to be pure Mangalitsa with the regression method. The average regression coefficients for each breed across all panels are shown in Table 5.7.

Comparing methods for Mangalitsa breed composition estimation.

When the results of random forest estimations were compared with those of the linear regression method, it was clear that linear regression produced better estimates of breed composition than random forest. As shown in panel C of Figure 5.1 and in Figure 5.4, linear regression estimated a greater percentage of Mangalitsa in purebred Mangalitsa animals than random forest, independently of the number of markers and the marker selection strategy. When using Fst-selected markers, linear regression estimated purebred Mangalitsa to be 86.7% Mangalitsa while random forest estimated 77.95% when results were averaged across panels with different numbers of markers.

Even though previous research has shown that random forest is a valid method to infer the breeds or populations of individuals (Bertolini et al., 2015; Jacobs et al., 2018; Schiavo et al., 2020), those studies focused on assigning a breed to each individual rather than estimating the proportion of different breeds making up each individual. When used in this manner, random forest represents a valuable tool, as shown by the very low OOB errors obtained when using 100 or more Fst-filtered markers in the present study. However, given that one of the objectives of this work was to develop a subset of markers that would effectively identify pigs with sufficient Mangalitsa breed influence to be included in the breed registry, it is imperative to accurately estimate the proportion of each breed that makes up the individuals that are tested. As discussed previously, the only accurate breed composition estimates for known crossbred Mangalitsa individuals were produced when using 50 and 100 Fst-filtered markers with the linear regression method. The poor accuracy from both methods likely results from the fact that those individuals were crosses of Mangalitsa and breeds that were not included in the reference panel. However, it has been shown that if all the needed breeds are included in the reference panel, the correlation between known breed proportions and estimated breed proportions is very strong (Funkhouser et al., 2017), indicating that breed composition prediction is indeed accurate.

When evaluating the five individuals that had an unknown pedigree but were presumed to be pure Mangalitsa, it is clear that the linear regression method outperformed random forest, even though these individuals can be considered purebred Mangalitsa based on the results of both methods. The Mangalitsa breed composition percentages were consistently greater when the linear regression method was used, and it is remarkable that 100 random markers produced estimates of ~95% through the linear regression method while random forest estimated a composition of ~58% using the same markers as reference. Finally, it must be highlighted that the pedigrees are assumed to be without error and that pedigree-based Mangalitsa breed composition percentages represent a theoretical breed composition that is expected, but the realized Mangalitsa breed composition can differ from these values.
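The pedigree-based expectation mentioned here is simply the average of the two parents' compositions. A minimal sketch follows, with a hypothetical "Other" breed standing in for any non-Mangalitsa parent; the function name is also hypothetical.

```python
from fractions import Fraction


def expected_composition(sire, dam):
    """Expected breed composition of an offspring: the mean of both parents."""
    breeds = set(sire) | set(dam)
    zero = Fraction(0)
    return {b: (sire.get(b, zero) + dam.get(b, zero)) / 2 for b in breeds}


pure_mangalitsa = {"Mangalitsa": Fraction(1)}
pure_other = {"Other": Fraction(1)}  # hypothetical second breed

f1 = expected_composition(pure_mangalitsa, pure_other)   # 1/2 Mangalitsa
backcross = expected_composition(f1, pure_mangalitsa)    # 3/4 Mangalitsa
print(f1["Mangalitsa"], backcross["Mangalitsa"])
```

These values are expectations only. Because each parent passes on a random half of its genome, the realized proportion in a backcross animal varies around 3/4, so a genomic estimate that departs from the pedigree value is not necessarily an error.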

Conclusions

The present study represents efforts to implement random forest to identify primarily a single breed, estimate breed composition proportions in other crossbred pedigreed animals and further compare its performance with an alternative method. The results presented in this study allow us to conclude that: 1) while random forest proved to be effective at classifying individuals into breeds, interpreting the probability of a pig being of one breed, as calculated by random forest, as a proxy of breed composition is not accurate, and therefore the linear regression method may be preferred; 2) markers filtered using Fst scores are very effective at estimating breed composition for Mangalitsa individuals while not being as effective when estimating other breeds; and 3) if Fst-filtered markers that are effective at identifying Mangalitsa individuals are used to estimate breed composition for other breeds, a greater number of markers is required to overcome the estimation power bias. It must also be noted that these methods were focused on estimating breed composition of Mangalitsa pigs. Different markers should be selected if different breeds are the focus, and the accuracy of prediction will depend on the breeds that are available to be used as references for the Fst calculations.

Declaration of competing interests

None.

Acknowledgements

The authors thank the Mangalitsa Breed Organization and Registry (MBOAR), which was established in 2019 to conserve the genetics and promote the Mangalitsa breed, the National Swine Registry, the USDA Meat Animal Research Center (USMARC) through the National Animal Germplasm Program (NAGP), and Fast Genetics for providing genotypes from a variety of breeds. The assistance of Barbara Meyer zu Altenschildesche, Tania Issa and Peter Solberg (MBOAR), Dr. Doug Newcom (NSR), Dr. Harvey Blackburn (NAGP) and Dr. Daniela Grossi (Fast Genetics) is gratefully acknowledged. Funding for this study was provided by the Ensminger Fund, State of Iowa and Hatch funds.


Literature cited

Bennasir, H., Sridhar, S., Abdel-Razek, T.T., 2010. Vitamin A from physiology to disease prevention. Int. J. Pharm. Sci. Rev. Res. 1, 68–73.

Bertolini, F., Galimberti, G., Calò, D.G., Schiavo, G., Matassino, D., Fontanesi, L., 2015. Combined use of principal component analysis and random forests identify population-informative single nucleotide polymorphisms: Application in cattle breeds. J. Anim. Breed. Genet. 132, 346–356. https://doi.org/10.1111/jbg.12155

Bertolini, F., Galimberti, G., Schiavo, G., Mastrangelo, S., Di Gerlando, R., Strillacci, M.G., Bagnato, A., Portolano, B., Fontanesi, L., 2018. Preselection statistics and Random Forest classification identify population informative single nucleotide polymorphisms in cosmopolitan and autochthonous cattle breeds. Animal 12, 12–19. https://doi.org/10.1017/S1751731117001355

Bowcock, A.M., Ruiz-Linares, A., Tomfohrde, J., Minch, E., Kidd, J.R., Cavalli-Sforza, L.L., 1994. High resolution of human evolutionary trees with polymorphic microsatellites. Nature 368, 455–457. https://doi.org/10.1038/368455a0

Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32. https://doi.org/10.1023/A:1010933404324

Chen, X., Ishwaran, H., 2012. Random forests for genomic data analysis. Genomics. https://doi.org/10.1016/j.ygeno.2012.04.003

Drögemüller, C., Giese, A., Martins-Wess, F., Wiedemann, S., Andersson, L., Brenig, B., Fries, R., Leeb, T., 2006. The mutation causing the black-and-tan pigmentation phenotype of Mangalitza pigs maps to the porcine ASIP locus but does not affect its coding sequence. Mamm. Genome 17, 58–66. https://doi.org/10.1007/s00335-005-0104-1

Funkhouser, S.A., Bates, R.O., Ernst, C.W., Newcom, D., Steibel, J.P., 2017. Estimation of genome-wide and locus-specific breed composition in pigs1. Transl. Anim. Sci. 1, 36–44. https://doi.org/10.2527/tas2016.0003

Giuffra, E., Evans, G., Törnsten, A., Wales, R., Day, A., Looft, H., Plastow, G., Andersson, L., 1999. The Belt mutation in pigs is an allele at the Dominant white (I/KIT) locus. Mamm. Genome 10, 1132–1136. https://doi.org/10.1007/s003359901178

Gorbach, D.M., Makgahlela, M.L., Reecy, J.M., Kemp, S.J., Baltenweck, I., Ouma, R., Mwai, O., Marshall, K., Murdoch, B., Moore, S., Rothschild, M.F., 2010. Use of SNP genotyping to determine pedigree and breed composition of dairy cattle in Kenya. J. Anim. Breed. Genet. 127, 348–351. https://doi.org/10.1111/j.1439-0388.2010.00864.x

Gupta, A., Tiwari, M., Prasad, S., Chaube, S.K., 2017. Role of Cyclic Nucleotide Phosphodiesterases During Meiotic Resumption From Diplotene Arrest in Mammalian Oocytes. J. Cell. Biochem. 118, 446–452. https://doi.org/10.1002/jcb.25748

Hall, S.J.G., Bradley, D.G., 1995. Conserving livestock breed biodiversity. Trends Ecol. Evol. https://doi.org/10.1016/0169-5347(95)90005-5

Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning., 2nd ed. https://doi.org/10.1007/b94608_1

Huang, Y., Bates, R.O., Ernst, C.W., Fix, J.S., Steibel, J.P., 2014. Estimation of U.S. Yorkshire breed composition using genomic data 1. J. Anim. Sci. 92, 1395–1404. https://doi.org/10.2527/jas.2013-6907

Hulsegge, B., Calus, M.P.L., Windig, J.J., Hoving-Bolink, A.H., Maurice-van Eijndhoven, M.H.T., Hiemstra, S.J., 2013. Selection of SNP from 50K and 777K arrays to predict breed of origin in cattle. J. Anim. Sci. 91, 5128–5134. https://doi.org/10.2527/jas.2013-6678

Jacobs, A., De Noia, M., Praebel, K., Kanstad-Hanssen, Paterno, M., Jackson, D., McGinnity, P., Sturm, A., Elmer, K.R., Llewellyn, M.S., 2018. Genetic fingerprinting of salmon louse (Lepeophtheirus salmonis) populations in the North-East Atlantic using a random forest classification approach. Sci. Rep. 8, 1–9. https://doi.org/10.1038/s41598-018-19323-z

Kijas, J.M.H., Wales, R., Törnsten, A., Chardon, P., Moller, M., Andersson, L., 1998. Melanocortin receptor 1 (MC1R) mutations and coat color in pigs. Genetics 150, 1177–1185.

Kuehn, L.A., Keele, J.W., Bennett, G.L., McDaneld, T.G., Smith, T.P.L., Snelling, W.M., Sonstegard, T.S., Thallman, R.M., 2011. Predicting breed composition using breed frequencies of 50,000 markers from the US Meat Animal Research Center 2,000 bull project. J. Anim. Sci. 89, 1742–1750. https://doi.org/10.2527/jas.2010-3530

Liaw, A., Wiener, M., 2002. Classification and Regression by randomForest. R News 2, 18–22.

Marklund, S., Kijas, J., Rodriguez-Martinez, H., Ronnstrand, L., Funa, K., Moller, M., Lange, D., Edfors-Lilja, I., Andersson, L., 1998. Molecular basis for the dominant white phenotype in the domestic pig. Genome Res. 8, 826–833. https://doi.org/10.1101/gr.8.8.826

Meng, Y.A., Yu, Y., Cupples, L.A., Farrer, L.A., Lunetta, K.L., 2009. Performance of random forest when SNPs are in linkage disequilibrium. BMC Bioinformatics 10, 78. https://doi.org/10.1186/1471-2105-10-78

Munoz, M., J.M., G.-C., E., A., R., B., C., B., C., C., I., F.A., F., G., Y., N., C., O., A, F., C, R., Silió, L., 2020. Development of a 64 SNV panel for breed authentication in Iberian pigs and their derived meat products. Meat Scien 2019, 104265. https://doi.org/10.1016/j.meatsci.2020.108152

Nardone, A., Ronchi, B., Lacetera, N., Ranieri, M.S., Bernabucci, U., 2010. Effects of climate changes on animal production and sustainability of livestock systems. Livest. Sci. 130, 57–69. https://doi.org/10.1016/j.livsci.2010.02.011

Nicolazzi, E.L., Biffani, S., Biscarini, F., Orozco ter Wengel, P., Caprera, A., Nazzicari, N., Stella, A., 2015. Software solutions for the livestock genomics SNP array revolution. Anim. Genet. 46, 343–353. https://doi.org/10.1111/age.12295

Nistor, E., Bampidis, V., Pentea, M., Prundeanu, H., Ciolac, V., 2012. Morphological Indices in Mangalitsa Breed. Anim. Sci. Biotechnol. 45, 390–393.

Oroian, I.G., Petrescu-Mag, I. V, 2014. Mangalitsa breed returns to homeland. Porcine Res. 4, 19–21.

Petrovic, M., Radovic, C., Parunovic, N., Mijatovic, M., Radojkovic, D., Aleksic, S., Stanisic, N., Popovac, M., 2010. Quality traits of carcass sides and meat of Moravka and Mangalitsa pig breeds. Biotechnol. Anim. Husb. 26, 21–27. https://doi.org/10.2298/bah1002021p

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A.R., Bender, D., Maller, J., Sklar, P., De Bakker, P.I.W., Daly, M.J., Sham, P.C., 2007. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575. https://doi.org/10.1086/519795

Qanbari, S., Simianer, H., 2014. Mapping signatures of positive selection in the genome of livestock. Livest. Sci. 166, 133–143. https://doi.org/10.1016/j.livsci.2014.05.003

Schiavo, G., Bertolini, F., Galimberti, G., Bovo, S., Dall’olio, S., Nanni Costa, L., Gallo, M., Fontanesi, L., 2020. A machine learning approach for the identification of population-informative markers from high-throughput genotyping data: Application to several pig breeds. Animal. https://doi.org/10.1017/S1751731119002167

Weir, B.S., Cockerham, C.C., 1984. Estimating F-statistics for the analysis of population structure. Evolution (N. Y). 38, 1358–1370. https://doi.org/10.1111/j.1558-5646.1984.tb05657.x

Wilkinson, S., Wiener, P., Archibald, A.L., Law, A., Schnabel, R.D., McKay, S.D., Taylor, J.F., Ogden, R., 2011. Evaluation of approaches for identifying population informative markers from high density SNP Chips. BMC Genet. 12, 1–14. https://doi.org/10.1186/1471-2156-12-45

Zsolnai, A., Tóth, G., Molnár, J., Stéger, V., Marincs, F., Jánosi, A., Ujhelyi, G., Koppányné Szabó, E., Mohr, A., Anton, I., Szántó-Egész, R., Sipos, R., Egerszegi, I., Dallmann, K., Tóth, P., Micsinai, A., Brüssow, K.-P., Rátky, J., 2013. Looking for breed differentiating SNP loci and for a SNP set for parentage testing in Mangalica. Arch. Anim. Breed. 56, 200–207. https://doi.org/10.7482/0003-9438-56-019

Tables and figures

Table 5.1 Number and usage of pig genotypes according to breed.

Breed          Individuals  Individuals  Used for Fst¹  Used for model  Used for
               before QC    after QC     calculation    training        validation
Berkshire               44           44             34              34          10
Duroc                  111          107             97              97          10
Hampshire              102           96             86              86          10
Hereford                22           22             17              17           5
Landrace                96           95             85              85          10
Large Black              3            3              3               0           0
Mangalitsa 1a           45           38             38              38           0
Mangalitsa 1b            2            1              1               1           0
Mangalitsa 2            48           47              0               0          47
Meishan                 52           52             52               0           0
Pietrain                23           23             18              18           5
Spotted                 10           10             10               0           0
Yorkshire              114          110            100             100          10
Total²                 672          648            541             476         107

¹ Fixation Statistic. ² Sum of pigs across breeds.


Table 5.2 Variables used for prediction per tree for each SNP panel.

Number of markers in panel   Number of decision splits per tree
                        10                                    3
                        50                                    7
                       100                                   10
                       500                                   22
                      1000                                   32
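The values in Table 5.2 coincide with the square root of the panel size rounded to the nearest integer, a common rule of thumb for the number of variables tried per split in random forest classification. Whether the values were derived this way in the study is an assumption; the short check below only shows that the arithmetic matches.

```python
import math

panel_sizes = [10, 50, 100, 500, 1000]

# Rounded square root of the number of markers in each panel.
variables_per_split = {p: round(math.sqrt(p)) for p in panel_sizes}

for panel, n_vars in variables_per_split.items():
    print(f"{panel:>5} markers -> {n_vars} variables")
```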

Table 5.3 Distribution by chromosome of markers for each marker panel for random markers (left) and Fst¹-filtered markers (right).

                      Random markers                Fst-filtered markers
Chromosome     10    50   100   500  1000      10    50   100   500  1000
1               2     8     9    47   108       3     5     9    46    95
2               1     5    10    41    69       0     1     2    22    57
3               1     3     5    31    55       0     2     3    20    46
4               2     2     9    35    62       1     6    13    67   135
5               0     0     2    19    42       2     7     8    34    59
6               0     0     5    31    59       0     3     4    41    68
7               1     4     5    31    54       1     4     9    38    73
8               0     2     6    26    66       0     5     8    27    54
9               0     4     6    31    80       0     2     5    27    48
10              0     0     1     8    20       1     1     3    22    43
11              0     1     3    19    40       0     1     4    13    29
12              0     0     2    11    17       0     0     0     3     6
13              0     6     7    42    71       0     3    12    39    70
14              2     2    12    40    86       0     1     3    19    54
15              0     6     9    33    65       0     1     4    19    40
16              0     3     5    18    35       1     2     2    10    25
17              0     2     0    13    29       1     2     3    11    23
18              0     1     2    14    24       0     4     8    38    66
X               1     1     2    10    18       0     0     0     4     9

¹ Fixation Statistic.


Table 5.4 Ten markers with the highest Fst¹ score.

Marker         Chromosome²  Base pair²  Fst score  Gene³     QTL⁴
ASGA0001873              1    28270187       0.75  PDE7B        5
ALGA0004827              1    87967508       0.74  ---⁵         2
ASGA0001842              1    27620041       0.72  MAP7         3
ASGA0021274              4    97376691       0.72  RORC        38
ALGA0116381              5    66474878       0.79  TCF20        1
ALGA0033035              5    77052782       0.76  SLC38A1      3
H3GA0022659              7    97881096       0.74  ---          5
ASGA0099401             10    15592924       0.74  PLD5         3
MARC0109000             16    63825864       0.82  ADRA1B       3
ALGA0092829             17     4962308       0.74  VPS37A       3

¹ Fixation Statistic. ² Aligned to Sus scrofa genome version 11.1. ³ Gene in which marker is located. ⁴ Number of quantitative trait loci within 0.5 Mb upstream or downstream of marker. ⁵ Not located in a gene.

Table 5.5 Results of breed composition estimation using 50 Fst¹-filtered markers with the linear regression method.

Breed                             # of individuals²   # of individuals < 80%³   %⁴
Berkshire                                        10                         0  100
Duroc                                            10                         4   60
Hampshire                                        10                         3   70
Hereford                                          5                         2   60
Landrace                                         10                         8   20
Purebred Mangalitsa in group 2                   37                        11   30
Pietrain                                          5                         1   80
Yorkshire                                        10                         4   60

¹ Fixation Statistic. ² Total number of individuals in validation data set. ³ Number of individuals with estimated breed composition of less than 80% for their known breed. ⁴ Percentage of individuals with estimated breed composition of less than 80% for their known breed.


Figure 5.1 A. Out-of-bag (OOB) error for each breed along with the average OOB error across all breeds when 10 markers from both selection strategies were used. B. OOB error across breeds and marker selection strategies for marker panels up to 1000 markers using random forest. C. Average estimated breed composition for purebred individuals across panels up to 1000 markers and marker selection strategies using random forest.


Figure 5.2 Comparison of estimated Mangalitsa breed composition for known crossbred individuals and individuals with unknown Mangalitsa breed composition in the validation data set using random forest.


Figure 5.3 Average coefficient of determination (R²) for breed composition estimation for pigs in the validation data set across panels and marker selection strategies.


Figure 5.4 Average estimated Mangalitsa breed composition for purebred individuals across panels and marker selection strategies using linear regression.


Figure 5.5 Comparison of estimated Mangalitsa breed composition for known crossbred individuals and individuals with unknown Mangalitsa breed composition in the validation data set using linear regression.

Appendix 5.1: Supplemental tables and figures

Table 5.6 Description of the QTL associated to the 10 markers with the highest Fst score.

Marker Associated Trait Chromosome Base pair 1 Daily feed intake 1 25010377-27469814 1 : 27620041 Daily feed intake 1 25010377-27469814 Mean platelet volume 1 27020977-27899646 Feed conversion ratio 1 27020977-27899646 Feed conversion ratio 1 27020977-27899646 1 : 28270187 Mean platelet volume 1 27020977-27899646 Feed conversion ratio 1 28050709-28899322 Mean platelet volume 1 28050709-28899322 Linoleic acid content 1 83142255-87072923 1 : 87967508 Carcass length 1 87788568-89684790 Average daily gain 4 23809394-96936045 Shoulder subcutaneous fat thickness 4 33323681-96936045 Average daily gain 4 84951475-96936045 Adipocyte diameter 4 89529616-96936045 Carcass length 4 89529616-96936045 Dressing percentage 4 89529616-96936045 Loin and neck meat weight 4 89529616-96936045 Carcass weight (cold) 4 89529616-96936045 Body weight (slaughter) 4 89529616-96936045 Carcass length 4 89529616-96936045 Dressing - Ham over carcass 4 89529616-96936045 4 : 97376691 Loin and neck meat weight 4 89529616-96936045 Carcass weight (cold) 4 89529616-96936045 Shoulder weight 4 95921536-96488856 Backfat at last rib 4 95921536-96936045 Conductivity 24 hours post-mortem 4 95921536-96936045 Fat-cuts percentage 4 95921536-96936045 Lean meat percentage 4 95921536-96936045 Average daily gain 4 95921536-96936045 Polyunsaturated fatty acid content 4 96030439-96962176 Monounsaturated fatty acid content 4 96030439-96962176 Linoleic acid content 4 96030439-96962176


Table 5.6 continued.

Marker Associated Trait Chromosome Base pair 1 Palmitoleic acid content 4 96456831-96479166 Head weight 4 96936045-98283003 Ham weight 4 96936045-98283003 Shoulder external fat weight 4 96936045-98283003 Shoulder meat weight 4 96936045-98283003 Fat-cuts percentage 4 96936045-98283003 Loin muscle area 4 96936045-98283003 Ham weight 4 96936045-98283003 Head weight 4 96936045-98283003 4 : 97376691 96936045- Meat color-L 4 102054630 96936045- Average daily gain 4 102054630 96936045- Average daily gain 4 102054630 96936045- Corpus luteum number 4 106510877 96936045- subjective pork odor 4 123430356 96936045- Margaric acid content 4 123430356 Meat color chroma 4 96962176-97550675 5 : 66474878 Drip loss 5 66280325-66415278 Unsaturated fatty acid content 5 77055629-82096612 5 : 77052782 Unsaturated fatty acid content 5 77055629-82096612 Saturated fatty acid content 5 77055629-82096612 Left teat number 7 96000929-96926303 Teat number 7 96000929-96926303 7 : 97881096 Head weight 7 96690078-96917300 Backfat at rump 7 96690078-96917300 Thoracic vertebra number 7 97116624-98645440 Gait score (front) 10 14159882-15139496 10 : 15592924 Lumbar vertebra number 10 15042704-15992149 Teat number, difference between sides 10 15042704-15992149


Table 5.6 continued.

Marker¹         Associated Trait                                  Chromosome  Base pair
16 : 63825864   Corpus luteum number                              16          62704437-63215033
16 : 63825864   Gestation length                                  16          63215033-63725230
16 : 63825864   Spinal curvature                                  16          63799494-63944017
17 : 4962308    Actinobacillus pleuropneumoniae susceptibility    17          4082851-64462043
17 : 4962308    Small intestine length                            17          4568863-4609122
17 : 4962308    Actinobacillus pleuropneumoniae susceptibility    17          4609122-66928144
¹ Chromosome : Base pair.

Table 5.7 Average breed composition coefficient for individuals in the validation set obtained through linear regression for all breeds across all panel markers.

Number of  Breed       Berkshire       Duroc           Hampshire       Hereford        Landrace        Mangalitsa      Yorkshire
markers                Fst     Random  Fst     Random  Fst     Random  Fst     Random  Fst     Random  Fst     Random  Fst     Random
10         Berkshire   0.28    0.51    0       0.01    0.41    0.05    0.08    0.4     0.18    0.01    0.01    0       0.04    0.03
10         Duroc       0.07    0       0.22    0.73    0.5     0.03    0.05    0.09    0       0.03    0.01    0.12    0.15    0
10         Hampshire   0.02    0.04    0.08    0.02    0.7     0.69    0.03    0.02    0.01    0.13    0       0.02    0.17    0.08
10         Hereford    0       0.24    0       0.05    0       0.06    0.87    0.5     0.11    0.12    0.02    0       0       0.02
10         Landrace    0       0.13    0.17    0.15    0.27    0.04    0.11    0.08    0.26    0.45    0.06    0.08    0.13    0.07
10         Mangalitsa  0.01    0.02    0       0.13    0       0.07    0.02    0.03    0.05    0.03    0.89    0.71    0.04    0.03
10         Yorkshire   0.02    0.05    0.01    0.07    0.58    0.32    0.06    0.07    0.01    0       0       0.11    0.33    0.39
100        Berkshire   0.89    0.02    0.01    0.92    0.03    0.01    0.01    0.01    0.02    0.02    0.02    0       0.03    0.02
100        Duroc       0       0.01    0.92    0       0.01    0.9     0       0       0       0.04    0.01    0.01    0.05    0.03
100        Hampshire   0.05    0.05    0.02    0.05    0.85    0.04    0.01    0.01    0.02    0.8     0.01    0.03    0.02    0.02
100        Hereford    0.08    0.02    0.02    0.01    0.01    0.01    0.85    0.01    0.02    0.02    0.02    0.05    0       0.87
100        Landrace    0.03    0       0.07    0.03    0.09    0.04    0.02    0.02    0.72    0.05    0.02    0.41    0.04    0.44
100        Mangalitsa  0.01    0.11    0.02    0.04    0.02    0.06    0.01    0.01    0.02    0.03    0.93    0.73    0.01    0.02
100        Yorkshire   0.03    0.48    0.03    0.03    0.03    0.02    0.04    0.44    0.06    0.01    0.01    0.01    0.81    0.01

Table 5.7 continued.

Number of  Breed       Berkshire       Duroc           Hampshire       Hereford        Landrace        Mangalitsa      Yorkshire
markers                Fst     Random  Fst     Random  Fst     Random  Fst     Random  Fst     Random  Fst     Random  Fst     Random
500        Berkshire   0.93    0.95    0.03    0.02    0       0.01    0.01    0.02    0.01    0       0.01    0       0.01    0
500        Duroc       0.01    0       0.97    0.97    0       0.01    0       0       0       0.01    0.01    0.01    0.01    0
500        Hampshire   0.01    0       0       0.01    0.94    0.96    0       0.01    0.02    0.01    0       0       0.02    0.02
500        Hereford    0.07    0.03    0.03    0       0       0.01    0.88    0.89    0.01    0.02    0.01    0.01    0       0.04
500        Landrace    0       0.02    0.07    0.03    0.02    0.02    0.02    0       0.83    0.92    0.01    0.01    0.05    0.01
500        Mangalitsa  0       0       0.02    0.02    0.01    0.03    0       0.01    0.03    0.03    0.92    0.89    0.02    0.02
500        Yorkshire   0.19    0.2     0.01    0.01    0.05    0.01    0.09    0.08    0       0.03    0.03    0.02    0.9     0.65
1000       Berkshire   0.96    0.92    0.01    0.02    0       0.01    0.01    0.04    0.01    0       0       0.01    0       0
1000       Duroc       0       0       0.98    0.97    0       0.01    0       0.01    0       0.01    0.01    0       0.01    0
1000       Hampshire   0.01    0       0       0       0.95    0.97    0       0       2       0       0       0.01    0.02    0.01
1000       Hereford    0.05    0.04    0.02    0       0       0       0.9     0.94    0.01    0.01    0.01    0       0.01    0.01
1000       Landrace    0       0.01    0.06    0.01    0.01    0.02    0       0.01    0.88    0.9     0.01    0.02    0.04    0.04
1000       Mangalitsa  0       0       0.02    0.03    0.01    0.02    0       0       0.02    0.02    0.92    0.9     0       0.02
1000       Yorkshire   0       0.18    0.02    0.02    0.01    0.04    0       0.09    0.02    0.01    0.92    0.03    0.92    0.67


Figure 5.6 A. Out-of-bag (OOB) error for all breeds and general average across marker selection methods when 50 markers were used. B. OOB error for all breeds and general average across marker selection methods when 100 markers were used. C. OOB error for all breeds and general average across marker selection methods when 500 markers were used.


CHAPTER 6. SIGNATURES OF SELECTION AND GENOMIC DIVERSITY OF MUSKELLUNGE (ESOX MASQUINONGY) FROM TWO POPULATIONS IN NORTH AMERICA.

Josue Chinchilla-Vargas1*, Jonathan Meerbeek2, Max F. Rothschild1, Francesca Bertolini3.

1 Iowa State University, Department of Animal Science, Ames, Iowa, 50011. 2 Iowa Department of Natural Resources, Spirit Lake Fish Hatchery, Spirit Lake, Iowa, 51360. 3 National Institute of Aquatic Resources, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark.

Modified from a manuscript submitted to Genes

Abstract

Background

Muskellunge (Esox masquinongy) is the largest and most prized game fish for anglers in North America. However, little is known about Muskellunge genetic diversity in Iowa’s propagation program. We used whole-genome sequence data from 12 brood stock individuals from Iowa and publicly available RAD-seq data from 625 individuals from the St. Lawrence River in Canada to study the genetic differences between populations, analyze signatures of selection that might shed light on environmental adaptations, and evaluate the levels of genetic diversity in both populations. Given that there is no reference genome available for Muskellunge, reads were aligned to the genome of Northern Pike (Esox lucius), a closely related species.

Results

Variant calling produced 7,886,471 biallelic variants for the Iowa population and 16,867 high-quality SNPs that overlap with the Canadian samples. The Ti/Tv values were 1.09 and 1.29 for samples from Iowa and Canada, respectively. PCA and Admixture analyses showed a large genetic difference between Canadian and Iowan populations. Moreover, PCA showed clustering by sex in the Iowan population, although window-based Fst did not find outlier regions. Window-based pooled heterozygosity found 6 highly heterozygous windows containing 244 genes in the Iowa population, and Fst comparing the Iowa and Canadian populations found 14 windows with Fst values larger than 0.9 containing 641 genes. One significantly enriched GO term (sensory perception of pain) was found through pooled heterozygosity analyses. Although not significant, several enriched GO terms associated with growth and development were found through Fst analyses. Inbreeding calculated as Froh was 0.03 on average for the Iowa population and 0.32 on average for the Canadian samples. The inbreeding rate of the Canadian population appears to be higher than that of the Iowa population, presumably due to isolation of subpopulations.

Conclusions

This study is the first to document that brood stock Muskellunge from Iowa show marked genetic differences compared to the Canadian population. Additionally, although genetic differentiation by sex was observed, no major locus for sex differentiation was detected. Inbreeding does not seem to be an immediate concern for Muskellunge in Iowa, but apparent isolation of subpopulations has caused levels of homozygosity to increase in the Canadian Muskellunge population. Finally, these results validate the applicability of using genomes of closely related species to perform genomic analyses when no reference genome assembly is available.

Background

Muskellunge (Esox masquinongy) is a species of freshwater fish native to North America and is the largest species of the Esocidae family. Moreover, Muskellunge is considered the most prized esocid by anglers (Figure 6.1). Originally, the species could be found in large lakes and rivers ranging from central Canada, east in the waters and branches of the St. Lawrence River and even reaching south into Tennessee [1, 2]. Distinct regional strains, each composed of multiple subpopulations, have been identified through genetic data in the upper Mississippi River, the Great Lakes, and the Ohio River [3]. Because of the economic benefits associated with its reputation for sport fishing, Muskellunge was introduced into several states in the US ranging from the Midwest to Texas, and even into Manitoba in Canada. The wide variety of environmental conditions in the states where Muskellunge were introduced highlights the species’ adaptability [1, 2]. Although sporadic sightings of individuals have been reported in Iowa, there have been no official records of native populations in the state [1]. Currently, thanks to the effectiveness of management and stocking practices, populations of Muskellunge can be found in several areas of North America [1, 4]. While a high percentage of these populations require supplementation through periodic stocking, self-sustaining, although not native, populations can still be found in a number of lakes and rivers [1].

Muskellunge were first stocked in Iowa in the 1960s with individuals from Wisconsin that can be traced to the northern strain [5]. However, previous genetic research has shown evidence of admixture in Iowa’s population [1]. Although these findings might point to a certain degree of genetic diversity in the population, one of the most important aspects to consider in the design of management plans for the species is the need to maintain genetic diversity. This objective is paramount given that the vast majority of populations in Iowa depend on stocking for their maintenance. Moreover, the reduced number of lakes used as sources of brood stock for Iowa’s propagation program might accelerate the loss of genetic diversity, as it increases the probability of recapturing the same brood stock fish each year [1, 6].


Currently, recapture of brood stock in the Iowa populations averages 37% annually [1], and therefore introducing individuals from different genetic backgrounds or capturing brood stock from different lakes might be needed. Introducing individuals of multiple genetic backgrounds into a population can have mixed effects. When native populations are small, it is beneficial to increase genetic variance because this permits purifying selection against deleterious variants while allowing positive selection of beneficial ones [7]. In turn, this limits inbreeding depression [8, 9]. However, the potential negative effects of stocking include the loss of traits related to local adaptation [9] and the reduction of genetic diversity in wild populations due to the Ryman-Laikre effect [9–11], which refers to the increase in inbreeding and the reduction in total effective population size (NeT) that arise when a few captive parents produce large numbers of offspring in a wild-captured system [12].

In this study, we used whole-genome sequence data from 12 brood stock individuals from Iowa (6 males and 6 females) and RAD-seq data from 625 individuals from the St. Lawrence River in Canada, available on the SRA (Sequence Read Archive) [9]. The Canadian population is composed of approximately 10 subpopulations sampled from 22 different sites. Additionally, since the two populations have no recent connection, these data provide an excellent opportunity to study signatures of selection in two different Muskellunge populations, with the aim of identifying loci related to adaptation to the specific environments in which each population has evolved.

Given the lack of a Muskellunge reference genome against which to align sequence data, the reference genome of Northern Pike (Esox lucius) was used. The Northern Pike is closely related to Muskellunge and is the most frequently studied member of the esocids [13]. Both species possess the same number of chromosomes [14] and are capable of producing hybrids known as Tiger Muskie (Esox lucius × Esox masquinongy), which are considered valuable trophies by anglers. Nonetheless, Northern Pike is known to have a wider distribution than Muskellunge, inhabiting rivers, lakes and brackish water that range from North America to Europe and Eurasia [15, 16]. This project provides the opportunity to perform a preliminary genomic comparison of these two closely related species.

Results

Whole-genome sequencing and variant calling

After removing duplicate reads, whole-genome sequencing of the 12 individuals from Iowa produced an average of 217,503,897 (± 131,486,905) reads per individual, of which 96.27% were considered of high quality (quality score >20) and retained for further analyses. On average, 86% (± 0.57) of the reads mapped to the reference Northern Pike genome. This produced an average depth of 26.26x that ranged from 8x to 49x. In the case of the samples from the Canadian population, 86% of the reads were considered high quality, 63% of the high-quality reads were successfully aligned to the Northern Pike genome, and a depth of 11.44x was obtained for the sequenced sections of the genome. Details on the sequence data of each individual from Iowa and the averages for the Iowa and Canadian populations are shown in Table 6.1.

Breadth of coverage at different depth thresholds is shown in Table 6.2. After aligning the reads of Muskellunge from Iowa to the Northern Pike genome, 80% of the bases were covered with a depth larger than 0x. Overall, 66% of the bases were covered with a depth of at least 10x, which was used as the threshold for inclusion in all downstream analyses. Also, 0.03% of the genome was covered at a depth higher than 1000x, potentially pointing to highly repetitive segments that could not be aligned properly and were therefore counted as the same segment [17].
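As an illustration of how breadth of coverage at several depth thresholds can be summarized, the following minimal Python sketch computes the fraction of reference positions covered at or above each threshold. The function name, the inclusive thresholds and the toy depth values are assumptions made for this example and do not reproduce the pipeline used in the study.

```python
import numpy as np

def breadth_of_coverage(depth, thresholds=(1, 10, 1000)):
    """Return the fraction of reference positions with depth >= each threshold."""
    depth = np.asarray(depth)
    return {t: float(np.mean(depth >= t)) for t in thresholds}

# Toy per-base depths for a 10-bp "reference" (illustrative values only).
toy_depth = [0, 0, 5, 12, 30, 10, 1500, 9, 11, 25]
breadth = breadth_of_coverage(toy_depth)
for threshold, fraction in breadth.items():
    print(f"depth >= {threshold}x: {fraction:.0%} of bases")
```

In practice the per-base depths would come from the alignment files rather than a hand-written list.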

Figure 6.2 shows the average depth of coverage per megabase across the twelve individuals from Iowa after the sequence reads were aligned to the Northern Pike genome. The overall average depth across all 1-megabase windows was 21.16x with a standard deviation of 4.44x. Depth of coverage was highest at chromosome 9, megabase 18.5, reaching a depth of 40.4x. Additional peaks were observed at chromosome 11, megabase 22 and chromosome 23, megabase 10, with depths of 37.1x and 38.6x, respectively. Additionally, chromosomes 2, 3, 4, 6, 8, 9, 12, 14, 15, 16, 18, 20, 22 and 25 produced a depth of 0x on the last megabase. Finally, all chromosomes showed higher depth towards the centromere compared to the telomeric regions.

The variant calling pipeline produced three large sets of single nucleotide polymorphisms (SNPs) that were used in further analyses. When variants were called for the twelve whole-genome sequenced samples from Iowa, a total of 36,627,942 biallelic SNPs were called, of which 8,218,039 were not monomorphic. Given the large number of SNPs, all SNPs that did not have a call rate of 100% were dropped, resulting in a final set of 7,886,471 SNPs that were used for all analyses involving only individuals from Iowa. The variant calling of Canadian samples produced 128,213 biallelic SNPs, of which 16,059 were not monomorphic. The transition/transversion ratio (Ti/Tv) was 1.09 and 1.29 for samples from Iowa and Canada, respectively.
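The two filters described above (discarding monomorphic SNPs and applying a minimum call rate) can be sketched as follows. This is a minimal Python example that assumes genotypes are coded as alternate-allele counts (0, 1, 2) with -1 for a missing call; the coding, the function name and the toy matrix are hypothetical and are not taken from the study's pipeline.

```python
import numpy as np

def filter_snps(genotypes, min_call_rate=1.0):
    """Return a boolean mask of SNPs (columns) that are polymorphic
    and meet the minimum call rate.

    genotypes: individuals x SNPs array of alternate-allele counts
               (0, 1, 2), with -1 marking a missing call (assumed coding).
    """
    g = np.asarray(genotypes)
    called = g >= 0
    call_rate = called.mean(axis=0)
    alt_count = np.where(called, g, 0).sum(axis=0)
    n_alleles = 2 * called.sum(axis=0)
    # A SNP is polymorphic if both alleles are observed among called genotypes.
    polymorphic = (alt_count > 0) & (alt_count < n_alleles)
    return polymorphic & (call_rate >= min_call_rate)

# Toy matrix: 4 individuals x 4 SNPs (illustrative values only).
toy_genotypes = [
    [0, 2, 1, 0],
    [1, 2, -1, 0],
    [2, 2, 0, 0],
    [0, 2, 1, 1],
]
keep_strict = filter_snps(toy_genotypes, min_call_rate=1.0)
keep_relaxed = filter_snps(toy_genotypes, min_call_rate=0.75)
print(keep_strict.tolist(), keep_relaxed.tolist())
```

The second SNP is dropped in both cases because it is monomorphic, while the third SNP is kept only under the relaxed call-rate threshold because one genotype is missing.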

When combining the Canadian and Iowan datasets, a total of 108,132 biallelic SNPs were called, with 22,705 SNPs not being monomorphic. After retaining only SNPs with a call rate greater than or equal to 90%, a total of 16,867 SNPs were kept for further analyses. The number of Iowa-specific biallelic SNPs and of biallelic SNPs called for Iowa and Canada, along with the density of SNPs per megabase, are shown in Figure 6.3 panels A and B, respectively. As shown in Figure 6.3A, chromosome 19 showed the highest number of biallelic SNPs for the Iowa samples with 5,675 SNPs, while chromosome 11 showed the most SNPs common to the Canada and Iowa samples with 1,125 SNPs. Chromosome 25 contained the fewest SNPs in both sets, with 2,525 and 331, respectively. On average, 4,162 and 644 SNPs per chromosome were called for the samples from Iowa and the combined samples, with standard deviations of 872.84 and 216.78, respectively. When looking at the density of SNPs per megabase as shown in Figure 6.3B, the chromosome with the highest density was chromosome 23 with 136.54 SNPs per megabase when SNPs were called for the Iowan population only, and chromosome 20 with 23.41 SNPs per megabase when SNPs were called for the Canadian and Iowan populations simultaneously. On average, 120 SNPs per megabase were called for the Iowa population and 18 SNPs per megabase were called for the Canadian and Iowan populations combined, with standard deviations of 10.44 and 2.62 SNPs per megabase, respectively.

Population stratification analyses

Principal Component Analysis (PCA) results for the Iowa and Canadian populations are shown in Figure 6.4. No clustering was observed when comparing Iowan samples (Figure 6.4A). Given that individuals were sampled at two different lakes that are known to be connected, these results were not surprising, although it is important to note that individuals 4 and 5 did not cluster with the rest of the samples. However, in Figure 6.4B a clustering by sex can be observed, with the exception of individuals 4 and 5, which did not cluster with the other males. This clustering may indicate some level of genetic differences between individuals of the different sexes and not sex differences themselves. Figure 6.4C shows several clusters identified in the Canadian population, which are likely related to the part of the water system from which the individuals were sampled. When PCA was performed on Iowa and Canada samples combined (Figure 6.4D), populations from Iowa and Canada clustered separately, indicating genetic differences between the two populations.
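To illustrate how a PCA of this kind separates genetically distinct groups, the sketch below runs a PCA on a small genotype matrix through a singular value decomposition of the column-centered data. It is a minimal Python example under stated assumptions: genotypes are coded as 0, 1, 2 allele counts, the toy matrix is invented, and the software used for Figure 6.4 is not reproduced here.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """PCA of an individuals x SNPs genotype matrix via SVD.

    Returns the principal component scores and the proportion of
    variance explained by each retained component.
    """
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)          # center each SNP (column)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]

# Toy data: rows 0-2 and rows 3-5 mimic two differentiated populations.
toy_genotypes = [
    [0, 0, 2, 2, 0, 1],
    [0, 1, 2, 2, 0, 0],
    [0, 0, 2, 1, 0, 0],
    [2, 2, 0, 0, 2, 1],
    [2, 1, 0, 0, 2, 2],
    [2, 2, 0, 1, 2, 2],
]
scores, explained = genotype_pca(toy_genotypes)
print(np.round(scores[:, 0], 2), np.round(explained, 2))
```

With these toy data the two groups fall on opposite sides of the first principal component, which is the pattern described for the combined Iowa and Canada samples.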


Admixture analyses confirmed the findings from PCA. Results from admixture analyses are shown in Figure 6.5. When both the Iowa and Canada populations were analyzed, the likely number of subpopulations was between 12 and 19, as these values of k produced the lowest cross-validation error (Figure 6.5A). Here, Admixture detected clear differences between populations from Iowa and Canada. The differences between populations were so large that when the results of Admixture with k = 2 were plotted (Figure 6.5B), all 12 individuals from Iowa showed a composition >0.9 for the same subpopulation while all samples from Canada showed a composition of 1.0 for the second subpopulation. In a similar manner, when ancestry estimations were plotted for k values of 12 and 19 (shown in Figure 6.10 panels A and B, respectively), all 12 individuals from Iowa grouped in the same subpopulation with a composition > 0.9 in both cases, illustrating the high degree of differentiation that exists between samples from Iowa and Canada.

Pooled heterozygosity and genome wide Fst.

Regardless of the subgroup of the Iowa population that was considered for pooled heterozygosity (Hp) analyses (all individuals, females only or males only), the same windows were found to be highly heterozygous, as shown in Figure 6.6. All six windows showed normalized pooled heterozygosity scores that were more than three standard deviations from the mean and were therefore identified as outliers. Table 6.3 shows the six windows that had high heterozygosity. In total, 244 genes were found in these windows, although only 53 have been previously annotated (genes and coordinates shown in Table 6.5). Given the small number of annotated genes found in these six windows, gene ontology analyses showed only one significantly enriched term, sensory perception of pain. Other enriched terms included sensory perception, positive regulation of synaptic transmission and regulation of glial cell proliferation (complete list of enriched GO terms related to Hp analyses shown in Table 6.6).
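The outlier rule described above can be sketched in a few lines of Python. The example assumes the commonly used definition of pooled heterozygosity for a window, Hp = 2 ΣnMAJ ΣnMIN / (ΣnMAJ + ΣnMIN)^2, where nMAJ and nMIN are the major and minor allele counts at each SNP; this section does not restate the formula, so the definition, the function names and the toy values are assumptions for illustration only.

```python
import numpy as np

def pooled_heterozygosity(ref_counts, alt_counts):
    """Hp for one window from per-SNP reference and alternate allele counts."""
    ref = np.asarray(ref_counts)
    alt = np.asarray(alt_counts)
    major = np.maximum(ref, alt).sum()   # sum of major-allele counts
    minor = np.minimum(ref, alt).sum()   # sum of minor-allele counts
    return 2.0 * major * minor / (major + minor) ** 2

def z_scores(values):
    """Normalize window scores to mean 0 and standard deviation 1."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()

# A window where both alleles are equally frequent, and a fixed window.
hp_balanced = pooled_heterozygosity([12, 12], [12, 12])
hp_fixed = pooled_heterozygosity([24, 24], [0, 0])

# Toy genome of 20 windows: 19 typical windows and one highly heterozygous.
toy_hp = [0.2] * 19 + [0.5]
z = z_scores(toy_hp)
outlier_windows = np.flatnonzero(np.abs(z) > 3)   # more than 3 SD from the mean
print(hp_balanced, hp_fixed, outlier_windows.tolist())
```

Only the last toy window exceeds three standard deviations from the mean, mirroring how the six highly heterozygous windows were flagged.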

Although PCA showed a clear clustering according to sex in the Iowan population, Fst analyses did not provide any insight into the differences. As shown in Figure 6.7, there were no windows with mFst values above 0.9, and the overall Fst value between sexes was 0.05.

Population stratification analyses showed that Iowa and Canada have markedly different genetic backgrounds, and this is reinforced by the mFst values obtained for the comparison between both populations, where the overall Fst value was 0.24. Window-based Fst results are shown in Figure 6.8. In total, 14 windows produced an mFst value larger than 0.9 and 8 of these windows had an mFst value of 1, indicating that the majority of the SNPs in the window are fixed or almost fixed for opposite alleles. All windows that were deemed of interest after performing analyses of signatures of selection had been sequenced at a depth that ranged from 17x to 32x, and the results are thus considered accurate. This warrants a more in-depth analysis that might shed light on regions of the genome that are responsible for adaptation to the different specific environments. A total of 641 genes were identified in the 14 windows with mFst scores higher than 0.9. However, as in the case of Hp analyses, only 331 of these genes have been annotated, and no statistically significant enriched terms were found (list of annotated genes found in mFst windows with scores higher than 0.9 shown in Table 6.7). Although not significant, several GO terms associated with development and growth were enriched; these included negative regulation of developmental process, positive regulation of chondrocyte differentiation and positive regulation of cartilage development, among others (complete list of enriched GO terms related to Fst shown in Table 6.8).
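A window-based mean Fst of this kind can be illustrated with the short Python sketch below. This section does not restate which Fst estimator or window size was used, so the example assumes a simple per-SNP formulation, Fst = (HT - HS) / HT with equally weighted populations, averaged over non-overlapping windows; the function names, window size and allele frequencies are hypothetical.

```python
import numpy as np

def per_snp_fst(p1, p2):
    """Per-SNP Fst = (HT - HS) / HT for two equally weighted populations.

    p1, p2: allele frequencies of the same allele in each population.
    SNPs that are monomorphic across both populations return NaN.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)                        # total heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2    # mean within-population
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(h_t > 0, (h_t - h_s) / h_t, np.nan)

def windowed_mean_fst(positions, fst, window_size):
    """Mean Fst (mFst) per non-overlapping window, keyed by window index."""
    positions = np.asarray(positions)
    fst = np.asarray(fst, dtype=float)
    index = positions // window_size
    return {int(w): float(np.nanmean(fst[index == w])) for w in np.unique(index)}

# Toy example: three SNPs fixed for opposite alleles, two with equal frequencies.
positions = [100, 200, 300, 1100, 1200]
freq_iowa = [0.0, 0.0, 1.0, 0.5, 0.2]
freq_canada = [1.0, 1.0, 0.0, 0.5, 0.2]
mfst = windowed_mean_fst(positions, per_snp_fst(freq_iowa, freq_canada), 1000)
print(mfst)
```

The first toy window reaches an mFst of 1 because every SNP in it is fixed for opposite alleles, which is the interpretation given in the text for windows with an mFst value of 1.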

Inbreeding and runs of homozygosity (ROH)

Inbreeding coefficients ranged from 0.00 to 0.44 depending on the level of stringency used to call a segment as a run of homozygosity. Of the six stringency levels at which runs of homozygosity were analyzed, the level considered to produce the most realistic levels of inbreeding used windows that included at least 20 SNPs while allowing a maximum of three heterozygotes. Individual details for the Iowa population and the average for the Canadian population are found in Table 6.4. On average, individuals from Iowa showed 3.5 ROH segments with a length of 36,699.20 Kb. The individual with the highest number of ROH segments was sample 9, a female from Big Spirit Lake, with 7 segments of 50,042.8 Kb in length on average. In contrast, sample 1, a female from Okoboji, did not show any ROH segments. On average, females showed slightly higher numbers of ROH segments than males; however, these segments were approximately 2,000 Kb shorter than in males. As shown in Figure 6.9, individuals from Canada showed a markedly higher level of inbreeding than samples from Iowa (average 0.32). Additionally, the Canadian population showed a slightly wider distribution of estimated inbreeding coefficients, ranging from 0.25 to 0.38, while the Iowa population showed a very narrow range from 0.00 to 0.05. The length of the segments in both populations is very similar, spanning about 6,500 Kb.
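For reference, an ROH-based inbreeding coefficient is typically obtained by dividing the total length of an individual's ROH segments by the length of the genome covered by the analysis. The Python sketch below shows that arithmetic under this assumption; the segment lengths and the genome length are hypothetical values chosen for the example, not figures from this study.

```python
def froh(roh_lengths_kb, genome_length_kb):
    """F_ROH: proportion of the analyzed genome contained in ROH segments."""
    if genome_length_kb <= 0:
        raise ValueError("genome_length_kb must be positive")
    return sum(roh_lengths_kb) / genome_length_kb

# Hypothetical individual with three ROH segments and a hypothetical
# analyzed genome length of 900,000 Kb.
toy_segments_kb = [5000, 7000, 6000]
toy_genome_kb = 900_000
toy_froh = froh(toy_segments_kb, toy_genome_kb)
print(round(toy_froh, 3))
```

An individual without any ROH segments, such as sample 1 above, would receive a coefficient of 0 under this definition.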

Discussion

Whole-genome sequencing, alignment to Northern Pike genome and variant calling

One of the main limitations of the present study is the absence of a reference genome for Muskellunge. Therefore, the traditional bioinformatics pipeline used in whole-genome sequencing analyses had to be adapted to map the reads against the available reference genome of Northern Pike, an esocid species closely related to Muskellunge. There is evidence that using a highly related species is a valid option in mammals [18, 19], where Donkey (Equus asinus) and Water Buffalo (Bubalus bubalis) reads were aligned to the Horse (Equus caballus) and Cow (Bos taurus) genomes, respectively. Moreover, this approach has been used previously in Muskellunge [9]. The effectiveness of this approach in fish is reflected in the high percentage of reads that were correctly aligned to the Northern Pike genome (86%) and the breadth of coverage at a depth of at least 10x obtained after alignment of Muskellunge reads to the Northern Pike genome (66%). That said, there are clear alignment issues, possibly due to differences between species, that are reflected in the regions that show read depths higher than 1,000x. These read depths can arise from copy number variations and/or chromosomal differences within species [20]. However, it is known that next-generation sequencing has inherent issues with repetitive regions due to its short read length [17], and therefore the exact cause cannot be determined.

The Ti/Tv value refers to the ratio of transitions to transversions observed in the variants called. Transitions are substitutions between nucleotides of the same type (purine to purine or pyrimidine to pyrimidine), while transversions are substitutions from a pyrimidine to a purine or vice versa [21]. The Ti/Tv ratios observed in the datasets used in this research were in line with what has been reported in fish. While in mammals the Ti/Tv value is expected to be near 2.0 [21], values in fish have been observed to be lower: 1.28 for a closely related pike species [22], 1.49 in salmonids [23, 24] and ranging from 0.28 to 1.49 in several teleost species [24–26]. While the increasing number of teleost species sequenced has confirmed this difference compared with mammals and the need to investigate its evolutionary meaning, this value can also be used as a quality parameter for variant calling. This is of particular relevance in our work, as the genome of a closely related species was used as the reference.
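The definition above translates directly into a small calculation. The following Python sketch classifies each biallelic SNP as a transition or a transversion and returns the ratio; the function names and the toy list of reference/alternate allele pairs are assumptions made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True if ref -> alt stays within purines or within pyrimidines."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES

def ti_tv_ratio(snps):
    """Ti/Tv ratio for an iterable of (ref, alt) biallelic SNPs with ref != alt."""
    snps = list(snps)
    transitions = sum(is_transition(ref, alt) for ref, alt in snps)
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return transitions / transversions

# Toy call set: four transitions and two transversions.
toy_snps = [("A", "G"), ("C", "T"), ("T", "C"), ("G", "A"), ("A", "C"), ("G", "T")]
toy_ratio = ti_tv_ratio(toy_snps)
print(toy_ratio)
```

Applied to a real call set, a ratio far from the range reported for related species would be a signal to inspect the alignment and variant calling steps.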

Population stratification

The PCA results reinforce the hypothesis that individuals from Iowa originate from the same strain. Even though individuals were caught at two different lakes, these lakes are interconnected, with fish from Spirit Lake being able to swim into Okoboji but not in reverse. Moreover, both lakes are stocked with fish from the same hatchery, where brood stock from the two lakes are mated and no pre-selection based on genetics is performed. However, individuals s4 and s5 did not cluster with the rest of the samples from Iowa in Figure 6.4A and B. This may indicate that, despite the lack of pre-selection of the brood stock, the Iowan population shows a degree of genetic diversity. Nonetheless, the low number of individuals sampled does not allow one to evaluate the degree of separation. The multiple clusters seen in Figure 6.4C and D are consistent with the results of Rougemont et al. [9], where the fish were sampled from 22 different locations. Furthermore, admixture analyses confirm the large number of subpopulations seen in the Canadian samples as well as the marked difference seen between the Iowa and Canada populations. The original research determined that the number of subpopulations present in these samples was between 8 and 13, while our results indicate that the number lies between 12 and 19; these values could differ because a different version of the reference genome was used in the previous study [9]. The marked differences between Iowa and Canada samples seen in PCA and admixture results support the idea that both populations have adaptations to their specific environments that have caused them to diverge. Furthermore, these results support previous findings in that they suggest a number of genetically different populations throughout the geographical distribution of Muskellunge [27]. These differences are also likely to be amplified by the populations having different origins, given that Iowa Muskellunge originally descended from fish from Wisconsin [1, 5] while fish from Canada descend from local brood stock [9].

Signatures of selection and inbreeding

Pooled heterozygosity revealed six windows of higher heterozygosity along the genome, and no windows showed high homozygosity regardless of how the Iowa population was parsed. These six windows contained a high number of genes, with an average of 40.6 genes per window. These results exemplify the low homozygosity estimated in the Iowa population and support the results of ROH analyses. Although the only significantly enriched GO term was related to perception of pain, several other GO terms found were related to perception and neurological processes, such as positive regulation of synaptic transmission, regulation of glial cell proliferation and positive regulation of glial cell proliferation. However, due to the small sample size, these GO terms have to be interpreted with caution.

Results of mFst agree with PCA and admixture analyses, showcasing the marked genetic differences between both populations. As with Hp, the 14 windows that had mFst values above 0.9 contained a large number of genes, with an average of 46 genes per window. Several windows were found to have mFst scores of 1 when the populations from Iowa and Canada were compared. This reinforces the findings of PCA and Admixture analyses and was expected since, to the best of our knowledge, both populations have distinct origins and have been isolated from each other. Previous research has shown the presence of private alleles in the majority of populations of esocids in the Great Lakes [3, 28–30]. If the same is true for the Iowa and Canadian populations, alleles private to either population could be responsible for the high mFst scores found when comparing them. Since the reference genome of Esox lucius is not thoroughly annotated, a large number of genes located in the windows with high mFst values were unannotated, which prevented the gene ontology analyses from yielding significant results. That said, although not significant, a large number of enriched GO terms were related to growth and cellular differentiation, which highlights differences between both populations and perhaps their adaptation to the different environmental conditions present in their geographical locations. Additionally, after manual verification, several genes were found to be linked to congenital disorders and other fitness effects. Genes linked to congenital and behavioral disorders included dynein axonemal intermediate chain 1 [31], Rho GTPase Activating Protein 36 [32], ATPase Na+/K+ Transporting Subunit Alpha 3 [33] and 5-Hydroxytryptamine Receptor 2C [34], while genes related to fitness effects included Cysteine-Three-Histidine [35].

In teleost fish, sex determination is achieved through a wide variety of mechanisms that include genetic, environmental and social factors [36]. The clustering of sexes seen in Figure 6.4B indicates that genetic differences between the sexes may exist. Previous research found that the master sex-determining gene in Northern Pike is located on chromosome (i.e., linkage group) 24 [37]. This motivated us to perform an Fst analysis between sexes despite the low number of animals. Here, when allele frequencies were compared between males and females from Iowa, low mFst values were found across the genome, including chromosome 24. This may indicate that, although Northern Pike and Muskellunge are closely related species, sex determination differs between the two species. Investigating the presence of one or more genomic regions that contribute to sex determination in Muskellunge would require a high-quality Muskellunge reference genome.

Given the type of genomic data available, runs of homozygosity were deemed the most appropriate method to estimate inbreeding. However, given the arbitrary manner in which stringency levels are set to identify ROH segments [38], several thresholds were tried before reporting a final result. The small number of ROH segments detected at the chosen stringency level reinforces the results of pooled heterozygosity analyses, given that neither analysis showed highly homozygous regions. It also indicates that inbreeding depression does not represent an immediate concern for the Muskellunge population in Iowa. However, a larger number of individuals is needed to confirm these results. Although inbreeding may not pose a threat in the short term for Iowa’s Muskellunge population, caution must be taken since previous research in Minnesota has found statistically detectable reductions in genetic diversity compared with the wild source population [3]. Therefore, it is paramount to implement measures that limit the loss of genetic diversity in the population, especially in the lakes where brood stock is routinely captured. As suggested by Miller et al., measures aimed at this purpose include increasing the frequency at which wild germplasm is collected and using larger numbers of adults as brood stock [3].

In the case of the Canadian population, the markedly higher inbreeding coefficient is in line with previous research, confirming that the genetic structure of this population represents bottlenecked subpopulations of the overall St. Lawrence River population. These bottleneck events could have been caused by the small number of founders from the populations used to stock these locations [9]. Given these results, attention should be paid to managing the genetic diversity within the different subpopulations. This aspect is critical to support the genetic viability of native populations, as it maintains a set of diverse genetic resources for reintroducing the species to lakes where populations have disappeared and for supplementing other populations that show loss of genetic diversity [3]. Moreover, it has been shown that Muskellunge display a high degree of spatial genetic structure, with clearly defined subpopulations within each population. This could be due to geographic isolation or to the known fidelity to spawning habitats that the species shows [39, 40], which further supports these results. Under this scenario, populations would differentiate from each other, giving rise to the distinct subpopulations found with stratification analyses. As a result, homozygosity would rise within each subpopulation [41], producing the inbreeding seen through Froh analyses.

Interestingly, the length of the ROH segments seems to be similar in both populations, which may indicate that the inbreeding that produced the homozygosity happened at similar times. This notion is reinforced by the fact that stocking in Iowa started in the 1960s [1, 5] and Canada’s stocking began in 1951 [9]. If this assumption holds true, the higher estimated levels of inbreeding in the Canadian population would indicate a higher rate of inbreeding in this population.

Conclusions

This genomic study is the first of its kind to focus on the Muskellunge population in Iowa. The results of the study provide the following conclusions:

• Although special attention is needed to filter variants appropriately, using the genome of a closely related species (Northern Pike) as a reference for alignment is a valid approach to perform population genomic analyses when no existing reference genome is available.

• Despite genetic differentiation based on sex, no major locus has been detected.

• Muskellunge from Canada and Iowa represent two clearly distinct populations with different estimated rates of inbreeding.

• Inbreeding does not seem to be an immediate concern for Muskellunge in Iowa compared to the Canadian population.

• Apparent isolation of subpopulations has caused levels of homozygosity to be higher in the Canadian Muskellunge population.

These results provide insight into the validity of using genomes of closely related species to perform genomic analyses of species that lack a reference genome assembly.

Additionally, these results can be used to assess the long-term viability of the current management practices of Muskellunge in Iowa.

Methods

Individuals and sequencing

Muskellunge are routinely sampled by Iowa’s Department of Natural Resources (DNR) as part of their hatchery operations by humanely netting random individuals each spring. As part of these standard operations, the DNR obtains very small fin clips to estimate the age of the fish and for other projects. Whole-genome sequence was produced from these samples for 12 individuals from 2 lakes (6 from East Okoboji, 3 males and 3 females, and 6 from Big Spirit Lake, 3 females and 3 males) with Illumina paired-end sequencing, performed by Neogen (Lincoln, Nebraska). Additionally, raw sequence reads of 625 samples were recovered from NCBI’s BioProject with accession number PRJNA512459 [42]. These data correspond to RAD-seq data of Muskellunge from different Canadian locations (detailed explanation found in Rougemont et al., 2019) [9].

Bioinformatics pipeline

For all reads (Iowan and Canadian samples), read quality assessment was performed with FastQC 0.11.5 [43]. Then, reads were trimmed and filtered with Trimmomatic 0.36 [44], cropping the first 10 bases of each read and using a sliding window of 4 base pairs with a minimum quality of 20 and a minimum read length of 40 bp. Given that there is no available reference genome for Muskellunge, the reference genome of a closely related fish, Northern Pike (Esox lucius), version fEsoLuc1.pri, was used to align the reads. Alignment was performed with BWA mem 0.7.17 [45] using default options.

SAMtools 1.10 (http://samtools.sourceforge.net) was used to remove duplicate reads and low-quality mapped reads (q<20), while BCFtools 1.10.2 (http://samtools.github.io/bcftools/bcftools.html) was used to call and filter variants. For both the Iowa and Canada populations, a minimum depth of 10x and a quality score of at least 20 were required to retain a variant. For both datasets, only biallelic SNPs were retained to minimize the risk of including alleles from Northern Pike in downstream analyses. Additionally, monomorphic variants were removed before downstream analyses.
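The retention rules described above can be illustrated with a minimal Python sketch. It is not the BCFtools command used in this study; the `Variant` record, its field names and the example values are hypothetical and serve only to make the four criteria (depth, quality, biallelic SNP, polymorphic) explicit.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Variant:
    """Hypothetical, simplified variant record (not a real VCF parser)."""
    chrom: str
    pos: int
    ref: str
    alts: Tuple[str, ...]   # alternate alleles reported at the site
    qual: float             # variant quality score
    depth: int              # read depth at the site
    alt_count: int          # alternate allele copies among called genotypes
    n_alleles: int          # total allele copies among called genotypes


def keep_variant(v: Variant, min_depth: int = 10, min_qual: float = 20.0) -> bool:
    """Return True if the variant passes the filters stated in the text."""
    # Biallelic SNP: exactly one alternate allele, both alleles one base long.
    is_biallelic_snp = (
        len(v.alts) == 1 and len(v.ref) == 1 and len(v.alts[0]) == 1
    )
    # Polymorphic: both alleles are observed in the sample.
    is_polymorphic = 0 < v.alt_count < v.n_alleles
    return (
        v.depth >= min_depth
        and v.qual >= min_qual
        and is_biallelic_snp
        and is_polymorphic
    )


# Hypothetical records, for illustration only.
examples = [
    Variant("LG01", 1000, "A", ("G",), 35.0, 14, 5, 24),      # passes
    Variant("LG01", 2000, "A", ("G", "T"), 35.0, 14, 5, 24),  # multiallelic
    Variant("LG01", 3000, "C", ("T",), 35.0, 8, 5, 24),       # depth below 10
    Variant("LG01", 4000, "C", ("T",), 35.0, 14, 24, 24),     # monomorphic
]
kept = [v for v in examples if keep_variant(v)]
```

Fixed alternate sites such as the last record are the ones most likely to reflect differences between Muskellunge and the Northern Pike reference rather than variation within Muskellunge.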

Population stratification analyses

The software Admixture 1.3.0 [46] was used to estimate population stratification across the Iowa and Canadian populations combined as well as within each population. The --cv flag was used to produce the cross-validation error, and the number of subpopulations was chosen where the cross-validation error was lowest or reached an inflection point. Additionally, principal component analyses (PCA) were performed with the --pca flag in Plink 1.9 [47] to visualize population clustering.
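As a small illustration of the first selection rule, the sketch below picks the number of subpopulations (K) with the lowest cross-validation error. The error values are hypothetical, and the function name is an assumption; the inflection-point criterion requires visual judgment and is not modelled here.

```python
def best_k(cv_errors):
    """Return the K with the lowest cross-validation error.

    cv_errors maps each tested number of subpopulations (K) to the
    cross-validation error reported for that run.
    """
    if not cv_errors:
        raise ValueError("no cross-validation errors supplied")
    return min(cv_errors, key=cv_errors.get)


# Hypothetical cross-validation errors for K = 1..5 (illustration only).
cv_errors = {1: 0.62, 2: 0.48, 3: 0.51, 4: 0.55, 5: 0.60}
chosen_k = best_k(cv_errors)
```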

Pooled heterozygosity and Fst

To investigate the differences between subpopulations, fixation index (Fst) and pooled heterozygosity (Hp) analyses were performed for all individuals from Iowa together, for males only and for females only, since PCA showed clustering between sexes. Hp was used to calculate the whole-genome distribution of heterozygosity, averaged over 0.5 Mb sliding windows with 50% overlap. For each window, Hp values were calculated using the following formula [25, 48]:

$$H_p = \frac{2 \sum n_{MAJ} \sum n_{MIN}}{\left(\sum n_{MAJ} + \sum n_{MIN}\right)^2}$$

where ΣnMAJ and ΣnMIN are the sums of the counts of major and minor alleles, respectively, counted at all SNPs in the window.

These values were then transformed into Z scores:

ZHp = (Hp − µHp)/σHp [48].
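The two formulas above can be sketched in Python as follows. This is an illustration rather than the script used in the study: the function names and allele counts are hypothetical, and the use of the population standard deviation for σHp is an assumption, since the text does not state which estimator was applied.

```python
from statistics import mean, pstdev


def pooled_heterozygosity(sum_major, sum_minor):
    """Hp = 2 * sum(nMAJ) * sum(nMIN) / (sum(nMAJ) + sum(nMIN))^2 for one window."""
    total = sum_major + sum_minor
    if total == 0:
        raise ValueError("window contains no allele counts")
    return 2.0 * sum_major * sum_minor / total ** 2


def z_transform(values):
    """ZHp = (Hp - mean(Hp)) / sd(Hp), computed across all windows."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all windows have identical Hp")
    return [(v - mu) / sigma for v in values]


# Hypothetical (major, minor) allele count sums for three windows.
windows = [(90, 10), (50, 50), (100, 0)]
hp_values = [pooled_heterozygosity(maj, mnr) for maj, mnr in windows]
zhp_values = z_transform(hp_values)
```

In this sketch Hp reaches its maximum of 0.5 when major and minor allele counts are equal and falls to 0 when a window is fixed for one allele, so strongly negative ZHp values flag windows of unusually low heterozygosity.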

The Fst score for each SNP was estimated using the --fst flag in Plink 1.9, followed by the calculation of mean Fst values (mFst) in 500 Kbp windows with 50% overlap using an in-house script, as performed by Bertolini et al. [26]. mFst scores were calculated between the populations of Iowa and Canada.
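The in-house script is not reproduced in this document. The following Python sketch shows one way the windowing could be done for a single chromosome, assuming 500 Kbp windows that advance in 250 Kbp steps (50% overlap) and that windows without SNPs are skipped; the function name and the example values are hypothetical.

```python
def windowed_mean_fst(snps, window=500_000, step=250_000):
    """Mean Fst per overlapping window on one chromosome.

    snps: list of (position_bp, fst) tuples.
    Returns a list of (window_start_bp, mean_fst, n_snps); windows that
    contain no SNPs are omitted.
    """
    if not snps:
        return []
    last_pos = max(pos for pos, _ in snps)
    results = []
    start = 0
    while start <= last_pos:
        # Collect Fst values of SNPs falling in [start, start + window).
        values = [fst for pos, fst in snps if start <= pos < start + window]
        if values:
            results.append((start, sum(values) / len(values), len(values)))
        start += step
    return results


# Hypothetical per-SNP Fst values (illustration only).
snps = [(100_000, 0.10), (300_000, 0.30), (600_000, 0.50)]
mfst = windowed_mean_fst(snps)
```

With a 50% overlap, each SNP beyond the first half-window contributes to two consecutive windows, which smooths the mFst profile along the chromosome.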

Gene ontology analyses of the genes contained in the windows of interest for Hp and mFst analyses were performed using FishEnrichr [49, 50]. Terms were considered significant when a corrected P-value ≤ 0.05 was produced.

Inbreeding and runs of homozygosity (ROH)

Given the lack of pedigree information for the individuals included in this research, inbreeding (F) was estimated from runs of homozygosity (Froh). This was deemed the most appropriate method to estimate the levels of inbreeding of both Muskellunge populations included in the study. To estimate Froh, a percentage of homozygosity was calculated by summing ROH >1 Mb across the covered genome and dividing by the total base pairs represented in the SNP data obtained by calling SNPs from the Canadian and Iowan populations simultaneously. Runs of homozygosity were called using the --homozyg flag in Plink 1.9. To be considered ROH, segments had to be at least 1 Mb in length and have a maximum gap between SNPs of 500 Kb. However, several levels of stringency for other criteria were used to calculate ROH segments: three window sizes were examined (5, 10 and 20 SNPs) with 1, 2 and 3 heterozygotes per window allowed. These 6 levels of stringency were used to calculate Froh.
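The Froh calculation described above reduces to a ratio, sketched below in Python. The segment lengths and the covered genome length are hypothetical values chosen for illustration, and the function name is an assumption; in practice the segment lengths would come from the output of the ROH calling step.

```python
def froh(roh_lengths_bp, covered_genome_bp, min_length_bp=1_000_000):
    """Fraction of the covered genome contained in ROH segments.

    roh_lengths_bp: lengths (bp) of the ROH segments of one individual.
    covered_genome_bp: total base pairs represented in the SNP data.
    Segments shorter than min_length_bp are ignored.
    """
    if covered_genome_bp <= 0:
        raise ValueError("covered genome length must be positive")
    total_roh = sum(
        length for length in roh_lengths_bp if length >= min_length_bp
    )
    return total_roh / covered_genome_bp


# Hypothetical individual: two ROH of 6 and 7 Mb, plus a 0.4 Mb segment
# that is too short to count, over a hypothetical 650 Mb of covered genome.
example_froh = froh([6_000_000, 7_000_000, 400_000], 650_000_000)
```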

Declarations

Ethics approval and consent to participate

Each year brood stock are routinely and humanely captured by the Iowa Department of Natural Resources and saved for reproduction and small sample collection. Fin samples are routinely collected for a variety of research projects by the Iowa Department of Natural Resources using standard practices for internal use and hence no animal care approval was needed. Data from Canada were publicly available at NCBI’s BioProject Repository, accession PRJNA512459. Finally, we confirm that all methods were carried out in accordance with relevant guidelines and regulations.

Availability of data and materials

Whole-genome sequence data produced for this research have been submitted to NCBI’s Sequence Read Archive under BioProject PRJNA695782. Link to data: https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA695782

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Funding

Financial support was provided in part by the State of Iowa and Hatch funds.

Authors’ Contributions

JC-V, FB and MFR did the experimental design. JC-V did the data analysis and management and wrote the manuscript. JM, FB and MR did the manuscript review and editing.

All authors contributed to the article and approved the submitted version.

Acknowledgements

The authors thank the personnel at the Spirit Lake Hatchery and Iowa Department of Natural Resources, including George Scholten and Daniel Vogeler, for their sample collection and sharing of information and fish tissue.

Literature cited

1. Meerbeek J. Iowa’s Muskellunge management plan. 2014. doi:10.13140/RG.2.2.32430.46400.

2. Crossman EJ. Taxonomy and distribution of North American esocids. Fish Soc Spec Publ. 1978;:13–26.

3. Miller LM, Farrell JM, Kapuscinski KL, Scribner K, Sloss BL, Turnquist KN, et al. A Review of Muskellunge Population Genetics: Implications for Management and Future Research Needs. Am Fish Soc. 2017;85 November 2018:385–414.

4. Kerr SJ. Distribution and management of Muskellunge in North America: An overview. Ontario, Canada; 2011.

5. Madden K, Lynch A. Notes on the First Rearing and Introduction of Esox masquinongy in Iowa Waters. Proc Iowa Acad Sci. 1962;69. https://scholarworks.uni.edu/pias/vol69/iss1/45. Accessed 6 Jan 2021.

6. Jennings MJ, Sloss BL, Hatzenbeler GR, Kampa JM, Simonson TD, Avelallemant SP, et al. Implementation of Genetic Conservation Practices in a Muskellunge Propagation and Stocking Program. Fisheries. 2010;35:388–95. doi:10.1577/1548-8446-35.8.388.

7. Whitlock MC, Bürger R. Fixation of New Mutations in Small Populations. Cambridge University Press; 2004.

8. Bataillon T, Kirkpatrick M. Inbreeding depression due to mildly deleterious mutations in finite populations: Size does matter. Genet Res. 2000;75:75–81. doi:10.1017/S0016672399004048.

9. Rougemont Q, Carrier A, Le Luyer J, Ferchaud AL, Farrell JM, Hatin D, et al. Combining population genomics and forward simulations to investigate stocking impacts: A case study of Muskellunge (Esox masquinongy) from the St. Lawrence River basin. Evol Appl. 2019;12:902–22.

10. Ryman N, Laikre L. Effects of Supportive Breeding on the Genetically Effective Population Size. Conserv Biol. 1991;5:325–9. doi:10.1111/j.1523-1739.1991.tb00144.x.

11. Laikre L, Ryman N. Effects on intraspecific biodiversity from harvesting and enhancing natural populations. Ambio. 1996;25:504–9.

12. Waples RS, Hindar K, Karlsson S, Hard JJ. Evaluating the Ryman-Laikre effect for marine stock enhancement and aquaculture. Curr Zool. 2016;62:617–27. doi:10.1093/cz/zow060.

13. Rondeau EB, Minkley DR, Leong JS, Messmer AM, Jantzen JR, Von Schalburg KR, et al. The genome and linkage map of the Northern Pike (Esox lucius): Conserved synteny revealed between the salmonid sister group and the neoteleostei. PLoS One. 2014;9. doi:10.1371/journal.pone.0102089.

14. Davisson MT. Karyotypes of the Teleost Family Esocidae. J Fish Res Board Canada. 1972;29:579–82.

15. Craig JF. A short review of Pike ecology. Hydrobiologia. 2008;601:5–16. doi:10.1007/s10750-007-9262-3.

16. Forsman A, Tibblin P, Berggren H, Nordahl O, Koch-Schmidt P, Larsson P. Pike (Esox lucius) as an emerging model organism for studies in ecology and evolutionary biology: A review. J Fish Biol. 2015;87:472–9. doi:10.1111/jfb.12712.

17. Giani AM, Gallo GR, Gianfranceschi L, Formenti G. Long walk to genomics: History and current approaches to genome sequencing and assembly. Comput Struct Biotechnol J. 2019;18:9–19. doi:10.1016/J.CSBJ.2019.11.002.

18. Bertolini F, Scimone C, Geraci C, Schiavo G, Utzeri VJ, Chiofalo V, et al. Next generation semiconductor based sequencing of the donkey (Equus asinus) genome provided comparative sequence data against the horse genome and a few millions of single nucleotide polymorphisms. PLoS One. 2015;10:1–18.

19. Iamartino D, Nicolazzi EL, Van Tassell CP, Reecy JM, Fritz-Waters ER, Koltes JE, et al. Design and validation of a 90K SNP genotyping assay for the water buffalo (Bubalus bubalis). PLoS One. 2017;12. doi:10.1371/journal.pone.0185220.

20. Fromer M, Moran JL, Chambert K, Banks E, Bergen SE, Ruderfer DM, et al. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth. Am J Hum Genet. 2012;91:597–607.

21. Depristo MA, Banks E, Poplin R, Garimella K V., Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43:491–501. doi:10.1038/ng.806.

22. Lucentini L, Puletti ME, Ricciolini C, Gigliarelli L, Fontaneto D, Lanfaloni L, et al. Molecular and phenotypic evidence of a new species of genus Esox (Esocidae, Esociformes, Actinopterygii): The Southern Pike, Esox flaviae. PLoS One. 2011;6:e25218. doi:10.1371/journal.pone.0025218.

23. Smith CT, Elfstrom CM, Seeb LW, Seeb JE. Use of sequence data from Rainbow trout and Atlantic salmon for SNP detection in Pacific Salmon. Mol Ecol. 2005;14:4193–203. doi:10.1111/j.1365-294X.2005.02731.x.

24. McKay SJ, Devlin RH, Smith MJ. Phylogeny of Pacific Salmon and Trout based on growth hormone type-2 and mitochondrial NADH dehydrogenase subunit 3 DNA sequences. Can J Fish Aquat Sci. 1996;53:1165–76.

25. Bertolini F, Geraci C, Schiavo G, Sardina MT, Chiofalo V, Fontanesi L. Whole genome semiconductor based sequencing of farmed European sea bass (Dicentrarchus labrax) Mediterranean genetic stocks using a DNA pooling approach. Mar Genomics. 2016;28:63–70. doi:10.1016/j.margen.2016.03.007.

26. Bertolini F, Ribani A, Capoccioni F, Buttazzoni L, Utzeri VJ, Bovo S, et al. Identification of a major locus determining a pigmentation defect in cultivated gilthead seabream (Sparus aurata). Anim Genet. 2020;51:319–23. doi:10.1111/age.12890.

27. Turnquist KN, Larson WA, Farrell JM, Hanchin PA, Kapuscinski KL, Miller LM, et al. Genetic structure of Muskellunge in the Great Lakes region and the effects of supplementation on genetic integrity of wild populations. J Great Lakes Res. 2017;43:1141–52.

28. Jennings MJ, Hatzenbeler GR, Kampa JM. Spring capture site fidelity of adult Muskellunge in inland lakes. North Am J Fish Manag. 2011;31:461–7. doi:10.1080/02755947.2011.590118.

29. Bosworth A, Farrell JM. Genetic Divergence among Northern Pike from Spawning Locations in the Upper St. Lawrence River. North Am J Fish Manag. 2006;26:676–84. doi:10.1577/M05-060.1.

30. Miller LM, Kallemeyn L, Senanan W. Spawning-Site and Natal-Site Fidelity by Northern Pike in a Large Lake: Mark–Recapture and Genetic Evidence. Trans Am Fish Soc. 2001;130:307–16. doi:10.1577/1548-8659(2001)130<0307:ssansf>2.0.co;2.

31. Li Y, Yagi H, Onuoha EO, Damerla RR, Francis R, Furutani Y, et al. DNAH6 and Its Interactions with PCD Genes in Heterotaxy and Primary Ciliary Dyskinesia. PLoS Genet. 2016;12.

32. Ota T, Suzuki Y, Nishikawa T, Otsuki T, Sugiyama T, Irie R, et al. Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat Genet. 2004;36:40–5. doi:10.1038/ng1285.

33. Sánchez E, Azcona LJ, Paisán-Ruiz C. Pla2g6 Deficiency in Zebrafish Leads to Dopaminergic Cell Death, Axonal Degeneration, Increased β-Synuclein Expression, and Defects in Brain Functions and Pathways. Mol Neurobiol. 2018;55:6734–54.

34. Klee EW, Schneider H, Clark KJ, Cousin MA, Ebbert JO, Hooten WM, et al. Zebrafish: A model for the study of addiction genetics. Human Genetics. 2012;131:977–1008.

35. Thompson MJ, Lai WS, Taylor GA, Blackshear PJ. Cloning and characterization of two yeast genes encoding members of the CCCH class of zinc finger proteins: Zinc finger-mediated impairment of cell growth. Gene. 1996;174:225–33.

36. Sandra GE, Norma MM. Sexual determination and differentiation in teleost fish. Reviews in Fish Biology and Fisheries. 2010;20:101–21. doi:10.1007/s11160-009-9123-4.

37. Pan Q, Feron R, Yano A, Guyomard R, Jouanno E, Vigouroux E, et al. Identification of the master sex determining gene in Northern Pike (Esox lucius) reveals restricted sex chromosome differentiation. PLoS Genet. 2019;15:e1008013. doi:10.1371/journal.pgen.1008013.

38. Meyermans R, Gorssen W, Buys N, Janssens S. How to study runs of homozygosity using plink? a guide for analyzing medium density snp data in livestock and pet species. BMC Genomics. 2020;21.

39. Kapuscinski KL, Sloss BL, Farrell JM. Genetic population structure of Muskellunge in the great lakes. Trans Am Fish Soc. 2013;142:1075–89. doi:10.1080/00028487.2013.799515.

40. Wilson CC, Liskauskas AP, Wozney KM. Pronounced Genetic Structure and Site Fidelity among Native Muskellunge Populations in Lake Huron and Georgian Bay. Trans Am Fish Soc. 2016;145:1290–302. doi:10.1080/00028487.2016.1209556.

41. Falconer DS, Mackay TFC. Introduction to Quantitative Genetics. Fourth. Essex, England: Longman Group Limited; 1996.

42. Rougemont Q, Carrier A, Le Luyer J, Ferchaud AL, Farrell JM, Hatin D, et al. Esox masquinongy (Accession: PRJNA512459 ID 512459) - BioProject - NCBI . https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA512459. Accessed 9 Jan 2021.

43. Andrews S. FastQC: A Quality Control Tool for High Throughput Sequence Data. 2010. https://qubeshub.org/resources/fastqc.

44. Bolger AM, Lohse M, Usadel B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–20. doi:10.1093/bioinformatics/btu170.

45. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26:589–95. doi:10.1093/bioinformatics/btp698.

46. Alexander DH, Lange K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics. 2011;12:246. doi:10.1186/1471-2105-12-246.

47. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75.

48. Rubin CJ, Zody MC, Eriksson J, Meadows JRS, Sherwood E, Webster MT, et al. Whole-genome resequencing reveals loci under selection during chicken domestication. Nature. 2010;464:587–91. doi:10.1038/nature08832.

49. Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles G V., et al. Enrichr: Interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013;14. doi:10.1186/1471-2105-14-128.

50. Kuleshov M V., Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44:W90–7. doi:10.1093/nar/gkw377.

Tables and Figures

Table 6.1 Depth of coverage for raw whole-genome sequence data for Iowa samples.

| Sample # | Sex | Lake | Aligned reads (#)¹ | Aligned reads (%) | High-quality reads (%)² | Retained depth (x) |
|---|---|---|---|---|---|---|
| 1 | Female | Okoboji | 203,824,613 | 86.34 | 96.85 | 16.23 |
| 2 | Female | Okoboji | 315,700,163 | 86.16 | 96.78 | 24.93 |
| 3 | Female | Okoboji | 306,850,333 | 86.04 | 96.99 | 24.34 |
| 4 | Male | Okoboji | 99,981,043 | 87.30 | 96.22 | 8.26 |
| 5 | Male | Okoboji | 150,029,226 | 86.38 | 96.75 | 12.12 |
| 6 | Male | Okoboji | 538,733,400 | 87.32 | 96.44 | 41.53 |
| 7 | Female | Big Spirit | 438,773,328 | 86.29 | 95.68 | 49.98 |
| 8 | Female | Big Spirit | 538,733,400 | 85.62 | 95.29 | 47.41 |
| 9 | Female | Big Spirit | 510,088,829 | 85.62 | 94.77 | 47.41 |
| 10 | Male | Big Spirit | 157,555,467 | 86.76 | 96.56 | 14.34 |
| 11 | Male | Big Spirit | 140,430,988 | 86.73 | 96.41 | 12.81 |
| 12 | Male | Big Spirit | 177,235,006 | 85.92 | 96.56 | 15.8 |
| Iowa Average | --- | --- | 217,503,897 | 86.37 | 96.27 | 26.26 |
| Canada Average | --- | --- | 1,022,373 | 63.11 | 86.15 | 11.44³ |

1. Reads aligned to Northern Pike reference genome. 2. Quality score > 20. 3. Depth was calculated for sequenced sections.

Table 6.2 Average breadth of coverage across Iowa samples.

| Depth threshold | # of bases above threshold | Percentage |
|---|---|---|
| >0x | 735,465,012 | 80.05 |
| 10x | 607,490,936 | 66.12 |
| 20x | 539,961,789 | 58.77 |
| 50x | 21,395,310 | 2.33 |
| 100x | 7,981,231 | 0.87 |
| 1000x | 260,029 | 0.03 |

Table 6.3 Pooled heterozygosity values for individuals from Iowa.

| Chromosome | Megabase | Minor allele counts | Major allele counts | Hp¹ | zHp² |
|---|---|---|---|---|---|
| 17 | 48.5 | 746 | 1,366 | 0.0009 | 32.71 |
| 2 | 40 | 826 | 2,702 | 0.0006 | 19.24 |
| 14 | 37.5 | 1,622 | 3,994 | 0.0004 | 11.77 |
| 3 | 36.5 | 2,954 | 6,190 | 0.0002 | 6.9 |
| 4 | 35 | 2,605 | 9,131 | 0.0002 | 5.19 |
| 19 | 47.5 | 4,486 | 8,906 | 0.0001 | 4.44 |

1. Pooled heterozygosity. 2. Normalized pooled heterozygosity values.

Table 6.4 Individual and average Froh scores.

| ID | Sex | Lake | # ROH¹ | Total Kb | Av. length (Kb)² | Froh |
|---|---|---|---|---|---|---|
| S1 | Female | Okoboji | 0 | 0.00 | 0.00 | 0.00 |
| S2 | Female | Okoboji | 2 | 12,622.20 | 6,311.08 | 0.01 |
| S3 | Female | Okoboji | 3 | 23,042.30 | 7,680.75 | 0.03 |
| S4 | Male | Okoboji | 2 | 12,730.70 | 6,365.36 | 0.01 |
| S5 | Male | Okoboji | 2 | 15,358.40 | 7,679.21 | 0.02 |
| S6 | Male | Okoboji | 4 | 24,561.20 | 6,140.30 | 0.03 |
| S7 | Female | Big Spirit | 5 | 36,351.40 | 7,270.28 | 0.04 |
| S8 | Female | Big Spirit | 5 | 36,499.60 | 7,299.91 | 0.04 |
| S9 | Female | Big Spirit | 7 | 50,042.80 | 7,148.97 | 0.05 |
| S10 | Male | Big Spirit | 4 | 31,859.10 | 7,964.77 | 0.03 |
| S11 | Male | Big Spirit | 3 | 21,129.30 | 7,043.11 | 0.02 |
| S12 | Male | Big Spirit | 5 | 36,699.20 | 7,339.84 | 0.04 |
| Canada average³ | -- | -- | 46 | 294,087.37 | 6,446.51 | 0.32 |

1. Number of segments considered runs of homozygosity. 2. Average length of ROH segments in kilobases. 3. Average for all Canadian samples.

Figure 6.1 Specimen of Muskellunge (Esox masquinongy) caught and released in Iowa from an artificially stocked lake.

Figure 6.2 Average depth of coverage per mega base across all Iowa samples.

Figure 6.3 Distribution of SNPs by chromosome.

Figure 6.4 Principal component analysis (PCA) results. A. Samples from Iowa colored by lake of origin. Big Spirit in blue and Okoboji in red. B. Samples from Iowa colored by sex. Females in orange and males in green. C. Samples from Canada. D. Iowa and Canada populations combined. Canada samples in red and Iowa samples in teal. PC1 and PC2 indicate principal component 1 and 2, respectively.

Figure 6.5 A. Cross-validation error value for multiple subpopulation numbers. B. Admixture plot for two subpopulations.

Figure 6.6 Mean pooled heterozygosity (Hp) values for 0.5 mega base windows with a 50% overlap for: A. All individuals B. Females only. C. Males only.

Figure 6.7 Mean Fst (mFst) values for 0.5 mega base windows with a 50% overlap contrasting males and females in the Iowan population.

Figure 6.8 Mean Fst (mFst) values for 0.5 mega base windows with a 50% overlap contrasting the populations of Iowa and Canada.

Figure 6.9 Distribution of inbreeding coefficients (Froh) for the populations of Iowa and Canada.

Appendix 6.1 Supplemental tables and figures

Table 6.5 Annotated genes located in windows found significant in pooled heterozygosity analyses.

| Gene | Gene description | Chr¹ | Start (bp) |
|---|---|---|---|
| PPARG | peroxisome proliferator-activated receptor gamma [Source:NCBI gene;Acc:105013043] | 2 | 199,500 |
| tsen2 | tRNA splicing endonuclease subunit 2 [Source:NCBI gene;Acc:105013054] | 2 | 263,483 |
| mkrn2 | makorin ring finger protein 2 [Source:NCBI gene;Acc:105013060] | 2 | 294,423 |
| raf1b | Raf-1 proto-oncogene, serine/threonine kinase [Source:NCBI gene;Acc:105013076] | 2 | 318,819 |
| cnbpb | CCHC-type zinc finger nucleic acid binding protein [Source:NCBI gene;Acc:105013083] | 2 | 365,194 |
| isy1 | ISY1 splicing factor homolog [Source:NCBI gene;Acc:105013103] | 2 | 386,837 |
| zgc:77375 | haloacid dehalogenase-like hydrolase domain-containing 5 [Source:NCBI gene;Acc:105013113] | 2 | 400,963 |
| rab43 | RAB43, member RAS oncogene family [Source:NCBI gene;Acc:105013124] | 2 | 419,887 |
| nuf2 | NUF2 component of NDC80 kinetochore complex [Source:NCBI gene;Acc:393280] | 2 | 423,635 |
| efcc1 | EF-hand and coiled-coil domain containing 1 [Source:NCBI gene;Acc:105013136] | 2 | 433,320 |
| cacnb2b | calcium channel, voltage-dependent, beta 2b [Source:ZFIN;Acc:ZDB-GENE-050208-129] | 3 | 24,401 |
| si:dkey-32m20.1 | transmembrane protein 236 [Source:NCBI gene;Acc:105030192] | 3 | 59,909 |
| stam | signal transducing adaptor molecule [Source:NCBI gene;Acc:105030193] | 3 | 71,732 |
| clul1 | clusterin-like protein 1 [Source:NCBI gene;Acc:105030195] | 3 | 124,782 |
| adcyap1b | glucagon family neuropeptides [Source:NCBI gene;Acc:105030196] | 3 | 152,142 |
| emilin2a | EMILIN-2 [Source:NCBI gene;Acc:105030198] | 3 | 243,954 |
| lpin2 | lipin 2 [Source:ZFIN;Acc:ZDB-GENE-060503-153] | 3 | 272,602 |
| myom1a | myomesin 1a (skelemin) [Source:ZFIN;Acc:ZDB-GENE-030131-2856] | 3 | 317,132 |
| bag1 | BAG cochaperone 1 [Source:NCBI gene;Acc:105007713] | 3 | 373,993 |
| chmp5a | charged multivesicular body protein 5 [Source:NCBI gene;Acc:105007712] | 3 | 391,776 |
| oprk1 | delta-type opioid receptor-like [Source:NCBI gene;Acc:105007147] | 3 | 412,131 |
| atp6v1h | ATPase H+ transporting V1 subunit H [Source:NCBI gene;Acc:105007143] | 3 | 430,192 |
| rgs20 | regulator of G-protein signaling 20 [Source:NCBI gene;Acc:105007144] | 3 | 472,460 |
| sh2d1ab | SH2 domain containing 1A duplicate b [Source:ZFIN;Acc:ZDB-GENE-091204-326] | 4 | 27,411 |
| SH3BGRL | SH3 domain binding glutamate rich protein like [Source:NCBI gene;Acc:105007782] | 4 | 365,748 |
| TBX22 | T-box transcription factor 22 [Source:NCBI gene;Acc:105007239] | 4 | 447,612 |
| tent5d | uncharacterized LOC105026969 [Source:NCBI gene;Acc:105026969] | 7 | 31,760,094 |
| sh3bgrl | SH3 domain-binding glutamic acid-rich-like protein [Source:NCBI gene;Acc:109614642] | 7 | 31,821,004 |
| shank3b | SH3 and multiple ankyrin repeat domains 3b [Source:ZFIN;Acc:ZDB-GENE-041210-74] | 14 | 10,400 |
| hspa14 | heat shock protein 14 [Source:NCBI gene;Acc:565232] | 14 | 160,520 |
| dclre1c | DNA cross-link repair 1C [Source:NCBI gene;Acc:105015375] | 14 | 173,391 |
| meig1 | meiosis/spermiogenesis associated 1 [Source:NCBI gene;Acc:105015368] | 14 | 181,688 |
| tmem243b | transmembrane protein 243 [Source:NCBI gene;Acc:105015369] | 14 | 186,480 |
| dmtf1 | cyclin D binding myb-like transcription factor 1 [Source:NCBI gene;Acc:436938] | 14 | 198,857 |
| cwf19l1 | CWF19 like cell cycle control factor 1 [Source:NCBI gene;Acc:105015373] | 14 | 221,913 |
| cd9b | CD9 antigen [Source:NCBI gene;Acc:105015366] | 14 | 249,742 |
| ano2b | anoctamin-1 [Source:NCBI gene;Acc:105015365] | 14 | 286,330 |
| ntf3 | neurotrophin 3 [Source:NCBI gene;Acc:109614653] | 14 | 406,864 |
| gls2b | glutaminase kidney isoform, mitochondrial [Source:NCBI gene;Acc:105007068] | 17 | 55,800 |
| rdh5 | retinol dehydrogenase 5 (11-cis/9-cis) [Source:ZFIN;Acc:ZDB-GENE-050208-411] | 17 | 112,220 |
| bloc1s1 | biogenesis of lysosomal organelles complex 1 subunit 1 [Source:NCBI gene;Acc:105007070] | 17 | 132,917 |
| ITGA7 | integrin subunit alpha 7 [Source:NCBI gene;Acc:105006268] | 17 | 166,171 |
| dnase1l4.1 | deoxyribonuclease 1 like 4, tandem duplicate 1 [Source:ZFIN;Acc:ZDB-GENE-040808-35] | 17 | 267,634 |
| ccdc51 | coiled-coil domain containing 51 [Source:ZFIN;Acc:ZDB-GENE-060825-200] | 17 | 276,545 |
| mcm2 | minichromosome maintenance complex component 2 [Source:ZFIN;Acc:ZDB-GENE-020419-24] | 17 | 281,847 |
| zgc:171566 | zgc:171566 [Source:ZFIN;Acc:ZDB-GENE-050309-193] | 17 | 308,307 |
| lsm3 | Esox lucius LSM3 homolog, U6 small nuclear RNA and mRNA degradation associated (lsm3), mRNA. | 17 | 338,809 |
| RGS20 | regulator of G-protein signaling 20-like [Source:NCBI gene;Acc:105019398] | 21 | 24,223,913 |
| OPRK1 | delta-type opioid receptor [Source:NCBI gene;Acc:105008246] | 21 | 24,315,737 |
| LPIN2 | lipin 2 [Source:HGNC Symbol;Acc:HGNC:14450] | 21 | 24,491,168 |
| CLUL1 | clusterin like 1 [Source:HGNC Symbol;Acc:HGNC:2096] | 21 | 24,890,920 |

1. Chromosome.

Table 6.6 List of enriched GO terms related to Hp analyses.

Nominal Adjusted Genes related Term P-value P-value to GO term sensory perception of pain (GO:0019233) 0.000176767 0.034646375 oprk1;adcyap1b sensory perception (GO:0007600) 0.001176234 0.076847292 oprk1;adcyap1b positive regulation of synaptic transmission, glutamatergic (GO:0051968) 0.001176234 0.076847292 shank3b;adcyap1b regulation of oligodendrocyte progenitor proliferation (GO:0070445) 0.015204682 0.094129479 adcyap1b cytoplasmic mRNA processing body assembly (GO:0033962) 0.015204682 0.094129479 lsm3 regulation of glial cell proliferation (GO:0060251) 0.017716664 0.094129479 adcyap1b tRNA splicing, via endonucleolytic cleavage and ligation (GO:0006388) 0.017716664 0.094129479 tsen2 pituitary gland development (GO:0021983) 0.015204682 0.094129479 adcyap1b positive regulation of excitatory postsynaptic potential (GO:2000463) 0.015204682 0.094129479 shank3b regulation of G-protein coupled receptor protein signaling pathway (GO:0008277) 0.015204682 0.094129479 adcyap1b positive regulation of neural retina development (GO:0061075) 0.015204682 0.094129479 atp6v1h 195 positive regulation of peptide hormone secretion (GO:0090277) 0.015204682 0.094129479 adcyap1b

modulation of excitatory postsynaptic potential (GO:0098815) 0.015204682 0.094129479 shank3b purine ribonucleoside monophosphate metabolic process (GO:0009167) 0.020222363 0.094129479 adcyap1b positive regulation of synaptic transmission (GO:0050806) 0.004447641 0.094129479 shank3b;adcyap1b spliceosomal conformational changes to generate catalytic conformation (GO:0000393) 0.015204682 0.094129479 isy1 ATP metabolic process (GO:0046034) 0.017716664 0.094129479 adcyap1b regulation of cAMP-mediated signaling (GO:0043949) 0.017716664 0.094129479 adcyap1b regulation of potassium ion transport (GO:0043266) 0.020222363 0.094129479 adcyap1b regulation of gliogenesis (GO:0014013) 0.025214977 0.094129479 adcyap1b postsynaptic density assembly (GO:0097107) 0.015204682 0.094129479 shank3b excitatory synapse assembly (GO:1904861) 0.015204682 0.094129479 shank3b postsynaptic density organization (GO:0097106) 0.015204682 0.094129479 shank3b postsynaptic specialization assembly (GO:0098698) 0.015204682 0.094129479 shank3b negative regulation of potassium ion transport (GO:0043267) 0.015204682 0.094129479 adcyap1b

Table 6.6 continued.

Nominal Adjusted Genes related Term P-value P-value to GO term cellular response to glucocorticoid stimulus (GO:0071385) 0.015204682 0.094129479 adcyap1b dendritic spine morphogenesis (GO:0060997) 0.020222363 0.094129479 shank3b nerve growth factor signaling pathway (GO:0038180) 0.017716664 0.094129479 ntf3 positive regulation of cytokine production (GO:0001819) 0.025214977 0.094129479 adcyap1b peptide hormone secretion (GO:0030072) 0.020222363 0.094129479 adcyap1b chromosome organization (GO:0051276) 0.030182647 0.094129479 nuf2 regulation of cation channel activity (GO:2001257) 0.003011172 0.094129479 shank3b;cacnb2b regulation of calcium ion transmembrane transporter activity (GO:1901019) 0.022721796 0.094129479 cacnb2b response to glucocorticoid (GO:0051384) 0.025214977 0.094129479 adcyap1b regulation of voltage-gated calcium channel activity (GO:1901385) 0.030182647 0.094129479 cacnb2b interstrand cross-link repair (GO:0036297) 0.027701923 0.094129479 dclre1c 196 neurotrophin signaling pathway (GO:0038179) 0.017716664 0.094129479 ntf3

positive regulation of dendritic spine development (GO:0060999) 0.022721796 0.094129479 shank3b telomere capping (GO:0016233) 0.020222363 0.094129479 dclre1c regulation of synaptic transmission, glutamatergic (GO:0051966) 0.004226419 0.094129479 shank3b;adcyap1b regulation of blood vessel diameter (GO:0097746) 0.020222363 0.094129479 adcyap1b purine ribonucleotide metabolic process (GO:0009150) 0.030182647 0.094129479 adcyap1b non-recombinational repair (GO:0000726) 0.032657166 0.094129479 dclre1c positive regulation of cell projection organization (GO:0031346) 0.027701923 0.094129479 adcyap1b negative regulation of ion transport (GO:0043271) 0.027701923 0.094129479 adcyap1b attachment of mitotic spindle microtubules to kinetochore (GO:0051315) 0.017716664 0.094129479 nuf2 positive regulation of dendrite development (GO:1900006) 0.025214977 0.094129479 shank3b double-strand break repair via nonhomologous end joining (GO:0006303) 0.032657166 0.094129479 dclre1c DNA-templated transcription, initiation (GO:0006352) 0.032657166 0.094129479 dmtf1 dendritic spine organization (GO:0097061) 0.022721796 0.094129479 shank3b

Table 6.6 continued.

Term Nominal P-value Adjusted P-value Genes related to GO term
cellular response to nerve growth factor stimulus (GO:1990090) 0.030182647 0.094129479 ntf3
regulation of neural retina development (GO:0061074) 0.030182647 0.094129479 atp6v1h
regulation of protein localization (GO:0032880) 0.027701923 0.094129479 adcyap1b
DNA-templated transcription, termination (GO:0006353) 0.027701923 0.094129479 dmtf1
glycerophospholipid biosynthetic process (GO:0046474) 0.032657166 0.094129479 zgc:77375
response to alcohol (GO:0097305) 0.030182647 0.094129479 adcyap1b
positive regulation of cell-substrate adhesion (GO:0010811) 0.030182647 0.094129479 emilin2a
folic acid-containing compound metabolic process (GO:0006760) 0.032657166 0.094129479 zgc:171566
response to ethanol (GO:0045471) 0.030182647 0.094129479 adcyap1b
response to unfolded protein (GO:0006986) 0.027701923 0.094129479 hspa14
cellular response to unfolded protein (GO:0034620) 0.027701923 0.094129479 hspa14
mitotic metaphase plate congression (GO:0007080) 0.030182647 0.094129479 nuf2
cellular response to topologically incorrect protein (GO:0035967) 0.032657166 0.094129479 hspa14
neuron projection development (GO:0031175) 0.002928041 0.094129479 ntf3;shank3b;adcyap1b
regulation of cell-substrate adhesion (GO:0010810) 0.027701923 0.094129479 emilin2a
positive regulation of cell adhesion (GO:0045785) 0.030182647 0.094129479 emilin2a
neuropeptide signaling pathway (GO:0007218) 0.012682087 0.094129479 oprk1;adcyap1b
ribonucleoprotein complex assembly (GO:0022618) 0.028432217 0.094129479 isy1;lsm3
regulation of dendritic spine development (GO:0060998) 0.035125495 0.096966154 shank3b
glycerophospholipid metabolic process (GO:0006650) 0.035125495 0.096966154 zgc:77375
long-term synaptic potentiation (GO:0060291) 0.035125495 0.096966154 shank3b
regulation of neural precursor cell proliferation (GO:2000177) 0.037587648 0.102321931 adcyap1b
neuron projection morphogenesis (GO:0048812) 0.038176976 0.102502565 ntf3;shank3b
microtubule cytoskeleton organization involved in mitosis (GO:1902850) 0.04249349 0.112550326 nuf2
synapse assembly (GO:0007416) 0.044937209 0.115890697 shank3b

Table 6.6 continued.

Term Nominal P-value Adjusted P-value Genes related to GO term
chaperone mediated protein folding requiring cofactor (GO:0051085) 0.044937209 0.115890697 hspa14
thymus development (GO:0048538) 0.047374813 0.117537511 mcm2
photoreceptor cell differentiation (GO:0046530) 0.047374813 0.117537511 rdh5
'de novo' posttranslational protein folding (GO:0051084) 0.047374813 0.117537511 hspa14
dendrite morphogenesis (GO:0048813) 0.049806317 0.117614918 shank3b
negative regulation of neuron apoptotic process (GO:0043524) 0.049806317 0.117614918 ntf3
bone development (GO:0060348) 0.049806317 0.117614918 atp6v1h
phospholipid biosynthetic process (GO:0008654) 0.049806317 0.117614918 zgc:77375
negative regulation of cell death (GO:0060548) 0.052231736 0.117671498 ntf3
mRNA splice site selection (GO:0006376) 0.052231736 0.117671498 isy1
cellular response to heat (GO:0034605) 0.052231736 0.117671498 hspa14
regulation of postsynaptic membrane potential (GO:0060078) 0.052231736 0.117671498 adcyap1b
regulation of AMPA receptor activity (GO:2000311) 0.054651086 0.117710031 shank3b
posterior lateral line system development (GO:0048915) 0.054651086 0.117710031 cd9b
negative regulation of neuron death (GO:1901215) 0.054651086 0.117710031 ntf3
myosin filament assembly (GO:0031034) 0.054651086 0.117710031 myom1a
regulation of glutamate receptor signaling pathway (GO:1900449) 0.05706438 0.117732827 shank3b
regulation of neurotransmitter receptor activity (GO:0099601) 0.05706438 0.117732827 shank3b
regulation of protein kinase activity (GO:0045859) 0.05706438 0.117732827 adcyap1b
cardiac muscle fiber development (GO:0048739) 0.05706438 0.117732827 myom1a
negative regulation of hydrolase activity (GO:0051346) 0.059471634 0.121421254 adcyap1b
skeletal muscle myosin thick filament assembly (GO:0030241) 0.061872863 0.123745726 myom1a
skeletal muscle thin filament assembly (GO:0030240) 0.061872863 0.123745726 myom1a
striated muscle myosin thick filament assembly (GO:0071688) 0.064268081 0.127237818 myom1a
positive regulation of neuron projection development (GO:0010976) 0.066657304 0.130648315 adcyap1b

Table 6.6 continued.

Term Nominal P-value Adjusted P-value Genes related to GO term
negative regulation of cell cycle (GO:0045786) 0.069040545 0.131378125 adcyap1b
positive regulation of neuron differentiation (GO:0045666) 0.069040545 0.131378125 adcyap1b
brain morphogenesis (GO:0048854) 0.069040545 0.131378125 shank3b
regulation of neuron projection development (GO:0010975) 0.07141782 0.133313264 adcyap1b
cardiac muscle tissue morphogenesis (GO:0055008) 0.07141782 0.133313264 myom1a
neuromuscular junction development (GO:0007528) 0.073789143 0.136440302 cacnb2b
muscle tissue morphogenesis (GO:0060415) 0.076154529 0.138206367 myom1a
protein complex subunit organization (GO:0071822) 0.076154529 0.138206367 nuf2
regulation of neuron apoptotic process (GO:0043523) 0.078513992 0.141181123 ntf3
cytoplasmic translation (GO:0002181) 0.080867547 0.141518207 hspa14
mitotic spindle organization (GO:0007052) 0.080867547 0.141518207 nuf2
skeletal myofibril assembly (GO:0014866) 0.080867547 0.141518207 myom1a
peripheral nervous system development (GO:0007422) 0.083215208 0.141827659 ntf3
cardiac myofibril assembly (GO:0055003) 0.083215208 0.141827659 myom1a
gland development (GO:0048732) 0.083086112 0.141827659 mcm2;adcyap1b
MAPK cascade (GO:0000165) 0.08555699 0.144561811 raf1b
plasma membrane bounded cell projection morphogenesis (GO:0120039) 0.092547206 0.154947173 ntf3
negative regulation of cell proliferation (GO:0008285) 0.094865616 0.154947173 adcyap1b
positive regulation of hydrolase activity (GO:0051345) 0.094865616 0.154947173 adcyap1b
eye morphogenesis (GO:0048592) 0.094865616 0.154947173 rdh5
RNA processing (GO:0006396) 0.097178218 0.156122383 tsen2
striated muscle contraction (GO:0006941) 0.097178218 0.156122383 myom1a
eye photoreceptor cell differentiation (GO:0001754) 0.099485027 0.158339132 rdh5
endocrine system development (GO:0035270) 0.101786058 0.158339132 adcyap1b
nervous system development (GO:0007399) 0.100188404 0.158339132 ntf3;shank3b

Table 6.6 continued.

Term Nominal P-value Adjusted P-value Genes related to GO term
brain development (GO:0007420) 0.101789442 0.158339132 shank3b;adcyap1b
regulation of neuron differentiation (GO:0045664) 0.108654617 0.166377382 ntf3
cardiac muscle cell development (GO:0055013) 0.108654617 0.166377382 myom1a
lateral line development (GO:0048882) 0.110932673 0.168548868 cd9b
cAMP-mediated signaling (GO:0019933) 0.115471675 0.174095757 adcyap1b
posterior lateral line development (GO:0048916) 0.117732649 0.174815145 cd9b
cardiac muscle tissue development (GO:0048738) 0.117732649 0.174815145 myom1a
sarcomere organization (GO:0045214) 0.122237611 0.180139637 myom1a
enzyme linked receptor protein signaling pathway (GO:0007167) 0.1289528 0.187220361 ntf3
regulation of cell communication (GO:0010646) 0.1289528 0.187220361 ntf3
muscle contraction (GO:0006936) 0.131179984 0.187673553 myom1a
regulation of GTPase activity (GO:0043087) 0.131179984 0.187673553 adcyap1b
positive regulation of signal transduction (GO:0009967) 0.135617615 0.192616323 shank3b
positive regulation of protein kinase activity (GO:0045860) 0.150975027 0.21288518 adcyap1b
plasma membrane bounded cell projection organization (GO:0120036) 0.153146991 0.21288518 adcyap1b
double-strand break repair (GO:0006302) 0.153146991 0.21288518 dclre1c
RNA splicing, via transesterification reactions with bulged adenosine as nucleophile (GO:0000377) 0.15747459 0.215839298 lsm3
negative regulation of apoptotic process (GO:0043066) 0.1571534 0.215839298 ntf3;adcyap1b
modulation of chemical synaptic transmission (GO:0050804) 0.161780505 0.220201243 ntf3
positive regulation of GTPase activity (GO:0043547) 0.17032771 0.230236077 adcyap1b
regulation of neurogenesis (GO:0050767) 0.172451125 0.231509729 ntf3
positive regulation of kinase activity (GO:0033674) 0.174569211 0.232758948 adcyap1b
camera-type eye development (GO:0043010) 0.176681983 0.233984247 adcyap1b
adenylate cyclase-activating G-protein coupled receptor signaling pathway (GO:0007189) 0.182988536 0.240709753 adcyap1b
ER to Golgi vesicle-mediated transport (GO:0006888) 0.189247723 0.247283691 rab43

Table 6.6 continued.

Term Nominal P-value Adjusted P-value Genes related to GO term
Rab protein signal transduction (GO:0032482) 0.191323652 0.248340635 rab43
negative regulation of cellular process (GO:0048523) 0.199575384 0.25074856 adcyap1b
mRNA processing (GO:0006397) 0.197520224 0.25074856 lsm3
muscle fiber development (GO:0048747) 0.197520224 0.25074856 myom1a
positive regulation of cytosolic calcium ion concentration (GO:0007204) 0.199575384 0.25074856 adcyap1b
endosomal transport (GO:0016197) 0.197520224 0.25074856 bloc1s1
myofibril assembly (GO:0030239) 0.203670237 0.254263481 myom1a
positive regulation of cell proliferation (GO:0008284) 0.20774455 0.257708429 adcyap1b
regulation of cytosolic calcium ion concentration (GO:0051480) 0.213817728 0.263574054 adcyap1b
actomyosin structure organization (GO:0031032) 0.221844347 0.271759325 myom1a
regulation of programmed cell death (GO:0043067) 0.223838415 0.272498939 ntf3
positive regulation of intracellular signal transduction (GO:1902533) 0.227811533 0.274627799 adcyap1b
translation (GO:0006412) 0.229790608 0.274627799 hspa14
heart morphogenesis (GO:0003007) 0.229790608 0.274627799 myom1a
positive regulation of cellular process (GO:0048522) 0.231764709 0.275308382 adcyap1b
positive regulation of ERK1 and ERK2 cascade (GO:0070374) 0.235698043 0.278294075 adcyap1b
mRNA splicing, via spliceosome (GO:0000398) 0.24350557 0.285790969 lsm3
positive regulation of MAPK cascade (GO:0043410) 0.247379958 0.2869022 adcyap1b
regulation of ERK1 and ERK2 cascade (GO:0070372) 0.247379958 0.2869022 adcyap1b
actin filament organization (GO:0007015) 0.25888677 0.295010506 myom1a
adenylate cyclase-modulating G-protein coupled receptor signaling pathway (GO:0007188) 0.25888677 0.295010506 adcyap1b
eye development (GO:0001654) 0.25888677 0.295010506 adcyap1b
Golgi vesicle transport (GO:0048193) 0.273961246 0.310383839 rab43
transmembrane receptor protein tyrosine kinase signaling pathway (GO:0007169) 0.279536325 0.314879998 ntf3
neuron development (GO:0048666) 0.292382134 0.32746799 adcyap1b

Table 6.6 continued.

Term Nominal P-value Adjusted P-value Genes related to GO term
skeletal system development (GO:0001501) 0.296010938 0.329648544 atp6v1h
DNA repair (GO:0006281) 0.303213896 0.335762281 dclre1c
positive regulation of protein phosphorylation (GO:0001934) 0.322651537 0.355279221 adcyap1b
G-protein coupled receptor signaling pathway, coupled to cyclic nucleotide second messenger (GO:0007187) 0.326128085 0.357101143 oprk1
Ras protein signal transduction (GO:0007265) 0.331310129 0.360759918 rab43
anterograde trans-synaptic signaling (GO:0098916) 0.33986018 0.368025388 cacnb2b
regulation of signal transduction (GO:0009966) 0.373001417 0.401693834 adcyap1b
regulation of cell proliferation (GO:0042127) 0.384210983 0.411504659 adcyap1b
chemical synaptic transmission (GO:0007268) 0.393662619 0.419336268 cacnb2b
circulatory system development (GO:0072359) 0.419676596 0.444630339 nuf2
negative regulation of programmed cell death (GO:0043069) 0.484704544 0.510763928 ntf3
intracellular protein transport (GO:0006886) 0.488687964 0.512207706 rab43
protein transport (GO:0015031) 0.49132679 0.512234313 rab43
central nervous system development (GO:0007417) 0.495260011 0.513602975 adcyap1b
regulation of cell cycle (GO:0051726) 0.500457957 0.516261892 adcyap1b
cellular protein localization (GO:0034613) 0.50432194 0.517524085 rab43
organelle assembly (GO:0070925) 0.546117118 0.557494558 lsm3
heart development (GO:0007507) 0.551964568 0.560544328 nuf2
positive regulation of transcription from RNA polymerase II promoter (GO:0045944) 0.574624711 0.580548678 adcyap1b
positive regulation of transcription, DNA-templated (GO:0045893) 0.646353696 0.64966833 adcyap1b
regulation of transcription from RNA polymerase II promoter (GO:0006357) 0.868151581 0.868151581 adcyap1b

Table 6.7 Annotated genes located in windows found significant in mFst analyses.

Gene Gene description Chr1 Gene start (bp)
aasdhppt aminoadipate-semialdehyde dehydrogenase-phosphopantetheinyl transferase [Source:ZFIN;Acc:ZDB-GENE-050913-36] 1 31,410,418
msantd4 Myb/SANT DNA binding domain containing 4 with coiled-coils [Source:NCBI gene;Acc:105024078] 1 31,422,396
gria4a glutamate ionotropic receptor AMPA type subunit 4 [Source:NCBI gene;Acc:105024075] 1 31,441,950
pdgfd platelet derived growth factor d [Source:ZFIN;Acc:ZDB-GENE-071217-1] 1 31,621,540
robo2 roundabout guidance receptor 2 [Source:NCBI gene;Acc:105023564] 1 30,271,102
trpc2a transient receptor potential cation channel subfamily C member 2a [Source:ZFIN;Acc:ZDB-GENE-130530-602] 1 30,674,159
stim1a stromal interaction molecule 1 [Source:NCBI gene;Acc:105023558] 1 30,693,269
alkbh8 alkB homolog 8, tRNA methyltransferase [Source:NCBI gene;Acc:105024079] 1 31,296,563
cwf19l2 CWF19 like cell cycle control factor 2 [Source:NCBI gene;Acc:619271] 1 31,312,426
GUCY1A2 guanylate cyclase 1 soluble subunit alpha 2 [Source:NCBI gene;Acc:105024076] 1 31,350,967
tollip toll interacting protein [Source:NCBI gene;Acc:336876] 2 20,100,750
cmc2 C-X9-C motif containing 2 [Source:NCBI gene;Acc:105027288] 2 20,734,186
cenpn centromere protein N [Source:NCBI gene;Acc:105027286] 2 20,748,140
atmin ATM interactor [Source:NCBI gene;Acc:105027285] 2 20,754,012
CIBAR2 CBY1 interacting BAR domain containing 2 [Source:NCBI gene;Acc:105027282] 2 20,796,768
gse1 Gse1 coiled-coil protein [Source:NCBI gene;Acc:105027281] 2 20,841,858
gins2 GINS complex subunit 2 [Source:NCBI gene;Acc:105027280] 2 21,113,299
emc8 ER membrane protein complex subunit 8 [Source:NCBI gene;Acc:105027279] 2 21,118,980
dusp22a dual specificity protein phosphatase 22-A [Source:NCBI gene;Acc:105027290] 2 21,126,076
irf8 interferon regulatory factor 8 [Source:ZFIN;Acc:ZDB-GENE-040718-367] 2 21,172,040
foxf1 forkhead box F1 [Source:NCBI gene;Acc:105015863] 2 21,334,976
mthfsd methenyltetrahydrofolate synthetase domain containing [Source:NCBI gene;Acc:105015873] 2 21,342,978
gas8 growth arrest specific 8 [Source:NCBI gene;Acc:105015885] 2 21,356,895
def8 differentially expressed in FDCP 8 homolog [Source:NCBI gene;Acc:105015917] 2 21,424,818
mc1r melanocortin 1 receptor [Source:NCBI gene;Acc:109616909] 2 21,456,352

Table 6.7 continued.

Gene Gene description Chr1 Gene start (bp)
tcf25 transcription factor 25 [Source:NCBI gene;Acc:105015951] 2 21,465,448
mvda mevalonate diphosphate decarboxylase [Source:NCBI gene;Acc:105015965] 2 21,495,912
pdcd5 programmed cell death 5 [Source:NCBI gene;Acc:105015973] 2 21,505,423
kif7 kinesin family member 7 [Source:NCBI gene;Acc:105016040] 2 21,709,669
gps1 G protein pathway suppressor 1 [Source:NCBI gene;Acc:105005746] 5 19,077,754
TNNT1 troponin T1, slow skeletal type [Source:HGNC Symbol;Acc:HGNC:11948] 5 26,397,181
si:ch211-200p22.4 si:ch211-200p22.4 [Source:ZFIN;Acc:ZDB-GENE-081104-61] 7 43,751,238
tpi1a triosephosphate isomerase A [Source:NCBI gene;Acc:105025848] 10 21,255,210
eno2 enolase 2 [Source:ZFIN;Acc:ZDB-GENE-040704-27] 10 21,262,997
iffo1a intermediate filament family orphan 1 [Source:NCBI gene;Acc:105025856] 10 21,416,314
fam131ba protein FAM131B [Source:NCBI gene;Acc:105025860] 10 21,485,401
clcn1a chloride channel protein 1 [Source:NCBI gene;Acc:105025861] 10 21,511,159
styk1a tyrosine-protein kinase STYK1-like [Source:NCBI gene;Acc:105025862] 10 21,541,620
si:dkey-14o18.2 neuronal pentraxin-1 [Source:NCBI gene;Acc:105025869] 10 21,593,358
cacng7a voltage-dependent calcium channel gamma-7 subunit [Source:NCBI gene;Acc:105025872] 10 21,661,003
grin2da glutamate receptor, ionotropic, N-methyl D-aspartate 2D, a [Source:ZFIN;Acc:ZDB-GENE-041008-124] 10 21,706,325
dbpa hepatic leukemia factor [Source:NCBI gene;Acc:105025873] 10 21,822,620
znf865 zinc finger protein 865 [Source:NCBI gene;Acc:105025874] 10 21,837,877
shisa7b protein shisa-7 [Source:NCBI gene;Acc:105025876] 10 21,855,866
ccdc106b coiled-coil domain-containing protein 106 [Source:NCBI gene;Acc:105025880] 10 21,913,339
rcvrn3 visinin [Source:NCBI gene;Acc:105025881] 10 21,923,471
zgc:194578 epsin-1 [Source:NCBI gene;Acc:105025884] 10 21,942,023
necap1 adaptin ear-binding coat-associated protein 1 [Source:NCBI gene;Acc:105025883] 10 21,952,669
si:ch211-171h4.3 serine/threonine-protein kinase SBK1-like [Source:NCBI gene;Acc:105025885] 10 21,966,407

Table 6.7 continued.

Gene Gene description Chr1 Gene start (bp)
zgc:91910 zinc finger protein 706 [Source:NCBI gene;Acc:105012773] 10 22,051,463
ndufs5 NADH:ubiquinone oxidoreductase subunit S5 [Source:ZFIN;Acc:ZDB-GENE-050522-437] 10 22,269,042
rnf19b ring finger protein 19B [Source:NCBI gene;Acc:105012769] 10 22,273,424
ak2 adenylate kinase 2 [Source:NCBI gene;Acc:105012768] 10 22,303,997
erf ETS domain-containing transcription factor ERF [Source:NCBI gene;Acc:105029756] 10 20,946,197
cica protein capicua homolog [Source:NCBI gene;Acc:105029757] 10 20,976,101
si:ch211-264f5.6 carcinoembryonic antigen-related cell adhesion molecule 5 [Source:NCBI gene;Acc:105029762] 10 21,179,002
zgc:174904 CD209 antigen-like protein C [Source:NCBI gene;Acc:105029763] 10 21,216,535
gnb3b guanine nucleotide binding protein (G protein), beta polypeptide 3b [Source:ZFIN;Acc:ZDB-GENE-040426-2280] 10 21,247,761
pih1d1 Esox lucius PIH1 domain containing 1 (pih1d1), mRNA. [Source:RefSeq mRNA;Acc:NM_001303772] 11 47,167,227
rac3a ras-related C3 botulinum toxin substrate 3 [Source:NCBI gene;Acc:105031329] 11 45,969,983
gps1 COP9 signalosome complex subunit 1 [Source:NCBI gene;Acc:105031328] 11 45,989,763
dus1l dihydrouridine synthase 1 like [Source:NCBI gene;Acc:105031327] 11 46,034,488
engase endo-beta-N-acetylglucosaminidase [Source:NCBI gene;Acc:561239] 11 46,124,538
scpep1 serine carboxypeptidase 1 [Source:ZFIN;Acc:ZDB-GENE-040426-890] 11 46,142,901
coil coilin [Source:NCBI gene;Acc:105031332] 11 46,149,330
rab11fip4a RAB11 family interacting protein 4 (class II) a [Source:NCBI gene;Acc:436806] 11 46,155,332
suz12a SUZ12 polycomb repressive complex 2 subunit a [Source:NCBI gene;Acc:794171] 11 46,270,894
crlf3 cytokine receptor like factor 3 [Source:NCBI gene;Acc:105031640] 11 46,292,949
zgc:193811 uncharacterized LOC105031633 [Source:NCBI gene;Acc:105031633] 11 46,354,404
kcnc3a potassium voltage-gated channel subfamily C member 1 [Source:NCBI gene;Acc:105031448] 11 46,839,157
aldh16a1 aldehyde dehydrogenase 16 family member A1 [Source:NCBI gene;Acc:105031445] 11 46,972,616
syt5a synaptotagmin Va [Source:ZFIN;Acc:ZDB-GENE-040718-110] 11 47,084,496
tnnt1 troponin T, slow skeletal muscle-like [Source:NCBI gene;Acc:114828652] 11 47,140,508

Table 6.7 continued.

Gene Gene description Chr1 Gene start (bp)
tnnt1 troponin T type 1 (skeletal, slow) [Source:ZFIN;Acc:ZDB-GENE-080723-27] 11 47,154,905
zmat4a zinc finger matrin-type 4 [Source:NCBI gene;Acc:105029468] 13 14,595,518
ppp3cca serine/threonine-protein phosphatase 2B catalytic subunit gamma isoform [Source:NCBI gene;Acc:105006198] 13 14,811,278
SUSD2 sushi domain-containing protein 2 [Source:NCBI gene;Acc:105014058] 13 32,468,508
si:dkey-174n20.1 retinol dehydrogenase 11 [Source:NCBI gene;Acc:105014061] 13 32,509,350
adam28 zinc metalloproteinase-disintegrin-like brevilysin H2a [Source:NCBI gene;Acc:105006197] 13 14,849,862
abcb9 ATP binding cassette subfamily B member 9 [Source:NCBI gene;Acc:105029380] 13 14,939,794
zgc:113436 zgc:113436 [Source:ZFIN;Acc:ZDB-GENE-050220-6] 13 14,947,738
ogfod2 2-oxoglutarate and iron dependent oxygenase domain containing 2 [Source:NCBI gene;Acc:105029381] 13 14,951,752
zgc:110329 tetraspanin-15-like [Source:NCBI gene;Acc:105029383] 13 14,997,626
ncaph non-SMC condensin I complex subunit H [Source:NCBI gene;Acc:105029386] 13 15,096,287
vamp8 vesicle-associated membrane protein 8 [Source:NCBI gene;Acc:105029388] 13 15,109,089
vamp5 vesicle-associated membrane protein 5 [Source:NCBI gene;Acc:105029389] 13 15,119,770
rnf181 ring finger protein 181 [Source:NCBI gene;Acc:105029392] 13 15,137,943
tmem150aa transmembrane protein 150A [Source:NCBI gene;Acc:105029390] 13 15,144,739
si:dkey-13n15.2 si:dkey-13n15.2 [Source:ZFIN;Acc:ZDB-GENE-060526-209] 13 32,760,976
hs3st1l2 heparan sulfate glucosamine 3-O-sulfotransferase 1 [Source:NCBI gene;Acc:105007233] 13 32,832,305
ckap2l cytoskeleton associated protein 2-like [Source:ZFIN;Acc:ZDB-GENE-030131-6690] 13 32,909,468
TMEM167A protein kish-A [Source:NCBI gene;Acc:105027580] 13 4,773,592
suds3 SDS3 homolog, SIN3A corepressor complex component [Source:ZFIN;Acc:ZDB-GENE-040801-236] 13 31,881,889
srrm4 serine/arginine repetitive matrix 4 [Source:NCBI gene;Acc:105014052] 13 32,131,527
hspb8 heat shock protein family B (small) member 8 [Source:NCBI gene;Acc:105014053] 13 32,194,117
si:dkey-1k23.3 heat shock protein 67B1-like [Source:NCBI gene;Acc:105014063] 13 32,203,021
xbp1 X-box binding protein 1 [Source:ZFIN;Acc:ZDB-GENE-011210-2] 13 32,261,630

Table 6.7 continued.

Gene Gene description Chr1 Gene start (bp)
znrf3 zinc and ring finger 3 [Source:NCBI gene;Acc:105014055] 13 32,279,764
coq5 coenzyme Q5, methyltransferase [Source:NCBI gene;Acc:447802] 13 32,357,816
kremen1 kringle containing transmembrane protein 1 [Source:NCBI gene;Acc:105014057] 13 32,378,796
nsmfb NMDA receptor synaptonuclear signaling and neuronal migration factor b [Source:NCBI gene;Acc:569891] 13 13,636,024
tnc tenascin C [Source:ZFIN;Acc:ZDB-GENE-980526-104] 13 13,721,844
rabl6b RAB, member RAS oncogene family-like 6b [Source:ZFIN;Acc:ZDB-GENE-081104-97] 13 13,855,474
imp4 IMP U3 small nucleolar ribonucleoprotein 4 [Source:NCBI gene;Acc:105014390] 13 13,872,667
araf A-Raf proto-oncogene, serine/threonine kinase [Source:NCBI gene;Acc:105014388] 13 13,877,695
cd53 Esox lucius CD53 molecule (cd53), mRNA. [Source:RefSeq mRNA;Acc:NM_001303920] 13 13,891,248
lman2lb Esox lucius VIP36-like protein (LOC105014385), mRNA. [Source:RefSeq mRNA;Acc:NM_001310946] 13 13,900,825
tbx3b T-box transcription factor 3b [Source:ZFIN;Acc:ZDB-GENE-060531-144] 13 13,922,180
tm2d2 TM2 domain containing 2 [Source:NCBI gene;Acc:105014384] 13 13,929,121
htra4 HtrA serine peptidase 4 [Source:NCBI gene;Acc:105014508] 13 13,936,558
unc5db unc-5 netrin receptor Db [Source:ZFIN;Acc:ZDB-GENE-060531-162] 13 14,177,692
hdr tumor necrosis factor receptor superfamily member 10B [Source:NCBI gene;Acc:105014382] 13 14,384,294
loxl2b lysyl oxidase-like 2b [Source:NCBI gene;Acc:791144] 13 14,451,358
sfrp1a secreted frizzled related protein 1 [Source:NCBI gene;Acc:105029461] 13 14,548,983
tpst2 tyrosylprotein sulfotransferase 2 [Source:ZFIN;Acc:ZDB-GENE-040426-1400] 14 12,679,273
npm2b nucleoplasmin [Source:NCBI gene;Acc:105015223] 14 12,726,278
grk3 G protein-coupled receptor kinase 3 [Source:ZFIN;Acc:ZDB-GENE-030616-382] 14 12,771,403
crybb1 crystallin beta B1 [Source:NCBI gene;Acc:105015219] 14 12,816,361
cryba4 crystallin beta A4 [Source:NCBI gene;Acc:105015220] 14 12,828,521
acads short-chain specific acyl-CoA dehydrogenase, mitochondrial [Source:NCBI gene;Acc:105015217] 14 12,839,465
cldn5b claudin-5 [Source:NCBI gene;Acc:105015216] 14 12,853,799

Table 6.7 continued.

Gene Gene description Chr1 Gene start (bp)
sept5b septin-5 [Source:NCBI gene;Acc:105015215] 14 12,865,111
mctp1b multiple C2 and transmembrane domain-containing protein 1 [Source:NCBI gene;Acc:105015213] 14 12,873,527
arrdc3b arrestin domain-containing protein 3-like [Source:NCBI gene;Acc:105015212] 14 12,919,031
cetn3 centrin 3 [Source:NCBI gene;Acc:105015211] 14 12,933,123
mef2ca myocyte-specific enhancer factor 2C [Source:NCBI gene;Acc:105015210] 14 12,943,150
edil3b EGF-like repeat and discoidin I-like domain-containing protein 3 [Source:NCBI gene;Acc:105015209] 14 12,970,813
hapln1b hyaluronan and proteoglycan link protein 1 [Source:NCBI gene;Acc:105015207] 14 12,992,946
xrcc4 DNA repair protein XRCC4 [Source:NCBI gene;Acc:105015205] 14 13,034,932
tmem167a transmembrane protein 167A [Source:NCBI gene;Acc:541340] 14 13,060,801
atg10 autophagy related 10 [Source:NCBI gene;Acc:105015204] 14 13,071,986
zcchc9 zinc finger CCHC-type containing 9 [Source:NCBI gene;Acc:105015203] 14 13,078,358
wdr45 WD repeat domain 45 [Source:NCBI gene;Acc:105015201] 14 13,167,481
stc1l stanniocalcin [Source:NCBI gene;Acc:105015195] 14 13,209,562
vamp8 Esox lucius vesicle-associated membrane protein 8 (vamp8), mRNA. [Source:RefSeq mRNA;Acc:NM_001303715] 14 28,463,532
kctd9b BTB/POZ domain-containing protein KCTD9-like [Source:NCBI gene;Acc:105015193] 14 13,254,875
march5l Esox lucius E3 ubiquitin-protein ligase MARCH5 (LOC105015192), mRNA. [Source:RefSeq mRNA;Acc:NM_001304007] 14 13,263,130
actr1 beta-centractin [Source:NCBI gene;Acc:105015191] 14 13,267,128
npy8ar neuropeptide Y receptor Y8a [Source:ZFIN;Acc:ZDB-GENE-990415-175] 14 13,288,026
dbnla drebrin-like a [Source:ZFIN;Acc:ZDB-GENE-040704-42] 14 13,356,953
dusp11 dual specificity phosphatase 11 [Source:NCBI gene;Acc:105015186] 14 13,393,664
gfra2b GDNF family receptor alpha-2 [Source:NCBI gene;Acc:105015273] 14 13,491,913
tia1 nucleolysin TIA-1 [Source:NCBI gene;Acc:105015180] 14 13,715,490
SUSD2 sushi domain-containing protein 2 [Source:NCBI gene;Acc:105014910] 14 24,771,485
dtwd2 DTW domain containing 2 [Source:NCBI gene;Acc:105015179] 14 13,727,131

Table 6.7 continued.

Gene Gene description Chr1 Gene start (bp)
dmgdh dimethylglycine dehydrogenase [Source:NCBI gene;Acc:105015173] 14 13,804,868
bhmt betaine--homocysteine S-methyltransferase 1 [Source:NCBI gene;Acc:105015174] 14 13,809,649
ARSB arylsulfatase B [Source:NCBI gene;Acc:105015172] 14 13,838,108
lhfpl2a LHFPL tetraspan subfamily member 2 [Source:NCBI gene;Acc:105015171] 14 13,895,189
scamp1 secretory carrier membrane protein 1 [Source:NCBI gene;Acc:105015170] 14 13,908,987
ap3b1a adaptor related protein complex 3 subunit beta 1 [Source:NCBI gene;Acc:105015169] 14 13,926,038
tbca tubulin cofactor a [Source:ZFIN;Acc:ZDB-GENE-040426-962] 14 14,011,610
otpa orthopedia homeobox [Source:NCBI gene;Acc:105015167] 14 14,033,649
wdr41 WD repeat domain 41 [Source:NCBI gene;Acc:105015166] 14 14,047,594
pde8b phosphodiesterase 8B [Source:NCBI gene;Acc:105015165] 14 14,065,950
aggf1 angiogenic factor with G-patch and FHA domains 1 [Source:NCBI gene;Acc:105015163] 14 14,116,920
lcat phosphatidylcholine-sterol acyltransferase [Source:NCBI gene;Acc:105030834] 16 12,828,542
myorg myogenesis regulating glycosidase (putative) [Source:NCBI gene;Acc:105017039] 17 15,954,746
cita citron rho-interacting serine/threonine kinase a [Source:ZFIN;Acc:ZDB-GENE-130530-981] 17 15,996,863
rplp0 ribosomal protein lateral stalk subunit P0 [Source:NCBI gene;Acc:105017046] 17 16,106,951
loxl2a lysyl oxidase homolog 2A [Source:NCBI gene;Acc:105017050] 17 16,131,599
tacc1 transforming acidic coiled-coil-containing protein 1 [Source:NCBI gene;Acc:105017051] 17 16,173,906
si:dkey-81j8.6 serine/threonine-protein kinase 10 [Source:NCBI gene;Acc:105017101] 17 16,234,112
tctn1 tectonic family member 1 [Source:NCBI gene;Acc:105017054] 17 16,252,679
hvcn1 hydrogen voltage gated channel 1 [Source:NCBI gene;Acc:105017056] 17 16,263,631
dnai1.2 dynein axonemal intermediate chain 1 [Source:NCBI gene;Acc:105017063] 17 16,399,694
zgc:109965 zgc:109965 [Source:ZFIN;Acc:ZDB-GENE-050913-21] 17 16,441,226
rem1 GTP-binding protein GEM [Source:NCBI gene;Acc:105017066] 17 16,458,735
pnp4a purine nucleoside phosphorylase [Source:NCBI gene;Acc:105017069] 17 16,483,744

Table 6.7 continued.

Gene Gene description Chr1 Gene start (bp)
sox12 transcription factor SOX-12 [Source:NCBI gene;Acc:105017070] 17 16,553,539
trib3 tribbles homolog 2 [Source:NCBI gene;Acc:105017071] 17 16,631,684
rbck1 RanBP-type and C3HC4-type zinc finger containing 1 [Source:ZFIN;Acc:ZDB-GENE-040704-3] 17 16,647,533
tbc1d20 TBC1 domain family member 20 [Source:NCBI gene;Acc:105017074] 17 16,660,325
elmo2 engulfment and cell motility 2 [Source:NCBI gene;Acc:105017075] 17 16,671,462
arfgap1 ADP ribosylation factor GTPase activating protein 1 [Source:NCBI gene;Acc:105017076] 17 16,696,223
sys1 SYS1 golgi trafficking protein [Source:NCBI gene;Acc:105017081] 17 16,744,395
dnajc14 DnaJ heat shock protein family (Hsp40) member C14 [Source:NCBI gene;Acc:105017084] 17 16,772,302
nab2 NGFI-A binding protein 2 [Source:NCBI gene;Acc:105017083] 17 16,786,923
dgkab diacylglycerol kinase, alpha b [Source:ZFIN;Acc:ZDB-GENE-121105-5] 17 16,872,368
cnpy2 Esox lucius canopy FGF signaling regulator 2 (cnpy2), mRNA. [Source:RefSeq mRNA;Acc:NM_001310848] 17 16,975,855
syt6b synaptotagmin-6 [Source:NCBI gene;Acc:105017096] 17 16,993,333
olfml3b olfactomedin like 3 [Source:NCBI gene;Acc:105017097] 17 17,053,849
gpr25 G protein-coupled receptor 25 [Source:ZFIN;Acc:ZDB-GENE-141216-12] 17 17,059,831
inavab innate immunity activator protein [Source:NCBI gene;Acc:105017099] 17 17,083,273
rnpep arginyl aminopeptidase [Source:NCBI gene;Acc:105016934] 17 17,116,928
adipor1a adiponectin receptor 1 [Source:NCBI gene;Acc:105016936] 17 17,137,465
rabif RAB interacting factor [Source:NCBI gene;Acc:105016938] 17 17,148,604
kdm5ba lysine demethylase 5B [Source:NCBI gene;Acc:105016939] 17 17,153,154
nelfcd negative elongation factor complex member C/D [Source:ZFIN;Acc:ZDB-GENE-040426-720] 17 17,180,428
ctsz cathepsin Z [Source:NCBI gene;Acc:105016941] 17 17,189,571
npepl1 aminopeptidase like 1 [Source:ZFIN;Acc:ZDB-GENE-050417-177] 17 17,199,992
stx16 syntaxin 16 [Source:NCBI gene;Acc:105016943] 17 17,212,982
EEF1AKMT3 EEF1A lysine methyltransferase 3 [Source:NCBI gene;Acc:105016945] 17 17,224,978

Table 6.7 continued.

Gene Gene description Chr1 Gene start (bp)
jph3 junctophilin 3 [Source:NCBI gene;Acc:105018516] 19 29,139,454
zcchc14 zinc finger, CCHC domain containing 14 [Source:ZFIN;Acc:ZDB-GENE-060503-319] 19 29,185,069
map1lc3b microtubule associated protein 1 light chain 3 beta [Source:NCBI gene;Acc:105018514] 19 29,218,410
det1 DET1 partner of COP1 E3 ubiquitin ligase [Source:NCBI gene;Acc:105018512] 19 29,268,048
ntrk3a neurotrophic receptor tyrosine kinase 3 [Source:NCBI gene;Acc:105018511] 19 29,443,030
sv2bb synaptic vesicle glycoprotein 2Bb [Source:ZFIN;Acc:ZDB-GENE-030131-2789] 19 30,057,585
lrp5 LDL receptor related protein 5 [Source:NCBI gene;Acc:105026612] 19 30,179,636
si:ch211-194m7.5 olfactomedin-4-like [Source:NCBI gene;Acc:105018536] 19 28,736,461
ndrg4 NDRG family member 4 [Source:NCBI gene;Acc:105018537] 19 28,742,122
gins3 GINS complex subunit 3 [Source:NCBI gene;Acc:105018534] 19 28,778,304
znrf1 zinc and ring finger 1 [Source:NCBI gene;Acc:105018533] 19 28,786,181
ldhd lactate dehydrogenase D [Source:NCBI gene;Acc:105018532] 19 28,831,333
snrkb SNF-related serine/threonine-protein kinase-like [Source:NCBI gene;Acc:105018531] 19 28,843,839
fa2h fatty acid 2-hydroxylase [Source:NCBI gene;Acc:105018529] 19 28,941,701
csnk2a2b casein kinase 2 alpha 2 [Source:NCBI gene;Acc:105018528] 19 28,982,909
ccdc113 coiled-coil domain containing 113 [Source:NCBI gene;Acc:450041] 19 28,994,241
znf319b zinc finger protein 319 [Source:NCBI gene;Acc:105018524] 19 29,004,131
usb1 U6 snRNA biogenesis 1 [Source:NCBI gene;Acc:445066] 19 29,008,033
pla2g15 phospholipase A2, group XV [Source:ZFIN;Acc:ZDB-GENE-030131-6948] 19 29,012,261
lcat lecithin-cholesterol acyltransferase [Source:NCBI gene;Acc:105018521] 19 29,023,833
ccdc135 coiled-coil domain containing 135 [Source:ZFIN;Acc:ZDB-GENE-120406-10] 19 29,037,870
uba2 ubiquitin-like modifier activating enzyme 2 [Source:NCBI gene;Acc:406672] 19 29,051,472
ca5a carbonic anhydrase 5A [Source:NCBI gene;Acc:105018519] 19 29,063,629
slc7a5 solute carrier family 7 member 5 [Source:NCBI gene;Acc:105018518] 19 29,084,358

Table 6.7 continued.

| Gene | Gene description | Chr¹ | Gene start (bp) |
|---|---|---|---|
| klhdc4 | kelch domain containing 4 [Source:NCBI gene;Acc:105018517] | 19 | 29,118,673 |
| ENO2 | gamma-enolase [Source:NCBI gene;Acc:105029963] | 20 | 19,856,249 |
| rps4x | Esox lucius ribosomal protein S4 X-linked (rps4x), mRNA. [Source:RefSeq mRNA;Acc:NM_001303937] | 24 | 24,953,275 |
| wrap53 | WD repeat containing antisense to TP53 [Source:NCBI gene;Acc:105029892] | 24 | 15,760,834 |
| COL4A5 | collagen type IV alpha 5 chain [Source:NCBI gene;Acc:105030863] | 24 | 24,527,008 |
| gc2 | retinal guanylyl cyclase 2 [Source:NCBI gene;Acc:105030861] | 24 | 24,662,726 |
| znf16l | zinc finger protein 16 like [Source:NCBI gene;Acc:570544] | 24 | 8,195,050 |
| sstr1b | somatostatin receptor 1b [Source:ZFIN;Acc:ZDB-GENE-120410-2] | 24 | 8,309,391 |
| cd248a | CD248 molecule, endosialin a [Source:ZFIN;Acc:ZDB-GENE-030131-2084] | 24 | 8,331,259 |
| peli3 | E3 ubiquitin-protein ligase pellino homolog 1 [Source:NCBI gene;Acc:105022127] | 24 | 8,374,597 |
| rce1a | Ras converting CAAX endopeptidase 1 [Source:NCBI gene;Acc:105022128] | 24 | 8,381,130 |
| eml3 | EMAP like 3 [Source:NCBI gene;Acc:105020715] | 24 | 8,414,974 |
| rasl11a | RAS like family 11 member A [Source:NCBI gene;Acc:105030860] | 24 | 24,711,603 |
| usp12b | ubiquitin carboxyl-terminal hydrolase 12 [Source:NCBI gene;Acc:105030858] | 24 | 24,718,343 |
| arhgap36 | rho GTPase-activating protein 6 [Source:NCBI gene;Acc:105030857] | 24 | 24,736,469 |
| nhsl2 | NHS like 2 [Source:NCBI gene;Acc:105028219] | 24 | 24,853,978 |
| hdac8 | histone deacetylase 8 [Source:NCBI gene;Acc:105028222] | 24 | 24,964,713 |
| dock11 | dedicator of cytokinesis 11 [Source:ZFIN;Acc:ZDB-GENE-060503-196] | 24 | 14,194,794 |
| vbp1 | VHL binding protein 1 [Source:NCBI gene;Acc:105028223] | 24 | 25,031,254 |
| zgc:162171 | ras-related protein Rab-38 [Source:NCBI gene;Acc:105028225] | 24 | 25,053,393 |
| si:ch211-200p22.4 | si:ch211-200p22.4 [Source:ZFIN;Acc:ZDB-GENE-081104-61] | 24 | 25,098,211 |
| si:dkey-172j4.3 | diacylglycerol kinase delta [Source:NCBI gene;Acc:105028228] | 24 | 25,167,330 |
| gpr185b | G-protein coupled receptor 12 [Source:NCBI gene;Acc:105007680] | 24 | 25,319,992 |
| zgc:109889 | zgc:109889 [Source:ZFIN;Acc:ZDB-GENE-050522-547] | 24 | 25,408,858 |

Table 6.7 continued.

| Gene | Gene description | Chr¹ | Gene start (bp) |
|---|---|---|---|
| htr2cl1 | 5-hydroxytryptamine receptor 2C [Source:NCBI gene;Acc:105007309] | 24 | 25,454,353 |
| hdac8 | phosphorylase b kinase regulatory subunit alpha, skeletal muscle isoform [Source:NCBI gene;Acc:105007262] | 24 | 26,395,602 |
| taf6l | TATA-box binding protein associated factor 6 like [Source:NCBI gene;Acc:105020720] | 24 | 8,486,156 |
| cth1 | mRNA decay activator protein ZFP36L2 [Source:NCBI gene;Acc:105020718] | 24 | 8,492,385 |
| mta2 | metastasis associated 1 family, member 2 [Source:ZFIN;Acc:ZDB-GENE-030131-4803] | 24 | 8,495,823 |
| fxr2 | fragile X mental retardation syndrome-related protein 1 [Source:NCBI gene;Acc:105020589] | 24 | 14,755,843 |
| ecsit | ECSIT signaling integrator [Source:NCBI gene;Acc:105020716] | 24 | 8,550,095 |
| si:dkey-106g10.7 | si:dkey-106g10.7 [Source:ZFIN;Acc:ZDB-GENE-160728-46] | 24 | 8,566,692 |
| men1 | menin 1 [Source:NCBI gene;Acc:105020713] | 24 | 8,596,192 |
| map4k2 | mitogen-activated protein kinase kinase kinase kinase 2 [Source:NCBI gene;Acc:105020711] | 24 | 8,602,587 |
| rbm4.2 | RNA-binding protein 4.1 [Source:NCBI gene;Acc:105020710] | 24 | 8,619,117 |
| rbm4.1 | RNA-binding protein 4.1-like [Source:NCBI gene;Acc:114828645] | 24 | 8,624,727 |
| sf1 | splicing factor 1 [Source:ZFIN;Acc:ZDB-GENE-030131-2492] | 24 | 8,628,962 |
| ALPK1 | Esox lucius ependymin-like (LOC105020702), mRNA. [Source:RefSeq mRNA;Acc:NM_001303716] | 24 | 8,666,402 |
| si:dkey-165a24.9 | si:dkey-165a24.9 [Source:ZFIN;Acc:ZDB-GENE-141209-1] | 24 | 14,789,470 |
| ugt5g1 | UDP glucuronosyltransferase 5 family, polypeptide G1 [Source:ZFIN;Acc:ZDB-GENE-080305-10] | 24 | 14,794,693 |
| cldn7a | claudin-7-A [Source:NCBI gene;Acc:105020582] | 24 | 14,856,508 |
| chrnb1l | acetylcholine receptor subunit beta [Source:NCBI gene;Acc:105020580] | 24 | 14,869,669 |
| chrnb1 | acetylcholine receptor subunit beta-like [Source:NCBI gene;Acc:105020605] | 24 | 14,901,074 |
| fgf11a | fibroblast growth factor 11 [Source:NCBI gene;Acc:105020579] | 24 | 14,936,411 |
| tmem102 | protein MB21D2 [Source:NCBI gene;Acc:105020578] | 24 | 15,046,074 |
| si:dkey-85k7.12 | si:dkey-85k7.12 [Source:ZFIN;Acc:ZDB-GENE-130530-855] | 24 | 15,268,690 |
| si:dkey-85k7.11 | si:dkey-85k7.11 [Source:ZFIN;Acc:ZDB-GENE-160728-145] | 24 | 15,671,460 |

Table 6.7 continued.

| Gene | Gene description | Chr¹ | Gene start (bp) |
|---|---|---|---|
| ufsp1 | ufm1-specific protease 1 [Source:NCBI gene;Acc:105029896] | 24 | 15,692,241 |
| epoa | erythropoietin [Source:NCBI gene;Acc:105029895] | 24 | 15,704,265 |
| pop7 | POP7 homolog, ribonuclease P/MRP subunit [Source:NCBI gene;Acc:105029893] | 24 | 15,714,788 |
| drap1 | DR1 associated protein 1 [Source:NCBI gene;Acc:105020695] | 24 | 9,547,508 |
| rela | putative transcription factor p65 homolog [Source:NCBI gene;Acc:105020694] | 24 | 9,555,721 |

¹ Chromosome

Table 6.8 List of enriched GO terms related to regions with high Fst scores.

| Term | Nominal P-value | Adjusted P-value | Genes related to GO term |
|---|---|---|---|
| negative regulation of developmental process (GO:0051093) | 0.00039515 | 0.232348323 | tcf7l1a;loxl2b;march5l;loxl2a |
| positive regulation of chondrocyte differentiation (GO:0032332) | 0.00369938 | 0.416845706 | loxl2b;loxl2a |
| negative regulation of multicellular organismal process (GO:0051241) | 0.004075782 | 0.416845706 | tcf7l1a;loxl2b;loxl2a |
| positive regulation of cartilage development (GO:0061036) | 0.004888354 | 0.416845706 | loxl2b;loxl2a |
| negative regulation of transcription from RNA polymerase II promoter (GO:0000122) | 0.006019018 | 0.416845706 | drap1;loxl2b;loxl2a;suds3;cica;mta2;tcf25;rela |
| lipoprotein biosynthetic process (GO:0042158) | 0.00622883 | 0.416845706 | wdr45;atg10 |
| negative regulation of transcription, DNA-templated (GO:0045892) | 0.007540683 | 0.416845706 | drap1;stc1l;men1;loxl2b;loxl2a;suds3;cica;mta2;tcf25;rela |
| peptidyl-lysine oxidation (GO:0018057) | 0.007716498 | 0.416845706 | loxl2b;loxl2a |
| regulation of epithelial to mesenchymal transition (GO:0010717) | 0.007716498 | 0.416845706 | loxl2b;loxl2a |
| hemangioblast cell differentiation (GO:0060217) | 0.007716498 | 0.416845706 | snrkb;aggf1 |
| peptidyl-lysine modification (GO:0018205) | 0.010043779 | 0.416845706 | loxl2b;uba2;loxl2a |
| regulation of chondrocyte differentiation (GO:0032330) | 0.011116611 | 0.416845706 | loxl2b;loxl2a |
| positive regulation of developmental growth (GO:0048639) | 0.013020871 | 0.416845706 | nsmfb;tnc |
| epithelial to mesenchymal transition (GO:0001837) | 0.013020871 | 0.416845706 | loxl2b;loxl2a |
| neuron projection fasciculation (GO:0106030) | 0.013020871 | 0.416845706 | robo2;tnc |
| negative regulation of Wnt signaling pathway (GO:0030178) | 0.01450015 | 0.416845706 | znrf3;sfrp1a;tollip;tcf7l1a |
| leukocyte differentiation (GO:0002521) | 0.015055951 | 0.416845706 | ak2;irf8 |
| mesodermal cell differentiation (GO:0048333) | 0.015055951 | 0.416845706 | snrkb;aggf1 |
| axonal fasciculation (GO:0007413) | 0.017217969 | 0.416845706 | robo2;tnc |
| epithelial cell migration (GO:0010631) | 0.019503121 | 0.416845706 | loxl2b;loxl2a |
| ncRNA processing (GO:0034470) | 0.020575678 | 0.416845706 | pop7;imp4;pih1d1 |
| endothelial cell migration (GO:0043542) | 0.021907683 | 0.416845706 | loxl2b;loxl2a |
| regulation of MAP kinase activity (GO:0043405) | 0.024428011 | 0.416845706 | trib3;pdgfd |
| response to alkaloid (GO:0043279) | 0.027060536 | 0.416845706 | chrnb1l;chrnb1 |

Table 6.8 continued.

| Term | Nominal P-value | Adjusted P-value | Genes related to GO term |
|---|---|---|---|
| response to nicotine (GO:0035094) | 0.027060536 | 0.416845706 | chrnb1l;chrnb1 |
| positive regulation of multicellular organismal process (GO:0051240) | 0.027060536 | 0.416845706 | loxl2b;loxl2a |
| negative regulation of cellular protein metabolic process (GO:0032269) | 0.027060536 | 0.416845706 | fxr2;cnot3b |
| negative regulation of cellular amide metabolic process (GO:0034249) | 0.029801762 | 0.416845706 | fxr2;cnot3b |
| transcription, DNA-templated (GO:0006351) | 0.032092075 | 0.416845706 | xbp1;mef2ca;taf6l |
| negative regulation of canonical Wnt signaling pathway (GO:0090090) | 0.032092075 | 0.416845706 | znrf3;sfrp1a;tcf7l1a |
| regulation of RNA splicing (GO:0043484) | 0.032648271 | 0.416845706 | srrm4;usb1 |
| mesenchymal cell differentiation (GO:0048762) | 0.032648271 | 0.416845706 | loxl2b;loxl2a |
| regulation of AMPA receptor activity (GO:2000311) | 0.035596713 | 0.416845706 | shisa7b;cacng7a |
| histone deacetylation (GO:0016575) | 0.035596713 | 0.416845706 | suds3;mta2 |
| synaptic transmission, cholinergic (GO:0007271) | 0.038643812 | 0.416845706 | chrnb1l;chrnb1 |
| regulation of glutamate receptor signaling pathway (GO:1900449) | 0.038643812 | 0.416845706 | shisa7b;cacng7a |
| regulation of neurotransmitter receptor activity (GO:0099601) | 0.038643812 | 0.416845706 | shisa7b;cacng7a |
| collagen fibril organization (GO:0030199) | 0.038643812 | 0.416845706 | loxl2b;loxl2a |
| canonical Wnt signaling pathway (GO:0060070) | 0.038875355 | 0.416845706 | sfrp1a;tcf7l1a;lrp5 |
| negative regulation of cellular macromolecule biosynthetic process (GO:2000113) | 0.039449967 | 0.416845706 | fxr2;stc1l;men1;loxl2b;cnot3b |
| cellular metal ion homeostasis (GO:0006875) | 0.040676133 | 0.416845706 | stim1a;stc1l;atp1a3a |
| protein deacetylation (GO:0006476) | 0.04178636 | 0.416845706 | suds3;mta2 |
| neuromuscular synaptic transmission (GO:0007274) | 0.04178636 | 0.416845706 | chrnb1l;chrnb1 |
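
As a reading aid for the two P-value columns in Table 6.8, the sketch below shows how a step-up false discovery rate adjustment turns nominal P-values into adjusted ones. It assumes a Benjamini-Hochberg style correction, which this table does not state was the method used. The function name `benjamini_hochberg` and the input values are hypothetical and are not taken from the table.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values in the original input order.

    Assumption: this is an illustrative correction only; the adjustment
    actually applied to Table 6.8 is not confirmed by the source.
    """
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical nominal P-values (not from Table 6.8).
nominal = [0.001, 0.02, 0.03, 0.04, 0.5]
adjusted = benjamini_hochberg(nominal)
for p, q in zip(nominal, adjusted):
    print(f"nominal={p:.3f}  adjusted={q:.3f}")
```

In this toy input, three different nominal values share one adjusted value because the adjustment enforces monotonicity. A plateau of that kind is one plausible reason why many rows of a table can carry the same adjusted P-value.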

Figure 6.10 A. Admixture analysis results for Muskellunge populations from Iowa and Canada with 12 assumed subpopulations. B. Admixture analysis results for Muskellunge populations from Iowa and Canada with 19 assumed subpopulations.

CHAPTER 7. GENERAL CONCLUSIONS

The manuscripts contained in this dissertation aim to showcase the multiple applications of genomic tools to tackle the main issues that agriculture will face in the near future. The first manuscript identified several SNPs linked to variation in the beta-carotene content of buffalo and cow milk. In this case, sequencing of candidate genes was coupled with quantitative analysis to identify markers that can be used in selection to improve not only productivity but also the nutritional value of animal products. In a broader context, this manuscript portrays the importance of employing modern technologies in conjunction with older methodologies, such as the candidate gene approach, to effectively tackle production issues.

Additionally, this manuscript showcases the impact that genetic tools can have when applied to health and food security issues by increasing the productivity of locally adapted breeds and the nutritional value of their products. Deploying the appropriate genotypes in the appropriate environment to maximize productivity is one of the biggest challenges the field of genetics has historically faced. This is shown by ongoing efforts such as the Functional Annotation of Animal Genomes (FAANG) and Agricultural Genome to Phenome Initiative (AG2PI) projects, which aim to understand how genomes dictate phenomes and thus allow the right genotype to be chosen to produce the appropriate phenotype in each environment.

The second manuscript represented one of the first efforts to identify the genetic basis of blood cell traits in beef cattle. The methodology in this manuscript represents a blueprint of the process used to select traits to be developed and included in breeding programs: measuring phenotypes, identifying genomic regions associated with phenotypic variation, estimating variance components in the population, and calculating correlations with other traits of interest. At the same time, this research is an example of how novel phenotypes need to be examined in order to assess their usefulness in modern breeding programs. Assessing and selecting new traits will gain importance in the very near future given the current push of phenomics and big data in the field of animal production. With the rapidly growing variety of traits that breeders can collect and evaluate in selection candidates, thoroughly studying the role that each trait plays in the performance of an individual will be vital to optimize resource usage and the profitability of agricultural operations. Moreover, the methodology used in this manuscript will play an important role in the future as developing countries change from low-input production systems to more industrialized production models.

A novel approach to estimating breed composition was explored in the third manuscript. This research also served as an example of the multiple applications that can be given to one analysis method: in this case Fst, a statistic mainly used to identify signatures of selection, was used to identify SNPs that would be useful to differentiate breeds. One of the main concerns for agricultural breeding programs is the loss of genetic diversity; therefore, accurate methods to identify the genetic background of individuals are of great importance to limit inbreeding and loss of genetic diversity when pedigrees are not available. Furthermore, the conservation of genetic resources in the form of heritage breeds will gain importance in the near future, as climate change and emerging pathogen challenges represent a big obstacle for livestock production and warrant the improvement of locally adapted genotypes that can express their full productive potential despite these challenges. Another important application of the methodology developed for this manuscript is the capacity to accurately trace animal products to their breed of origin. Improving this aspect of traceability can empower small producers that focus on niche products dependent on a specific breed to capture premium prices in the market. Finally, the manuscript also implemented a machine learning technique to estimate breed composition, showing the varied skillset needed by geneticists in the present day. Artificial intelligence and machine learning have proved to be very powerful and efficient at handling big data in varied disciplines. Therefore, the adoption of this technology will play a major role in breeding and genetics as high-throughput phenotyping and sequencing technologies advance.
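
To make the idea of using Fst to pick breed-informative SNPs concrete, the following minimal sketch ranks markers by a simple Wright-style Fst computed from per-breed allele frequencies. It is an illustration under stated assumptions, not the estimator or pipeline used in the third manuscript: breeds are weighted equally, SNPs are biallelic, and the function names (`wright_fst`, `top_informative_snps`) and all frequencies are hypothetical.

```python
from typing import Dict, List, Tuple


def wright_fst(freqs: List[float]) -> float:
    """Fst = (H_T - H_S) / H_T for one biallelic SNP.

    freqs holds the reference allele frequency in each breed.
    Assumption: breeds are weighted equally, ignoring sample sizes.
    """
    p_bar = sum(freqs) / len(freqs)
    # Expected heterozygosity if all breeds were pooled into one population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # SNP is fixed everywhere, so it carries no information
    # Mean expected heterozygosity within breeds.
    h_within = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


def top_informative_snps(
    snp_freqs: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k SNPs with the highest Fst, most differentiating first."""
    scored = [(snp, wright_fst(f)) for snp, f in snp_freqs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical allele frequencies in two breeds (illustrative values only).
example = {
    "snp_a": [0.95, 0.05],  # nearly fixed for opposite alleles
    "snp_b": [0.50, 0.55],  # almost identical in both breeds
    "snp_c": [0.80, 0.20],  # moderately differentiated
}
ranked = top_informative_snps(example, k=2)
for snp, fst in ranked:
    print(f"{snp}: Fst = {fst:.3f}")
```

SNPs whose allele frequencies diverge most between breeds receive the highest Fst and are retained, while markers with similar frequencies in every breed are dropped because they contribute little to telling breeds apart.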

The third and fourth manuscripts also illustrate the multitude of applications of genomics to areas other than breeding. In the case of the fourth manuscript, state-of-the-art genomic analyses were used to assess the genetic diversity of a species from an ecology-based point of view to improve population management strategies. This aspect is important since animal farming is often seen as limited to raising livestock. However, it can take many forms, one of them being the production of trophy fish for the sportfishing industry. Additionally, climate change and the expansion of the agricultural frontier have put pressure on wildlife populations. Thus, understanding the genetic make-up of different populations within a species plays a key role in successful reintroduction, population management and germplasm conservation.

In the 21st century, agriculture is facing a complex set of challenges that include pressure to increase production while minimizing resource input, climate change and a substantial increase in demand for animal products from developing countries. In addition, consumers are becoming more educated and critical about how food is produced and how it impacts the environment. This adds another level of complexity to increasing production, which needs to balance profitability with sustainability and social responsibility. Although not directly showcased in the findings of the manuscripts included, the varied list of topics addressed and methods implemented in this dissertation manifests the intricate interactions of disciplines like quantitative and molecular genetics, statistics, computer science, physiology, nutrition, husbandry, engineering and veterinary medicine that will be key to addressing the challenges currently faced by agriculture.

Genetic improvement of livestock populations provides the chance to improve the performance of these populations in a cumulative manner, and therefore geneticists are indispensable to securing and improving the production of agricultural products. In the near future, geneticists will play a key role in developing and deploying genotypes that will perform well in different environments. Additionally, from a molecular point of view, much work is needed to achieve the ultimate goal of fully understanding how the genome and environment interact to produce the phenotypes observed. Once this goal is reached, predictive biology will allow geneticists to maximize production and product quality regardless of the environmental challenges faced in production. Finally, animal breeding and genetics provide a valuable set of tools that will allow researchers to tackle challenges in many areas, including agriculture, medicine, ecology, population management, food safety and food security, and the topics discussed in this dissertation serve as a testament to some of these issues.