Development of Prediction Models of Methane Production by Sheep and Cows Using Rumen Microbiota Data

Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Boyang Zhang

Graduate Program in Animal Sciences

The Ohio State University

2018

Master's Examination Committee:

Dr. Zhongtang Yu, Advisor

Dr. Moraes, Co-advisor

Dr. Firkins

Copyrighted by

Boyang Zhang

2018

Abstract

Methane emission from the rumen causes a loss of approximately 10% of ingested energy at the conversion of digestible energy to metabolizable energy. Besides research on mitigating methane emission, much research interest has gravitated towards developing prediction models of methane emissions from livestock because global warming is reducing agricultural productivity (Johnson and Johnson, 1995). Accurate prediction of enteric methane production from cattle and sheep can help balance increased livestock production against its environmental impacts.

Methane is an inevitable byproduct of the microbial fermentation processes in the rumen, and certain ruminal microbes have direct impacts on methane production (Morgavi et al., 2010). Thus, we hypothesized that the inclusion of individual microbial groups as predictor variables could improve the robustness and accuracy of prediction models. However, inclusion of microbial variables in prediction models can result in overfitting. Machine-learning algorithms can automatically select the key explanatory predictors, and linear mixed models can provide a framework to predict random effects describing between-animal variation. We proposed three novel frameworks for subset selection of microbial variables (MV) and one framework for generalized linear mixed models (GLMM) using L1-penalized (GLMMLASSO) selection with cross-validation (CV) to address the overfitting problem and to develop parsimonious prediction models of methane production.

Methane emission can be expressed per animal per day, per unit of dry matter intake (DMI), or per unit of metabolic body weight (MBW) per day. Thus, we developed prediction models based on g CH4/animal/d (animal-based models), g CH4/kg DMI (DMI-based models), and g CH4/kg MBW/d (MBW-based models). Two datasets were used: one from a study that compared the rumen microbiota and methane production among sheep in New Zealand, and the other collated from two datasets (one generated using dairy cows in Finland and the other using steers in Australia). The cattle datasets were generated from studies that used different anti-methane feed additives to mitigate methane emission. Each dataset contained both animal data and relative sequence abundance (RSA) of genera of rumen microbes, including bacteria, methanogens, protozoa, and fungi. The relative abundance of a genus was expressed as the percentage of the sequences assigned to that genus over the total sequences of a marker gene (16S rRNA gene for bacteria and methanogens, 18S rRNA gene for protozoa, and ITS1 for fungi). Subset and GLMMLASSO selections of MV combined with CV were based on the minimal Bayesian information criterion (BIC). Linear mixed-effects models were built based on the MV selected. Cross-validation was used to identify the subset of MV that resulted in the lowest mean square prediction error (MSPE) for inclusion in the prediction models. We also compared our new models with traditional models that contained only DMI, acetate:propionate ratio, and BW.

From the sheep dataset, we developed 3 parsimonious models when the pool of potential variables was limited to ≤132, but parsimonious models were not developed when the pool exceeded 300 potential variables. Most importantly, GLMMLASSO successfully converged when the penalty parameter that controls the shrinkage, Lambda, was set between 0 and 1,000, selecting about 10 variables for all the models (animal-, DMI-, or MBW-based) when the pool of potential predictor variables was limited to between 132 and 310. The GLMMLASSO approach combined with CV identified the important variables to explain the effects of the 8 feed additives in the combined cow dataset. The random effect associated with the between-animal variance component was small and of similar magnitude to the random error variance component. The linear mixed-effects models based on GLMMLASSO selection of MV reduced the root mean square prediction error expressed as a percentage of mean methane emission (RMSPE %) by 2.3 percentage points and performed better than those based on forward-stepwise MV selection and the traditional models based only on animal variables. Log transformation of the microbial data generally improved model performance, probably due to a better monotonicity of the MV. This thesis research indicates that individual groups of rumen microbes can be included in methane prediction models to improve prediction of methane production from sheep and cows.

In conclusion, we built frameworks for model selection using MV. GLMMLASSO combined with CV, as well as forward-stepwise selection, allowed identification of significant MV that can be included in methane prediction models, which addressed the overfitting problem and improved prediction accuracy. GLMMLASSO selection coupled with CV is a useful method to extract a parsimonious and significant subset of MV from hundreds of MV that can improve the accuracy of methane prediction models. This is the first research to develop methane prediction models that contain rumen microbes as predictor variables.

Future research can focus on the exploration of the common microbial predictor variables in the cow and the sheep datasets to understand the contribution of the microbes to methane production in the rumen and to develop models with a more mechanistic nature.

Acknowledgments

I would like to thank Dr. Yu, Dr. Moraes and Dr. Firkins for their time and patience working with me on my thesis project. I would also like to thank the entire lab of my advisor Dr. Yu and my co-advisor Dr. Moraes for all their help. Finally, I would like to thank Dr. Peter Janssen for providing the data of sheep, Dr. Bayat Alireza for providing the dairy cow data, and Chris McSweeny for providing the steer data.

Vita

2016 ...... Bachelor of Agriculture, Northeast Agricultural University, China

2016 to present ...... Master of Science, Department of Animal Sciences, The Ohio State University

Fields of Study

Major Field: Animal Sciences

Table of Contents

Abstract
Acknowledgments
Vita
List of Tables
List of Figures
Chapter 1. Introduction
Chapter 2. Literature Review
  The Rumen Microbial Ecosystem and Methane Production from Ruminants
  The Rumen Microbiota, Feed Digestion and Fermentation
  Quantitative analysis of the rumen microbiota by qPCR and NGS
  Methane production in the rumen and hindgut and mitigation of emission
  Prediction Models of Methane Emission from Ruminants
  Statistical models
  Dynamic models
Chapter 3. Exploratory Data analysis and Statistical Models
  Abstract
  Introduction
  Methods and Materials
  Sheep dataset
  Cow data
  Results and Discussion
  EDA and PCA of Sheep Dataset
  EDA and PCA of Cow Dataset
Chapter 4. Variable Selections Using the Forward Stepwise Selection
  Abstract
  Material and Method
  Framework Development
  Variable Selections of Methane Prediction Models
  Results and Discussion
  Summary
Chapter 5. Variable Selections by GLMMLASSO and Model Development
  Abstract
  Introduction
  Method and Material
  Variable Selection
  Development of Methane Prediction Models
  Results and Discussion
  Variable Selection
  Model Selection
  Summary
Chapter 6. Discussion and Conclusion
  Discussion
  Conclusion
Reference

List of Tables

Table 2.1. Predominate Reactions of Methanogenesis in Rumen
Table 3.1. The top 10 taxa of microbes contributing the most to PC 1
Table 4.1. Sizes of variable selections by forward-stepwise selection based on Z test and BIC.
Table 4.2. For the sheep dataset, model 1 selected by the forward-stepwise selection.
Table 4.3. For the sheep dataset, model 2 selected by the forward-stepwise selection based on BIC.
Table 4.4. For the sheep dataset, model 3 selected by the forward-stepwise selection based on BIC.
Table 4.5. Details of Selected MV Taxa
Table 5.1. From the sheep dataset, discarding 4 potential variables reduced VIF to below 10 and 17 different combinations of variables resulted.
Table 5.2. Cross-validation of Variables Selections Made by GLMMLASSO of Three Responses in the Sheep Dataset.
Table 5.3. From the cow dataset, discarding 1 potential variable reduced VIF to below 10.
Table 5.4. Details of the Variables Selected of Three Responses in the Sheep Dataset.
Table 5.5. Two Variables Selections of Animal-based Response in the Combined Cow Dataset.
Table 5.6. T-value of the Best Selection of Animal-based Response in the Sheep Dataset.
Table 5.7. T-value of the Best Selection of DMI-based Response in the Sheep Dataset.
Table 5.8. T-value of the Best Selection of MBW-based Response in the Sheep Dataset.
Table 5.9. T-value of the Best Selection of Animal-based Response in the Cow Dataset.
Table 5.10. Cross-validations of Five Models for Each Response in the Sheep Dataset.
Table 5.11. Equations of the Best Model for the Sheep Methane Production.
Table 5.12. Cross-validations of Seven Models for Animal-based Response in the Cow Dataset.
Table 5.13. Equation of Animal-based Model Cow Dataset.

List of Figures

Figure 3.1. Plot of methane production (per kg DMI) and RSA of genera of sheep rumen microbes.
Figure 3.2. Trends of CH4 (gram) production (per sheep per day) with A:P ratio and the RSA of 2 taxa of rumen microbes.
Figure 3.3. Trends of CH4 (gram) production (per kg DMI) with A:P ratio and the RSA of Ruminococcaceae among different sheep breeds.
Figure 3.4. Percentages of variance explained by each principal component.
Figure 3.5. Graph of variables based on Dimension 1 and 2.
Figure 3.6. Relationship between methane (gram) production (per kg DMI) by cows and the RSA of Proteobacteria.
Figure 3.7. Plot of methane (gram) production (per kg DMI) and the RSA of Ruminococcus (top panel), and the linear relationship between methane production (per cow per day) and the RSA of (bottom panel) in cows.
Figure 3.8. Linear trend of methane (gram) production (per kg DMI) with RSA of Ruminococcaceae and acetate: propionate ratio in cows.
Figure 3.9. Trends of CH4 (gram) production (per kg DMI) with A:P ratio and RSA of Ruminococcaceae among different dietary treatments* in cows.
Figure 3.10. Trends of CH4 (gram) production (per kg DMI) with A:P ratio and RSA of Ruminococcus among different cows.
Figure 3.11. Percentages of variance explained by each principal component of the cow data.
Figure 3.12. Graph of variables of the cow data, based on Dimension 1 and 2.
Figure 4.1. An example of underfitted linear model of methane production by including only DMI as the sole variable.
Figure 4.2. An example of proper linear model of methane production by including DMI and Ruminococcaceae as variables.
Figure 4.3. An example of overfitted linear model of methane production by including DMI and 10 different types of bacteria and feed ingredients as variables.
Figure 4.4. The flowchart showing the no-tracker algorithm to automatically complete the best-subset selection based on LMMs.
Figure 4.5. The flowchart showing the tracker algorithm to automatically complete the best-subset selection based on LMMs.
Figure 4.6. The flowchart showing the algorithm for the forward-stepwise selection.
Figure 4.7. The change of BIC with forward-stepwise iterations
Figure 5.1. Flowchart showing the framework of the GLMMLASSO.
Figure 5.2. Flowchart showing the VIF reduction before GLMMLASSO selection.
Figure 5.3. Flowchart showing initializations before GLMMLASSO selection.
Figure 5.4. Regulations performed by the GLMMLASSO selection in sheep dataset according to methane production per sheep.
Figure 5.5. Observed methane production vs. predicted methane production using models 1 to 5
Figure 5.6. Observed methane production vs. predicted methane production using models 6 to 10
Figure 5.7. Observed methane production vs. predicted methane production using models 11 to 15
Figure 5.8. Observed methane production vs. predicted methane production using model 16 to 22
Figure 5.9. QQ Plot of the Residual of animal-based Models from the cow dataset – according to g methane per kg MBW per day

Chapter 1. Introduction

Ruminant emissions of methane have been estimated at 5 million tons per year globally (Moss et al., 2000). Methane emission from the rumen accounts for a loss of approximately 10% of the energy ingested by the ruminant, at the step where digestible energy is converted to metabolizable energy. From an animal production standpoint, the energy lost as methane could otherwise be utilized for animal growth and production (Johnson and Johnson, 1995). More importantly, methane is a very potent greenhouse gas (GHG) that significantly contributes to global warming, which threatens the environment and agricultural productivity (IPCC, 2014). For ruminants, methane is an inevitable byproduct of the microbial fermentation processes in the rumen (Martin et al., 2010). In the past decade, intensive research has been conducted to develop effective mitigation of methane emission, especially through dietary interventions, and varying success has been reported in the literature (Beauchemin et al., 2008).

Many researchers have also explored modeling of methane emission from ruminants (Ellis et al., 2007). Most models of methane production are predictive models that include only feed and animal performance data, such as dry matter intake (DMI), feed composition, body weight (BW), and average daily gain (ADG); rumen fermentation characteristics (including pH, VFA concentrations, and acetate:propionate ratio); and methane output, often expressed as the amount of methane per animal, per kg DMI, or per kg BW or BW^0.75 (Ramin and Huhtanen, 2013).
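The three ways of expressing methane output mentioned above can be illustrated with a minimal sketch. The function names and the example animal are hypothetical and introduced only for illustration; metabolic body weight is taken as BW raised to the power 0.75, as in the text.

```python
def metabolic_body_weight(bw_kg: float) -> float:
    """Metabolic body weight (MBW), taken here as BW ** 0.75."""
    return bw_kg ** 0.75


def methane_expressions(ch4_g_per_day: float, dmi_kg_per_day: float, bw_kg: float) -> dict:
    """Express one daily methane measurement on the three bases named in the text."""
    return {
        "g_per_animal_per_day": ch4_g_per_day,
        "g_per_kg_dmi": ch4_g_per_day / dmi_kg_per_day,
        "g_per_kg_mbw_per_day": ch4_g_per_day / metabolic_body_weight(bw_kg),
    }


# Hypothetical sheep (values are illustrative, not taken from the thesis data).
example = methane_expressions(ch4_g_per_day=24.0, dmi_kg_per_day=1.2, bw_kg=50.0)
for basis, value in example.items():
    print(f"{basis}: {value:.2f}")
```

The same measurement therefore yields three different response variables, which is why separate animal-based, DMI-based, and MBW-based models can be fitted to one dataset.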

To our knowledge, none of the methane prediction models contains any microbial data as a predictor variable. This is because quantitative analysis of the rumen microbes is time-consuming, expensive, and technically difficult for many ruminant nutritionists, and thus only a small number of studies have analyzed or measured the rumen microbiota. In most of the studies that have analyzed the rumen microbiota, only a few groups (mostly total bacteria, methanogens, protozoa, and known taxa of rumen bacteria) were analyzed (Martin et al., 2010). Furthermore, each study had a small number of animals and analyzed different taxa or groups of rumen microbes. If the data from multiple studies are collated into a larger dataset, few taxa or groups of rumen microbes are common across studies, making it impossible to include microbial predictor variables in any models.

Methanogens are the direct producers of methane in the rumen and other anoxic environments. Rumen bacteria, protozoa, and fungi can contribute to methane production indirectly through interactions (e.g., inter-species hydrogen transfer) with methanogens. Therefore, we hypothesized that some rumen microbes have mathematical relationships with methane output from the rumen and can be included as predictor variables in methane prediction models. Such models could improve prediction accuracy and model robustness.

Recent advancements in DNA-based technologies, such as next-generation sequencing (NGS), have made it affordable and technically practical to comprehensively analyze complex microbiota, including the rumen microbiota. Individual taxa (taxonomic groups) of microbes can be quantified using real-time PCR or semi-quantified using NGS technologies. Recent studies produced data from sheep and cows, and these data included animal performance data and microbial data of individual taxa of all cellular forms of rumen microbes (bacteria, methanogens, protozoa, and fungi). The objective of this research was to develop methane prediction models using both animal data and rumen microbial data.

Because hundreds of taxa of rumen microbes can be measured, overfitting can result from including too many microbial variables in prediction models. Thus, we also explored (i) GLMMLASSO, a machine-learning algorithm that automatically selects the key explanatory microbial predictors to include in parsimonious models without overfitting, and (ii) linear mixed models, which provide an appropriate framework for multivariate regression and methane prediction that takes variation among animals into consideration. The computational advantage of GLMMLASSO is that it adds an L1 penalty to the cost function. Ordinary linear regression algorithms fit parameters by minimizing a cost function, most commonly the residual sum of squares. The GLMMLASSO selection additionally shrinks the parameters through a weighted L1-norm penalty on the parameters. GLMMLASSO successfully converged when Lambda was set between 0 and 1,000 to select microbial variables. This research represents the first study to develop methane prediction models using microbial data as predictor variables.
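As a sketch of the idea described above, and not necessarily the exact objective used in this thesis, the residual sum of squares and its L1-penalized counterpart for a linear model can be written as follows. The symbols (observations y_i, predictors x_ij, coefficients β_j, sample size n, number of predictors p, and penalty parameter λ) are notation introduced here for illustration.

```latex
\begin{align}
\mathrm{RSS}(\beta) &= \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2} \\
\hat{\beta}^{\mathrm{lasso}} &= \arg\min_{\beta} \Bigl\{ \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \Bigr\},
\qquad \lambda \ge 0
\end{align}
```

With λ = 0 the penalty vanishes and the fit reduces to ordinary least squares. As λ increases, more coefficients are shrunk exactly to zero, which is what performs the variable selection. In a mixed-model setting, the residual sum of squares is typically replaced by the negative log-likelihood of the model, with the penalty applied to the fixed-effect coefficients.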

Chapter 2. Literature Review

The Rumen Microbial Ecosystem and Methane Production from Ruminants

The Rumen Microbiota, Feed Digestion and Fermentation:

The prediction of methane production from ruminants has attracted much research interest in recent years because methane emission from these animals leads to feed energy loss and contributes to global warming. Sheep and cattle, the most abundant domesticated ruminants, have developed a symbiotic system with microorganisms in their rumen that enables plant polysaccharides, mainly cellulose and starch, to be digested and to provide nutrients (e.g., volatile fatty acids (VFA) and microbial proteins) to host animals (Van Soest, 1994). The ruminal anaerobic microbes, primarily bacteria, ferment feed materials to nutrients that host animals can use. However, methane is an inevitable byproduct of the fermentation processes of the ruminal microbiome.

The rumen microbiota of ruminants is an enormously diverse community. The bacterial community in the rumen ecosystem can contain over 300 phylotypes (or species) (Yu and Morrison, 2006). This microbiota has co-evolved with its host animal for millions of years and is capable of vast metabolic functions in the rumen, which are essential for the growth and health of host animals (Morgavi et al., 2010). Conceptually, feed substances (such as dietary carbohydrates and proteins) ingested by ruminants are first microbially digested (hydrolyzed by microbial extracellular hydrolases) to monomers (i.e., sugars and amino acids), and the digestion products are then fermented to VFA (including acetate, propionate, and butyrate), CO2, and H2 as the major products. VFA are absorbed through the rumen wall into the bloodstream and used as energy or converted into proteins, sugars, and lipids (Hobson and Stewart, 2012). Hydrogen, either as dihydrogen (H2) or as metabolic hydrogen ([H]), forms as some of the carbons in the monomers are oxidized to CO2 during microbial fermentation. Being more thermodynamically favorable, methanogenesis is the major hydrogen disposal mechanism in the rumen, and enteric methane is a major fermentation gas. Methanogens, a unique group of archaea, can utilize two major types of substrates, CO2 and acetate, and most of those substrates can be generated during the fermentation stage (Liu and Whitman, 2008). In addition, methanogens form a symbiotic relationship with other microbes, especially rumen protozoa and fungi, most of which produce large quantities of H2 via their hydrogenosomes (Embley et al., 2003). Reactions of methanogenesis and the roles of the symbiotic relationship are summarized by Liu and Whitman (2008). More importantly, in the rumen, hydrogenotrophic methanogenesis is the predominant methanogenesis pathway, through which CO2 as the electron acceptor is reduced by H2 as the primary electron donor (Morgavi et al., 2010, Table 1). Evidently, although methanogens are the direct producers of methane, other members of the ruminal microbiota also contribute to and affect methane production (Mathison et al., 1998). Thus, it can be hypothesized that mathematical relationships exist between methane production by ruminants and certain members of their rumen microbiota.

Table 2.1. Predominate Reactions of Methanogenesis in Rumen

Reactions                        Methanogens

4H2 + CO2 ⟶ CH4 + 2H2O           Hydrogenotrophic methanogens
4HCOOH ⟶ CH4 + 3CO2 + 2H2O       Many hydrogenotrophic methanogens
4CH3OH ⟶ 3CH4 + CO2 + 2H2O       Methylotrophic methanogens

Notes. Modified from Liu and Whitman (2008).

Quantitative analysis of the rumen microbiota by qPCR and NGS:

Recently, next-generation sequencing (NGS) has become the primary technology to investigate the rumen microbiota, including its diversity, composition, and functions, and how these affect animal nutrition and health (Kim et al., 2017). Rumen cannulation can facilitate the collection of representative digesta contents from the rumen, although only a small number of animals can be cannulated. When a large number of animals need to be sampled, rumen content can be collected using oral tubing (Lodge-Ivey et al., 2009). Using both NGS and qPCR, groups or taxa of bacteria, protozoa, and methanogens can be quantified, either as absolute abundance (by qPCR) or as relative sequence abundance (RSA, expressed as % of the marker gene sequences assigned to a taxon over total sequences). These data are important variables or parameters, and some of them can be associated with methane production in the rumen (Meale et al., 2017).
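The definition of RSA given above can be made concrete with a short sketch. The function name, the genus list, and the sequence counts below are hypothetical and serve only to illustrate the calculation for a single marker gene in a single sample.

```python
def relative_sequence_abundance(counts: dict) -> dict:
    """Convert per-taxon sequence counts for one marker gene into RSA (% of total)."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("total sequence count must be positive")
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


# Hypothetical 16S rRNA gene sequence counts from one rumen sample.
sample_counts = {
    "Prevotella": 600,
    "Ruminococcus": 150,
    "Methanobrevibacter": 50,
    "Other": 200,
}

rsa = relative_sequence_abundance(sample_counts)
for taxon, percent in rsa.items():
    print(f"{taxon}: {percent:.1f}%")
```

Because RSA values within a sample always sum to 100%, an increase in one taxon necessarily lowers the RSA of the others, which is one reason RSA is regarded as semi-quantitative.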

To overcome the limitation of cultivation-based analysis (i.e., most of the rumen microbes cannot be cultured or analyzed), DNA-based analyses are used. Understandably, efficient extraction of representative DNA is the prerequisite of accurate analysis of the rumen microbiota. Several DNA extraction methods have been developed and used. For example, Yu and Morrison (2004) developed an efficient extraction approach based on repeated bead-beating and column purification, resulting in high DNA yield and quality. In the analysis of the rumen microbiota, a marker gene (16S rRNA gene for bacteria and archaea, 18S rRNA gene for protozoa, and ITS1 for fungi) is sequenced or quantified. These marker genes allow identification and quantification (or semi-quantification) of individual groups of microbes of the rumen microbiota. They are chosen because they are phylogenetically conserved and are not laterally transferred (Woo et al., 2008).

Both qPCR and NGS allow quantification of the marker genes. qPCR enables determination of absolute abundance (copies of marker genes per g or ml of sample), while NGS only determines relative sequence abundance (RSA, % of sequences assigned to a taxon over total sequences). qPCR can only determine the abundance of a small number of target microbes, and the analysis of qPCR data (numeric data only) is simple. Total bacteria, total archaea, total fungi, and total protozoa have been quantified (McSweeney et al., 2007). Specific genera of rumen bacteria, such as Ruminococcus and Prevotella, have also been quantified. In some studies (Ellis et al., 2007), the mathematical correlation between individual groups of rumen microbes and methane production has been explored, but the conclusions are often inconsistent or conflicting. In the past five years, NGS of amplicons of marker genes has become the primary approach to analyze the rumen microbiota in an untargeted manner, potentially identifying and semi-quantifying all the major microbes in the rumen. High-throughput sequence data are typically processed and analyzed with QIIME, an open-source software pipeline (Caporaso et al., 2010). Sequences are usually clustered into OTUs (operational taxonomic units), which are treated as roughly equivalent to species. OTUs can be assigned to a taxon by sequence comparison with databases such as RDP, SILVA, and Greengenes. Taxonomic assignments and RSA data are used to characterize the rumen microbiota and to correlate it with other animal data, including methane production. The absolute abundance determined by qPCR is more accurate and reliable than the RSA determined by NGS (Kim et al., 2017).

Methane production in the rumen and hindgut and mitigation of emission:

Methane is a greenhouse gas (GHG) with a global warming potential 25 times that of carbon dioxide (IPCC, 2007). The concentration of CH4 in the atmosphere has increased by 160% over the past 200 years, and the IPCC (Intergovernmental Panel on Climate Change) has projected negative effects of global warming on agriculture, such as reduced productivity (IPCC, 2014). However, with the growing population, food demand in the next 30 years is anticipated to increase by over 60% compared to that of 2006 (FAO, 2016). The growing methane concentration in the atmosphere increases the risk to and the vulnerability of agriculture. Thus, collective international efforts, such as the Paris Agreement, have been made to develop and implement targets and actions to reduce GHG emissions across the world (Rogelj et al., 2016).

Globally, most attention has been focused on methane emissions from livestock, especially ruminants, because enteric methane production from ruminants is the largest anthropogenic source of methane emission (Gerber et al., 2013). Enteric fermentation in livestock accounts for the largest share (40%) of direct methane emissions in the agriculture sector and the second largest share of overall emissions, representing 21% of the anthropogenic methane emitted in the United States (EPA, 2011). The methane output from the livestock sector has been estimated at 5 million tons per year globally (Moss et al., 2000). Production of ruminant meat and milk contributes most of the GHG emissions among the livestock commodities. Numerous technologies have been explored to mitigate methane emission from ruminants, but the results are mixed. Based on published equations, Moraes et al. (2012) pointed out that enteric methane mitigation using dietary manipulations can have large financial consequences on dairy farms. To mitigate methane emissions through practical management, the amount of methane produced by ruminants needs to be accurately quantified; the factors that contribute to or affect the emissions need to be identified and understood; and inventories of methane emission need to be established.

The development of methane emission inventories (e.g., those of the IPCC) and the evaluation of the effects of methane mitigation on animal performance and farm profitability rely on the measurement or quantification of methane production from ruminant animals. Many methods for measuring methane emission have been used. Respiration chambers have been used to measure the energy lost as methane from ruminants for over 100 years (Hammond et al., 2016). Using chamber systems, the airflow rate and the concentrations of methane are determined, and the methane yields are then calculated from the airflow rate and the methane concentrations in the airflow. Although the chamber system can accurately quantify methane emission from animals, standard chambers are expensive to build, and only a very small number of animals can be used practically. In addition, after being placed in individual chambers, the animals often change their behavior, including decreasing feed intake. Another widely applied measurement technique is the sulphur hexafluoride (SF6) tracer technique (Zimmerman, 1993). The relatively large size of the tracer technique equipment can affect animal behavior (Storm et al., 2012). The GreenFeed measurement system is a relatively new method that measures metabolic gas fluxes from an animal, and methane can be assessed by head position sensors in combination with decision rules. Importantly, measurement of enteric CH4 can be complex, expensive, and impractical at regional scales (Niu et al., 2018). Overall, it is impossible to measure methane emission from every ruminant because of limited equipment, time, and funding. Therefore, prediction models are needed to establish national and local inventories of methane emission and to assess the outcome of methane mitigation interventions. The prediction models can also be useful in guiding mitigation strategies and regulatory policies. Because methanogens are the direct producers of methane and other microbes can directly or indirectly affect methanogens, rumen microbes can have mathematical relationships with methane production from ruminants. Thus, we hypothesized that information describing rumen microbes should be included in prediction models as predictor variables of methane emission. It is also useful from a biological standpoint to understand and quantify the roles of microbes by including rumen microbes as predictor variables in methane prediction models.
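The respiration-chamber calculation described earlier in this section (methane output derived from the airflow rate and the methane concentrations in the airflow) can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of any particular chamber system: inlet and outlet airflow are assumed equal, concentrations are in ppm by volume, and gas volumes are assumed to be at 0 °C and 1 atm so that the ideal-gas molar volume of 22.414 L/mol applies. The function name and input values are hypothetical.

```python
MOLAR_VOLUME_L_PER_MOL = 22.414    # ideal gas at 0 degrees C and 1 atm (assumed conditions)
CH4_MOLAR_MASS_G_PER_MOL = 16.04   # molar mass of methane


def chamber_methane_g_per_day(airflow_l_per_min: float,
                              ch4_out_ppm: float,
                              ch4_in_ppm: float) -> float:
    """Daily methane emission from airflow and inlet/outlet CH4 concentrations."""
    # ppm (v/v) difference between outlet and inlet air, as a volume fraction
    delta_fraction = (ch4_out_ppm - ch4_in_ppm) * 1e-6
    # litres of CH4 leaving the chamber per day (60 min/h * 24 h/d)
    ch4_l_per_day = airflow_l_per_min * delta_fraction * 60 * 24
    # litres -> moles -> grams
    return ch4_l_per_day / MOLAR_VOLUME_L_PER_MOL * CH4_MOLAR_MASS_G_PER_MOL


# Hypothetical readings for one animal.
emission = chamber_methane_g_per_day(airflow_l_per_min=1000.0,
                                     ch4_out_ppm=102.0,
                                     ch4_in_ppm=2.0)
print(f"Estimated emission: {emission:.1f} g CH4/d")
```

In practice, corrections for temperature, pressure, humidity, and gas recovery would be applied; they are omitted here to keep the arithmetic visible.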

Prediction Models of Methane Emission from Ruminants

In the past decade, much research interest has been directed to mitigation of methane emission from ruminants, and in parallel, modeling of methane emission from ruminants has also attracted much interest. Generally, prediction models of methane production can be classified into two categories: statistical models and dynamic models. Dynamic models utilize comprehensive mechanisms of digestion and rumen fermentation to simulate and predict methane production. Ideally, these models can simulate system behaviors at lower levels of aggregation and are suitable for prediction across a larger variety of scenarios. However, dynamic models are not easily applied to practical predictions because they require numerous inputs and are computationally demanding. The accuracy of dynamic models is often similar to or below that of statistical models. Practical predictions are generally made by statistical models, such as the IPCC's prediction equations. Statistical models utilize associated information to quantify relationships between variables and make predictions of methane production. Important variables are often selected based on data availability and significant relationships with methane emission. Conventionally, statistical models contain correlated variables from diets, digestion characteristics, and animal performance, which may limit model performance because potentially more informative variables lie in the microbiota, which includes the direct producers of methane and the microbes that affect them. In the past decade, more and more data have become available from many studies across the globe, increasing the number of potentially useful variables that can be used for model development. For example, the first prediction model, developed in the 1930s, contained only carbohydrate intake as the predictor variable (Kriss, 1930). At present, several hundred microbial variables, besides the animal variables, can be generated from a single study that employs NGS technology (Meale et al., 2017). The newly available microbial data provide new opportunities to develop improved prediction models of methane emission from ruminants.

Statistical models:

Statistical models are fairly successful in predicting methane production (Ellis et al., 2007). The first attempt at methane prediction analyzed the statistical relationship between feed intake and methane production (Kriss et al., 1930). Bratzler (1940) and Swift et al. (1948) approximated the daily methane production of steers and cows using the amount of carbohydrate intake. Their equations contained only carbohydrate intake and gave a simple approximation. However, those equations failed to explain practical situations because further studies showed that changes in diet digestibility and dietary energy concentration can change the amount of methane produced (Johnson and Johnson, 1995).

Blaxter and Clapperton (1965) analyzed variation in methane production by testing more than 2,500 samples and combined feed digestibility with the dietary energy at maintenance to predict methane production. They showed that the day-to-day variations within the same animal and between animals were more than 3% greater than instrumental error, and that both the maintenance energy and the feed digestibility were statistically highly significant in the regression model. In addition, they provided evidence that methane production in a group of sheep consuming similar feed could vary significantly and suggested that the variation could be caused by changes in methanogens. Moe and Tyrrell (1979) collected data on methane production and diet components of 404 cows and concluded that accurate methane prediction for lactating dairy cattle would require the determination of diet ingredients, such as the content of neutral detergent fiber (NDF) and soluble residues. The widely applied equation adapted from Moe and Tyrrell’s linear regression is shown in equation 1.1.

$$\mathrm{CH_4}\ (\mathrm{MJ\ day^{-1}}) = 3.38 + 0.51\,\mathrm{NFC}\ (\mathrm{kg/d}) + 2.14\,\mathrm{HC}\ (\mathrm{kg/d}) + 2.65\,\mathrm{C}\ (\mathrm{kg/d}) \qquad (1.1)$$
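As a quick numerical illustration of equation 1.1, the sketch below evaluates it in Python for one diet. The function name and the intake values are hypothetical and chosen only for illustration; the coefficients are copied from the equation as printed here, and NFC, HC, and C are taken to be daily intakes (kg/d) of nonfiber carbohydrates, hemicellulose, and cellulose.

```python
def moe_tyrrell_ch4_mj_per_day(nfc_kg_d, hc_kg_d, c_kg_d):
    """Evaluate equation 1.1 (coefficients as printed in the text).

    Inputs are daily intakes in kg/d; the result is methane energy in MJ/day.
    """
    return 3.38 + 0.51 * nfc_kg_d + 2.14 * hc_kg_d + 2.65 * c_kg_d


# Hypothetical intakes (kg/d), for illustration only.
example_ch4 = moe_tyrrell_ch4_mj_per_day(nfc_kg_d=8.0, hc_kg_d=3.0, c_kg_d=4.0)
print(f"Predicted CH4: {example_ch4:.2f} MJ/day")
```

Because the equation is linear with a positive intercept, it predicts 3.38 MJ/day even at zero intake, which is one reason such equations should only be applied within the range of the data used to fit them.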

Holter and Young (1992) selected significant variables among 15 feed ingredients using backward-stepwise model selection and tested the curvilinear relationships among the feed ingredients. Corroborating the study of Moe and Tyrrell, the authors also reported that including a quadratic DMI term as a predictor variable increased prediction errors. Wilkerson et al. (1995) evaluated the published equations mentioned above and concluded that the equation of Moe and Tyrrell (1979), which included cellulose, hemicellulose, and nonfiber carbohydrates, offered the highest reproducibility and lowest prediction errors. However, the prediction of methane production based on statistical analysis of nutrients is far from reliable (Baldwin, 1995).

It is well established that, as feed intake increases linearly, the percentage of gross energy lost as methane declines, as suggested by Mills et al. (2003). These authors developed linear models with quadratic terms and non-linear models using the Mitscherlich equation, with energy and feed intake as predictor variables. Their results showed that, if the quadratic terms were eliminated in the backward selection, both the linear and the non-linear models produced an MSPE similar to that of Moe and Tyrrell’s model.

Practical methane predictions, such as country-wide predictions, are generally made using statistical models. However, it is difficult to assess methane production using equations containing over 5 diet variables, such as those included in Moe and Tyrrell’s equation, because they require data on the cellulose and hemicellulose content of the feed, yet feeds change often and vary among farms. The Intergovernmental Panel on Climate Change (IPCC, 1997) provided tier 1 and tier 2 methods to evaluate national methane emissions. The tier 1 method requires only the population of ruminant animals, and the tier 2 method requires additional information on country-specific feed and energy intakes. IPCC (2011) introduced the tier 3 method, which includes features of microbial fermentation processes, feed intake, and dietary characteristics. To improve practical prediction accuracy, Ellis et al. (2007) used linear mixed models to establish statistical models that included regularly measured dietary variables, such as DMI and NDF, and selected the models with as few variables as possible to accommodate scenarios with inadequate feed analysis. Ramin and Huhtanen (2013) evaluated the effects of several predictors, such as DMI, NDF, and animal BW, and developed a methane prediction model using linear mixed regression and cross-validation. The best model from Ramin and Huhtanen had a root mean squared prediction error (RMSPE) of 3.04 kJ/MJ. Moraes et al. (2014) developed various enteric methane prediction models that used different feed data as variables, including energy intake only or both energy intake and dietary NDF content. Those models were selected using a Bayesian model selection procedure and showed improved goodness of fit when compared with the models from IPCC and the Food and Agriculture Organization of the United Nations (FAO).
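For reference, the prediction-error measures used throughout this review can be written as follows. This is the standard definition rather than a formula quoted from the cited studies, and the symbols $O_i$ (observed value), $P_i$ (predicted value), and $n$ (number of observations) are notation introduced here.

```latex
\mathrm{MSPE} = \frac{1}{n}\sum_{i=1}^{n}\left(O_i - P_i\right)^2,
\qquad
\mathrm{RMSPE} = \sqrt{\mathrm{MSPE}}
```

RMSPE is expressed in the units of the response variable, which is why the value reported by Ramin and Huhtanen is given in kJ/MJ.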

Dynamic models:

Alongside statistical models, dynamic models have also been developed, driven by advances in the understanding of the physiology of digestion and metabolism in ruminants and by the need for reliable prediction models of animal performance. These models are based on properties of rumen function, feed components, microbial growth, the interaction between the animal and the digestive processes, stoichiometric relationships, and metabolic pathways. Baldwin et al. (1977) constructed rigorous and quantitative dynamic models of ruminant digestive processes and identified specific aspects of ruminant digestion. Those authors also indicated that microbial functions might be key aspects of rumen digestion. France et al. (1982) constructed a dynamic model of the rumen digestive processes based on continuous diet inputs and suggested that the asymptotic value of microbial growth increased prediction errors. Dijkstra et al. (1992) modified the microbial features included in the two dynamic models mentioned above and constructed a whole-rumen function model with more comprehensive microbial effects and volatile fatty acids (VFA) to represent the fermentation processes. Mills et al. (2001) modified certain ruminal fermentation parameters, such as the limitation of the ratio of lipogenic to glucogenic VFA, and improved Dijkstra’s model by taking postruminal digestion into account. As mentioned above, Mills et al. (2003) developed linear models with quadratic terms and non-linear models using the Mitscherlich equation. Their results showed that these models had prediction errors similar to those of Moe and Tyrrell’s model, and previous non-linear models had even larger errors and more practical limitations than linear alternatives. Ellis et al. (2008) reviewed the dynamic models developed by Benchaar and Mills (2001) and hypothesized that incorporating microbiology into dynamic models could help the development of methane models.

In general, the accuracy of statistical and dynamic models based on commonly measured feed inputs is far from satisfactory. Benchaar et al. (1998) compared the prediction capacity of dynamic and mechanistic models reported by Baldwin et al. (1987) and Dijkstra et al. (1992) to that of regression equations reported by Blaxter and Clapperton (1965) and Moe and Tyrrell (1979b) in predicting methane emission from dairy cows. Their results showed that when diets varied widely, mechanistic models were more accurate than regression models, while predictions using simple regression were poor. However, the accuracy of mechanistic models was influenced by input parameters such as chemical composition, degradation rates, and passage rates. Ellis et al. (2010) evaluated 9 equations that are currently used in practice and found significant amounts of uncertainty in them. The authors suggested that the use of those equations is associated with substantial errors and can thus lead to incorrect mitigation recommendations. Conventional models also lack the ability to explain the microbial impact of different mitigation strategies. In addition, recent studies have proposed that the major impacts of different mitigation strategies come from alteration of the microbiota in the rumen. Conventional models consider only diets, digestibility, and animal performance; they lack variables describing the rumen microbiota and thus cannot explain the mechanisms of those mitigation strategies. For example, Ellis et al. (2008) hypothesized that incorporating microbiology into dynamic models could help the development of methane modeling.

Indeed, based on published equations, Moraes et al. (2012) proposed that the cost of methane-mitigating diets can be substantially higher than that of commercial diets currently fed in different parts of the country. Methane prediction models can help determine the next steps in mitigation strategies and the economic consequences of regulatory policies. The goodness-of-fit of prediction models depends on the suitability and mathematical formulation of the primary concepts that link various aspects of the rumen fermentation processes. Both the importance and complexity of the rumen microbiota and its impact on ruminant performance have been recognized owing to continued research on the rumen microbiota. The knowledge of individual rumen microbes with respect to their population dynamics, metabolism, function, and impact on animal performance provides new opportunities to develop quantitative relationships with methane emissions. Currently, statistical models are based on measurements of nutrient supply to host animals. It is therefore conceivable that models to predict methane production should include rumen microbes as predictor variables. Such models can improve upon previous models that contain only animal or feed variables by increasing reliability. Additionally, using microbial data to predict methane production can extend the usability of the models by improving their repeatability and utility across different animal breeds and herds. Moreover, such models can also help in understanding the rumen microbial ecosystem, which is crucial for advancing feed systems designed for methane mitigation.

Chapter 3. Exploratory Data Analysis and Statistical Models

Abstract

The rumen microbiota can have hundreds of genera and thousands of species of different microbes. Thus, datasets of the rumen microbiota are high-dimensional. From a microbiological perspective, different microbial taxa can have different mathematical relationships with methane production, or relationships of different strength. In this study, we used exploratory data analysis (EDA) to examine the features of the data and maximize what we can learn from them. One cow dataset and one sheep dataset were analyzed. Each dataset contained high-dimensional data on both animals and rumen microbes. We used graphical displays to demonstrate the linear relationship between methane production and pertinent variables. Then, we used principal component analysis (PCA) to extract relevant information from inter-related microbial profiles and to examine the relationships or covariance among different microbes and among potentially meaningful variables from the microbiota and animal feed ingredients. The purpose of this analysis was to help reduce multicollinearity in the methane prediction models (see Chapter 5 for the model selection). The EDA showed that the overall trends of methane production with microbial variables (MV) and animal variables were linear, but some variation existed among animals. The microbial data generally followed a normal distribution, and log-transformation of the MV increased the normality of their distributions. The results of PCA showed that the top 10 variables contributing to the first principal component were all MV. Based on the data structures and the patterns of methane production found from EDA and PCA, we made some assumptions about methane prediction models and proposed that linear mixed-effects models are suitable for predicting methane production from both animal and microbial data.

Introduction

Datasets derived from studies that analyzed feed, animal performance, and methane production can have many data entries (or animals) but a low number of variables. Therefore, the dimensions of these datasets are typically small (Chapman et al., 2018), and overfitting is not likely an issue with such datasets. However, studies that analyze the rumen microbiota can generate datasets of very large dimensions because the rumen microbiota is very diverse and species-rich, containing hundreds of microbial taxa. For example, one study that examined the relationship between the rumen microbiota and methane emission from sheep generated a dataset that contained more than 200 genera of rumen microbes (bacteria, methanogens, protozoa, and fungi) (Kittelmann et al., 2014). Such a large dataset can create two potential issues when it is used to develop methane production models: overfitting and missing significant microbial taxa that have strong mathematical relationships with methane production. Therefore, it is important to first perform exploratory data analysis (EDA) to identify the microbial variables (MV) that have the most important mathematical relationships with methane production.

PCA can provide a small number of uncorrelated variables that potentially describe a substantial amount of the variation and can visualize the relatedness of variables in a large-scale dataset (Einasto et al., 2011). The framework of PCA is to build a covariance matrix from the original dataset containing all variables and then compute the eigenvalues and eigenvectors of the covariance matrix. Sorting the eigenvectors by their eigenvalues then yields a new matrix that contains the principal components (PC). Three PCA functions are commonly used in R: “princomp”, “prcomp”, and “PCA”. The difference among these functions is the method of eigenvector decomposition. According to the R help, the function “prcomp” uses a preferred decomposition approach (singular value decomposition) compared to “princomp”.

Results of methane emission experiments often contain repeated data from the same animal. The linear mixed model (LMM) is a linear regression that models correlated data and thus can be used for regression with repeated-measures data (Verbeke, 1997). In terms of linearity, predictors in LMMs are linearly related when the models’ predictors are normally distributed and have independent variance. In terms of mixed effects, LMMs contain both fixed and random effects. Formally, as shown in equation 3.1, LMMs estimate an overall mean $\mu$ after considering a random effect and a random error (e.g., apparatus errors).

Equation 3.1. Definition of LMMs.

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \boldsymbol{\varepsilon}$$

$\mathbf{Y}$ is the vector of observations,

$\mathbf{X}$ is the design matrix for the fixed effects,

$\boldsymbol{\beta}$ is the vector of fixed-effects parameters,

$\mathbf{Z}$ is the design matrix for the random effects,

$\mathbf{b}$ is the vector of random effects, assumed to follow a normal distribution with mean vector zero and variance-covariance matrix $\mathbf{G}$,

$\boldsymbol{\varepsilon}$ is the vector of random errors, assumed to follow a normal distribution with mean vector zero and variance-covariance matrix $\mathbf{R}$, often taken as $\mathbf{R} = \mathbf{I}\sigma^2$, where $\mathbf{I}$ is the identity matrix and $\sigma^2$ is the common error variance. It is often further assumed that $\boldsymbol{\varepsilon}$ is independent of $\mathbf{b}$.
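To make equation 3.1 concrete, one common special case is a random-intercept model in which each animal contributes its own random deviation. The sketch below is illustrative only: the indices $i$ (animal) and $j$ (repeated sample), the row vector $\mathbf{x}_{ij}$, and the variance symbol $\sigma_b^2$ are notation introduced here, not taken from the cited references.

```latex
y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + b_i + \varepsilon_{ij},
\qquad
b_i \sim \mathcal{N}\!\left(0, \sigma_b^2\right),
\qquad
\varepsilon_{ij} \sim \mathcal{N}\!\left(0, \sigma^2\right)
```

In this case the design matrix for the random effects reduces to an indicator matrix that maps each sample to its animal, with $\mathbf{G} = \mathbf{I}\sigma_b^2$ and $\mathbf{R} = \mathbf{I}\sigma^2$.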

The parameters of mixed models can be estimated by optimizing an objective function, such as the log-likelihood, and several approaches to approximate such functions have been proposed, such as PQL (penalized quasi-likelihood) (Breslow & Clayton, 1993). An efficient strategy evaluates parameters using a Cholesky factorization. In practice, the R function “lmer” from the package “lme4” can fit an LMM to data via maximum likelihood or restricted maximum likelihood (Bates et al., 2014).

Methods and Materials

Two datasets were obtained: one sheep dataset that had been published (Kittelmann et al., 2014) and one cow dataset that contained data from two studies. One set of the cow data was derived from a methane mitigation study using steers (Martinez-Fernandez et al., 2016), and the other set was generated from lactating dairy cows (unpublished). Each dataset had data on diet, animals, methane production, and ruminal microbiota composition. The animal data included dry matter intake (DMI), bodyweight (BW), methane output (per animal or per kg DMI), and rumen fermentation characteristics, including total volatile fatty acids (VFA), pH, and acetate:propionate (A:P) ratio. The microbial data were relative sequence abundances (RSA) of individual taxa (phyla, genera, and OTUs), represented as percentages of 16S rRNA gene sequences. Conventionally, the matrix representation of the data is $\mathbf{X}$ ($n \times p$), where $n$ is the number of samples and $p$ is the number of predictor variables. To explore the features of the data, we first graphically displayed the relationship between methane production and the RSA of individual microbial variables. Secondly, we added the animal data to the first step’s graphs. Thirdly, we compared the relationships among categorical variables. We also examined the shape of the distribution of the variables.

Lastly, we used PCA to extract the relevant information. Our PCA was based on the “prcomp” function of R, and we then visualized the results using the “factoextra” package in R (Kassambara and Mundt, 2016). Specifically, we let $\mathbf{X}$ be the dataset containing all variables, an $n \times p$ matrix.

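Procedures of PCA (Shlens, 2014):

1. Center and scale all data: $x_i = x_i - \frac{1}{n}\sum_{j=1}^{n} x_j$

2. Calculate the covariance matrix $\mathbf{C_X}$: $\mathbf{C_X} = \frac{1}{p}\mathbf{X}\mathbf{X}^{T}$

3. Sort the eigenvalues and build a new matrix $\mathbf{P}$. Each row of $\mathbf{P}$ ($\mathbf{P}_i$) is an eigenvector of $\mathbf{C_X}$.

4. $\mathbf{Y} = \mathbf{PX}$ is a new representation of the dataset based on PCA.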
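The steps above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the procedure used in this study (which relied on “prcomp”): samples are stored in rows, the data are centered but not scaled, the covariance uses the n − 1 divisor, and only the first principal component is extracted, by power iteration rather than a full eigendecomposition or singular value decomposition. The function name and the toy data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loading vector, share of variance explained, scores) for PC1.

    rows: list of samples, each a list of p numeric variables.
    """
    n, p = len(rows), len(rows[0])
    # Step 1: center each variable (column) on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Step 2: p x p sample covariance matrix (divisor n - 1).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Step 3: power iteration converges to the eigenvector with the largest
    # eigenvalue, provided the start vector is not orthogonal to it.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p))
                     for a in range(p))
    # The trace of the covariance matrix is the total variance.
    explained = eigenvalue / sum(cov[a][a] for a in range(p))
    # Step 4: project the centered samples onto PC1.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, explained, scores


# Toy data: the second variable is exactly twice the first,
# so a single component should carry all of the variance.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loading, explained, scores = first_principal_component(toy)
print(loading, round(explained, 3))
```

With real microbial data the variance is spread over many components, so the share explained by PC1 would be far below the value obtained for this deliberately redundant toy example.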

Sheep dataset:

The sheep dataset was generated in a published study entitled “Two Different Bacterial Community Types Are Linked with the Low-Methane Emission Trait in Sheep” (Kittelmann et al., 2014). In that study, 117 sheep were selected to quantitatively analyze the ruminal microbiota and measure the methane production of each sheep. Sheep were fed a pelleted alfalfa diet, and methane production was measured in respiration chambers. Feed composition was not determined. Feed intake and animal growth were recorded. Rumen samples were collected by stomach tubing, and fermentation characteristics, including VFA concentration, pH, and A:P ratio, were analyzed. The composition and structure of the rumen microbiota were analyzed using MiSeq sequencing of marker gene amplicons and subsequent bioinformatic analysis using QIIME. All the data were provided by Dr. Peter Janssen (the corresponding author of the published study) and organized as an Excel sheet. This dataset has several features: (i) a relatively large number of animals (118 to be exact) that were not subjected to any dietary intervention, (ii) two samples per animal (236 samples in total), (iii) some animal data, including breed, gender, body weight, and A:P ratio, and (iv) a large number of microbial taxa (229 taxa, including 132 major genera having an RSA >0.01%).

Cow data:

We analyzed one cow dataset that was combined from two studies. The first dataset was from Finland and was generated from a study on lactating dairy cows (unpublished data) that comprised two feeding experiments aimed at reducing methane emission using conjugated linoleic acid (CLA) as the dietary treatment. This dataset comprised 195 animal data entries (not 195 cows because a Latin square design was used) of feed, animal, animal performance, and microbial data. The second dataset was generated from a published study entitled “Methane inhibition alters the microbiota, hydrogen flow, and fermentation response in the rumen of cattle” (Martinez-Fernandez et al., 2016). It contained two Latin square experiments, each utilizing 4 or 8 cows, and 6 treatments: cyclodextrin only, cyclodextrin plus chloroform at low, medium, and high concentrations, chloroform only, and chloroform plus phloroglucinol. The total number of data entries was 57 from 12 cows. We combined the above two cow datasets and created a dataset with 2 feed variables (DMI and additive ingredients) and 6 fermentation variables (rumen pH, total VFA in the rumen, acetate molar proportion, propionate molar proportion, butyrate molar proportion, and the ratio of acetate to propionate). The dataset contained 28 microbial genera: two were archaea, and the remaining genera were bacteria. The microbial data were expressed as percentages of relative sequence abundances (RSA) of genera.

Results and Discussion

EDA and PCA of Sheep Dataset:

For the sheep dataset, we first graphically displayed the relationship between methane production per kg DMI and the RSA of the genera Methanobacteria and Prevotella. As shown in Figure 3.1, methane production tended to have a linear relationship with the RSA of those two genera. Secondly, we added the A:P ratio to the first step’s graphs. The linear relationship could still be observed (Figure 3.2). Thirdly, we compared those relationships among the different sheep breeds. As shown in Figure 3.3, breed did not have a significant effect on the linear relationship. Lastly, we examined the shape of the distribution of the microbial taxa (229 taxa in total). Of these, 149 were normally distributed, and 152 were normally distributed after log transformation.

On the PCA plot of the sheep dataset (Figure 3.6), the first principal component (PC1) explained only 13.4% of the total variance, while the second to tenth PCs each explained from 6.6% to 2.7% of the variance. The percentage of variance explained by each PC was small, and the first 31 PCs together explained 85.41% of the total variance. The overall results thus still contained a lot of redundancy. As shown in Figure 3.7, which shows the variables based on PC1 and PC2, 10 MV contributed the most to PC1 (Table 3.1). Firmicutes, specifically the order Clostridiales, made the largest contribution to PC1, and the family Ruminococcaceae was the largest contributing taxon among the Firmicutes taxa. The genus Selenomonas was negatively associated with axis 1 and axis 2 (Figure 3.7). Two taxa of archaea (methanogens) and one taxon of protozoa were among the primary contributors to PC1, which indicates that those three taxa might play important roles in methane production.

In summary, we discovered that the RSA of some taxa of rumen microbes has a linear relationship with methane production from sheep and that different animals might have random effects on this linear relationship. The microbial data generally followed a normal distribution, which indicates that covariances in the microbiota are low and microbial variances are relatively independent. PCA, as an unsupervised machine-learning method, provided insight into the characteristics of the microbial data but could not substantially reduce the complexity of the dataset.

EDA and PCA of Cow Dataset:

We processed the cow dataset in the same way as the sheep dataset. First, we graphically displayed the relationship between methane production per kg DMI and the RSA of the genus Ruminococcus. We also visualized the relationship between methane production per cow per day and the RSA of the genus Methanobrevibacter. As shown in Figures 3.8 and 3.9, methane production tended to have a positive linear relationship with the RSA of Ruminococcus and Methanobrevibacter. Secondly, we added the A:P ratio to the first step’s graphs. The graphs showed a stronger linear relationship between A:P ratio and methane production per cow per day as the RSA of Ruminococcus increased. Then, we compared those relationships among nine different dietary treatments. Different dietary treatments appeared to have a significant effect on the linear relationship. In addition, we tested the effects of animals that had ≥3 replicate samples on the linear relationship (Figure 3.4). The overall trend was linear. Lastly, we examined the distribution shape of 77 variables: 72 of them were normally distributed, and 73 were normally distributed after log transformation. These results indicate that most of the variables in the cow data were normally distributed.

We used PCA to further analyze the data. PC1 explained 23.5% of the total variance, while the second to tenth PCs each explained from 11.0% to 2.5% of the total variance. The percentage of variance explained by each individual PC was small, and the first 14 PCs together explained 80.24% of the total variance. As shown in Figure 3.14, Butyrivibrio and Succiniclasticum were the major contributors to dimension 1. Two archaeal taxa were the largest contributors to dimension 2, and the butyrate molar proportion had a strongly negative effect on dimension 2.

Based on the EDA, including PCA, we could make the following assumptions: (1) there is a linear relationship between the RSA of at least some taxa of rumen microbes and methane production from sheep and cows (both dairy and beef); (2) different animals have random effects on this linear relationship; (3) the RSA data of most rumen microbes are normally and independently distributed. Linear mixed models (LMMs) can therefore provide an appropriate framework for multivariate regression analysis and prediction of methane production from sheep and cows. Because each animal has two or more replicate samples, the animal’s ID can be treated as a random effect, and other predictor variables, such as DMI and microbial data, can be treated as fixed effects. Therefore, we can build LMMs that include both animal performance data and microbial data.

Figure 3.1. Plot of methane production (per kg DMI) and RSA of genera of sheep rumen microbes.

Figure 3.2. Trends of CH4 (gram) production (per sheep per day) with A:P ratio and the RSA of 2 taxa of rumen microbes.

Figure 3.3. Trends of CH4 (gram) production (per kg DMI) with A:P ratio and the RSA of Ruminococcaceae among different sheep breeds.

Figure 3.4. Percentages of variance explained by each principal component.

Figure 3.5. Graph of variables based on Dimension 1 and 2.

*Positively correlated variables point in the same direction. Negatively correlated variables point in opposite directions.

Figure 3.6. Relationship between methane (gram) production (per kg DMI) by cows and the RSA of Proteobacteria.

Figure 3.7. Plot of methane (gram) production (per kg DMI) and the RSA of Ruminococcus (top panel), and the linear relationship between methane production (per cow per day) and the RSA of Methanobrevibacter (bottom panel) in cows.

Figure 3.8. Linear trend of methane (gram) production (per kg DMI) with RSA of Ruminococcaceae and acetate: propionate ratio in cows.

Figure 3.9. Trends of CH4 (gram) production (per kg DMI) with A:P ratio and RSA of Ruminococcaceae among different dietary treatments* in cows.

*Treatments 1-4, 6 and 8 received 3 g cyclodextrin/100 kg live bodyweight (LW); Treatment 1: chloroform (CF)-cyclodextrin (1.6 g CF/ 100 kg live weight), Treatment 2: chloroform-cyclodextrin (1.6 g CF/ 100 kg LW) + phloroglucinol (75 g/ 100 kg LW), Treatment 3: chloroform-cyclodextrin (2.6 g CF/ 100 kg LW), Treatment 4: chloroform-cyclodextrin (1.6 g CF/ 100 kg LW), Treatment 5: Control without treatment, Treatment 6: cyclodextrin (3 g/100 kg LW), Treatment 7: 150 g/d conjugated linoleic acid, Treatment 8: chloroform-cyclodextrin (1 g CF/ 100 kg LW), Treatment 9: 100 g/d conjugated linoleic acid

Figure 3.10. Trends of CH4 (gram) production (per kg DMI) with A:P ratio and RSA of Ruminococcus among different cows.

Figure 3.11. Percentages of variance explained by each principal component of the cow data.

Figure 3.12. Graph of variables of the cow data, based on Dimension 1 and 2.

*Positively correlated variables point in the same direction, while negatively correlated variables point in opposite directions.

Table 3.1. The top 10 taxa of microbes contributing the most to PC 1

| Rank | Domain | Phylum | Class | Order | Family | Genus |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Bacteria | Firmicutes | Clostridia | Clostridiales | Unknown | Unknown |
| 2 | Bacteria | Firmicutes | Clostridia | Clostridiales | Catabacteriaceae | Unknown |
| 3 | Bacteria | Lentisphaerae | Lentisphaerae | Victivallales | Victivallaceae | Unknown |
| 4 | Eukarya | Ciliophora | Litostomatea | Entodiniomorphida | Ophryoscolecidae | Enoploplastron |
| 5 | Bacteria | Firmicutes | Clostridia | Clostridiales | Ruminococcaceae | Unknown |
| 6 | Archaea | | Methanobacteria | | | Methanobrevibacter |
| 7 | Bacteria | Firmicutes | Clostridia | Clostridiales | XIII.IncertaeSedis | Unknown |
| 8 | Bacteria | Firmicutes | Clostridia | Clostridiales | Ruminococcaceae | Oscillospira |
| 9 | Bacteria | Firmicutes | Clostridia | Clostridiales | Ruminococcaceae | Ruminococcus |
| 10 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | |

Chapter 4. Variable Selections Using the Forward Stepwise Selection

Abstract

Microbial variables (MV) have a strong correlation with methane production because microbial fermentation directly contributes to or affects methane production, as discussed in chapter 2. The relative sequence abundance (RSA) of some MV appeared to have a linear trend with methane production, as discussed in chapter 3. However, utilizing all rumen MV in the development of prediction models for ruminant methane emissions may result in overfitting. Thus, the selection of key explanatory predictors is needed to develop parsimonious models that can robustly predict methane production. Subset selection is a technique for addressing overfitting problems. Here we proposed two frameworks of best-subset selection and one framework of forward-stepwise selection in R to model methane production by sheep and cattle. The novelty of these three frameworks is that they automatically select variables in linear mixed models based on the number of variables and information criteria. Subset selection was performed both when all potential predictors were available and when only MV that had over 0.01% RSA were available. We utilized three approaches for variable selection based on three measurements of methane production: g CH4/d per animal (animal-based models), g CH4/kg dry matter intake (DMI; DMI-based models), and g CH4/d/kg metabolic bodyweight (MBW; MBW-based models). The results of forward-stepwise selection confirmed the significance of the RSA of MV in methane modelling. Forward-stepwise selection based on the Bayesian information criterion (BIC) successfully selected 3 parsimonious subsets for methane production by sheep. The three models based on those variable selections had low mean squared prediction error (MSPE) and a parsimonious pool of predictor variables. This study suggests that prediction models of methane production can be improved when ruminal microbial data are included in the models.

Introduction

Based on the datasets collected and the exploratory data analysis described in chapter 3, microbial data are high-dimensional because of the large number of genera and species (OTUs) and the complexity of the microbiota (substantial variation in relative sequence abundance, RSA). Simply including all these microbial data as variables will lead to prediction models that are overfitted and uninterpretable. When a large number of predictors is available, an ordinary linear regression relying on all of them can produce a fit with low bias but large variance in the parameter estimates. This is so-called “overfitting”. Conversely, when one fits a simple linear regression only between methane production and DMI, as shown in Figure 4.1, the model ignores any curvature in the relationship and always assumes that the relationship is linear. This problem is so-called underfitting, or high bias, which means that the hypothesis of a linear relationship between methane production and DMI will never change after fitting this model. One potential improvement to this model is to add another predictor, such as the RSA of Ruminococcaceae (Figure 4.2), resulting in a better curve to describe this relationship. When one adds more predictors to a model, such as all the different genera of bacteria and feed ingredients, a perfect curve might go through virtually all the observed points because of the immense number of predictor variables. This overfitted model, however, will not perform well with new data because it has large variance, and the relationship between predictors and response is not clear. Most of the previous models reviewed in chapter 2 are underfitted (high bias). In contrast, simply including all possible data will lead to model overfitting.


To solve the overfitting problem, the prediction model can be improved by selecting a set of significant variables and determining a trade-off between model fit and the number of predictor variables. Subset selection can search for and drop some variables while maintaining prediction accuracy, because prediction accuracy can often be determined by a reduced set of predictor variables (Friedman et al., 2001). Two subset selections can be used: best-subset selection and forward-stepwise selection. In best-subset selection, all possible models with a given number of variables and their MSPE are evaluated to make a trade-off between the number of variables and the prediction error. The framework of best-subset selection simply builds all possible models and calculates the prediction error, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) of each. Theoretically, best-subset selection can generate the best models because it searches through all possible models.

However, if the number of variables is large (p > 50), the evaluation of 2^p − 1 different models is time-consuming and computationally intensive, even for a supercomputer. Currently, three R packages (“leaps”, “bestglm” and “BeSS”) can be used to complete efficient best-subset selection based on Linear Models and Generalized Linear Models, but best-subset selection based on Linear Mixed Models (LMMs) is limited and still under development. An alternative subset selection is forward-stepwise selection, which follows a forward search order to seek a path through a reduced set of all possible subsets.

The principle of forward-stepwise selection is to add individual predictors one at a time to the null model, which contains only an intercept, and to select the subset of explanatory predictors by an information criterion. Forward-stepwise selection has the potential to reduce the number of predictor variables and make models interpretable. For instance, the objective of BIC is to find an optimal model in the sense that it fits the data well while penalizing model complexity.


Equation 4.1 defines BIC (Schwarz, 1978). The principle of BIC is to reward a better fit (a lower value of the cost function) while penalizing more complex models (Ihaka & Gentleman, 1996). Adding a parameter to the model can improve the fit, but the model is also penalized for the increased complexity.

Equation 4.1: Generic form to define the BIC:

$\mathrm{BIC} = \ln(n)\,k - 2\ln(\hat{L})$

$\hat{L}$ : the maximized value of the likelihood function of the model.

$n$ : the sample size.

$k$ : the number of predictors.
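Equation 4.1 can be illustrated with a minimal Python sketch. It assumes a Gaussian linear model, so that the maximized log-likelihood can be written from the residual sum of squares; the function names and the numbers are hypothetical and are not taken from the thesis scripts, which were written in R.

```python
import math


def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood of a Gaussian linear model with residual sum of squares rss."""
    sigma2 = rss / n  # maximum-likelihood estimate of the error variance
    return -0.5 * n * (math.log(2 * math.pi) + math.log(sigma2) + 1)


def bic(log_lik, n, k):
    """Equation 4.1: BIC = ln(n) * k - 2 * ln(L_hat)."""
    return math.log(n) * k - 2 * log_lik


# Hypothetical numbers for illustration only (not thesis data):
# a small model and a larger model that fits only slightly better.
n = 100
bic_small = bic(gaussian_log_likelihood(rss=250.0, n=n), n, k=3)
bic_large = bic(gaussian_log_likelihood(rss=245.0, n=n), n, k=10)
```

In this made-up comparison the larger model lowers the residual sum of squares only marginally, so the ln(n)·k term dominates and the smaller model has the lower BIC.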

The objective of this study was to use forward-stepwise selection to select rumen MVs for methane prediction models.

Material and Method:

Framework Development:

Best Subset selection:

Here we proposed two simple algorithms and R scripts to automatically complete best-subset selection based on LMMs. The first algorithm (Figure 4.4) was built to list and construct all possible models and fit them using the “lmer” function from the package “lme4”. The second algorithm (Figure 4.5) used a “tracker” to record the complexity (number of variables) of the models, and it gradually enumerated all possible models for each given number of variables. When the number of variables was large, the second algorithm could significantly reduce the need for computer RAM by using recursion, in which a function calls itself within its own code.

Forward-stepwise selection:

As shown in Figure 4.6, we built the function “Forward.lmer” around the package “lme4”; this algorithm for linear mixed models was inspired by the work of Nieuwenhuis (2018). We fixed several bugs in the Z test and F test of his work, and our algorithm and functions provided a procedure for forward-stepwise selection in mixed-effect models based on BIC. The function started with a null model and a “track” that recorded the size of the model. The model was then passed to the “lmer” function to produce an LMM and its summary. In the next iteration, a variable was temporarily added to the model and our function compared the new summary with that of the previous model. If the result met the chosen criterion, for example, if it reduced BIC, the variable was kept in the model. Otherwise, the variable was discarded. Our function offered four selection criteria: the p-value of the Z test, the p-value of the F test, BIC, and AIC. Once the size of the model recorded by the “track” reached the desired setting, the function ended.

Our forward-stepwise function had several parameters that needed to be specified: the starting model (e.g., a model that has only the intercept), the names of the variables that were iteratively added to the starting model, the maximum size of the model, and the selection criterion (e.g., p-value of the Z or F test, BIC, or AIC). In our study, the selection process started with the intercept and then sequentially expanded the model by selecting and adding predictors, eventually leading to a parsimonious model. By gradually increasing the number of predictors included in the model, forward-stepwise selection made the locally optimal choice at each model size, in the hope of approaching a global optimum.
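The selection loop described above can be sketched in Python. This is an illustration under stated assumptions, not the thesis function: it fits ordinary least-squares models rather than LMMs (there is no random animal effect), it uses the textbook greedy variant that adds the candidate giving the largest BIC reduction at each step, and all names (`solve`, `ols_bic`, `forward_stepwise`) are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x


def ols_bic(columns, y):
    """BIC of a least-squares fit with an intercept plus the given predictor columns."""
    n = len(y)
    design = [[1.0] * n] + columns
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in design]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * c[i] for b, c in zip(beta, design))) ** 2
              for i, yi in enumerate(y))
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return math.log(n) * len(design) - 2 * log_lik


def forward_stepwise(candidates, y, max_size):
    """Start from the intercept-only model; keep adding the predictor that lowers BIC most."""
    selected = []
    best = ols_bic([], y)
    while len(selected) < max_size:
        trials = {name: ols_bic([candidates[s] for s in selected] + [col], y)
                  for name, col in candidates.items() if name not in selected}
        if not trials:
            break
        name = min(trials, key=trials.get)
        if trials[name] >= best:  # no remaining candidate reduces BIC
            break
        selected.append(name)
        best = trials[name]
    return selected, best
```

Here `candidates` maps a variable name to its column of values, and `max_size` plays the role of the maximum model size mentioned in the text. A perfect fit (zero residual sum of squares) is not handled in this sketch.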


Variable Selections of Methane Prediction Models:

Predictors were centered and standardized. Specifically, we subtracted the column means (ignoring unknown data) from their corresponding columns to center the data so that each column had a mean of 0. Then we divided the centered columns by their standard deviations to scale the data. The three dependent responses were based on three measures of methane production: g CH4/d per animal (animal-based models), g CH4/kg dry matter intake (DMI; DMI-based models), and g CH4/d/kg metabolic bodyweight (MBW; MBW-based models). We first used the two frameworks of best-subset selection. Then we used the Z test as the criterion for forward-stepwise selection. The number of iterations of forward-stepwise selection was set to the number of potential predictors. The forward-stepwise selection started with the intercept and then sequentially expanded the model to select predictors. The threshold of significance of the Z test was set at p < 0.01. Then, we used BIC as the information criterion to select variables, again with the number of iterations set to the number of potential predictors. Because the number of potential predictors in the dataset was relatively large (310 in total), we filtered out the MV that had a relative sequence abundance lower than 0.01%, and we also tested the performance of forward-stepwise selection when all potential predictors were offered to the selection. After we found a proper subset of predictors that were associated with methane production, we built LMMs with those predictors. Ten-fold cross-validation was used to test the performance. Because the models also accounted for the animal, the training subdataset and the testing subdataset did not contain any data from the same animal. We tested 2 types of test data and training data: one with the log transformation of the relative sequence abundances of microbes and one without the log transformation. We examined those models by the mean squared prediction error (MSPE) and the variance inflation factor (VIF) of the models. MSPE, RMSPE, and RMSPE% were calculated using equations 4.2, 4.3, and 4.4, respectively.
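The requirement that training and testing data never share an animal can be sketched as a grouped k-fold split. The Python below is a minimal illustration, assuming that each record carries an animal identifier; the function name `grouped_k_fold` and the seed handling are hypothetical and do not come from the thesis R scripts.

```python
import random


def grouped_k_fold(animal_ids, k=10, seed=0):
    """Yield (train_indices, test_indices) so that no animal appears in both sets."""
    animals = sorted(set(animal_ids))
    random.Random(seed).shuffle(animals)
    # Deal the shuffled animals into k folds; whole animals are held out together.
    folds = [set(animals[i::k]) for i in range(k)]
    for held_out in folds:
        test = [i for i, a in enumerate(animal_ids) if a in held_out]
        train = [i for i, a in enumerate(animal_ids) if a not in held_out]
        yield train, test
```

Repeating the split with different seeds gives repeated cross-validation. If there are fewer animals than folds, some test sets in this sketch are empty.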

Equation 4.2: Calculation of MSPE:

$\mathrm{MSPE} = E\left[\left(g(x_i) - \hat{g}(x_i)\right)^2\right]$

$E$ : Mean Value

$g(x_i)$ : True Value

$\hat{g}(x_i)$ : Prediction Value

Equation 4.3: Calculation of RMSPE:

$\mathrm{RMSPE} = \mathrm{MSPE}^{1/2}$

Equation 4.4: Calculation of RMSPE%:

$\mathrm{RMSPE}\,(\%) = 100\% \times \dfrac{\mathrm{RMSPE}}{\text{Mean of Response}}$
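Equations 4.2 to 4.4 can be written as three short Python functions. This is a sketch with hypothetical function names; it takes the mean in Equation 4.2 over the test observations and the mean of the response over the same observations.

```python
import math


def mspe(observed, predicted):
    """Equation 4.2: mean of the squared differences between true and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def rmspe(observed, predicted):
    """Equation 4.3: square root of MSPE, in the units of the response."""
    return math.sqrt(mspe(observed, predicted))


def rmspe_percent(observed, predicted):
    """Equation 4.4: RMSPE as a percentage of the mean observed response."""
    mean_response = sum(observed) / len(observed)
    return 100.0 * rmspe(observed, predicted) / mean_response
```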

We repeated the 10-fold cross-validation 1000 times. Inputs of the model assessment were the original variable data. The R scripts were executed on the computer cluster at the Ohio Agricultural Research and Development Center (OARDC) via PuTTY (COPLIN et al., 2012).

Results and Discussion

Forward-stepwise selection based on the BIC:


The first algorithm of best-subset selection could not be completed because the RAM available on the OARDC computer cluster was insufficient. The second best-subset selection ran, but not smoothly, because the evaluation was very time-consuming.

The forward-stepwise selection based on the significance level of the Z test produced 8 variable selections, with sizes of 53, 25, 23, 20, 18, 43, 16 and 26. Those selections were excluded from further model building because they contained more than 18 predictors. One variable selection had 16 variables, but we also excluded it because it contained feed additives. The sizes of the variable selections by forward-stepwise selection based on the Z test and BIC are summarized in Table 4.1. The resulting models selected with the Z test were not parsimonious, but the procedure confirmed that several MV had a p-value below 0.01 in the selected model. The BIC-based forward-stepwise selection steadily decreased the BIC over the iterations of variable selection (Figure 4.7) and produced 8 variable selections, most of which were smaller than those resulting from selection based on Z tests. The sizes of the sets of predictor variables were 6, 9, 12, 84, 33, 27, 22 and 20. Variable selections 1, 2 and 3 were sparse. The sizes of the other selections were 20 or more.

Linear Mixed Models 1, 2 and 3 were built from variable selections 1, 2, and 3 for the three different responses. The total number of MV contained in the three models was 13 (Table 4.5). All the models contained one unknown genus in the family Ruminococcaceae. Ruminococcaceae is a family of Gram-positive bacteria that is predominant in the rumen and produces hydrogen during fermentation of carbohydrates. Models 1 and 2 both contained Parabacteroides. The major end products of glucose fermentation by Parabacteroides have been described as acetic and succinic acids (Barnesiella, 2015). Models 1 and 3 both contained one unknown genus in the family Coriobacteriaceae, which belongs to the phylum Actinobacteria. The family Coriobacteriaceae contains 14 genera (Clavel et al., 2014), some of which have been detected in the rumen, such as Atopobium, Denitrobacterium, Olsenella, and Slackia. Models 2 and 3 had 4 MV in common, all of which were bacteria.

The first predictor of model 1 was the acetate: propionate ratio, and the remaining 6 predictors were MVs (Table 4.2). Five of the MV were bacteria, and one MV was a fungus. The mean RMSPE of the 1000 cross-validations was 2.126 g/kg DMI, and the BIC was 871.6. The variance of the animal random effect was 1.435, and the variance of the random error was 1.434. Model 2 had 9 predictors (Table 4.3). The first two predictors were animal variables, the acetate: propionate ratio and gender (coded as 0 for female and 1 for male), and the remaining 7 predictors were MV. Six of the MV were bacterial taxa, and the other MV was an archaeal taxon. The mean RMSPE was 0.1483 g/d·kg MBW, and the BIC was 810.4. The variance of the animal random effect was 0.0262, and the variance of the random error was 0.0264. Model 3 contained 11 variables (Table 4.4): 3 were animal variables and 8 were MV. The mean RMSPE was 2.881 g/d·animal, and the BIC was 1035.254. The variance of the animal random effect was 1.141, and the variance of the random error was 1.442. The RMSPE of the three models as a percentage of the response ranged from 0.132 to 0.141.

However, the framework produced large sizes of variable selections for the cow dataset (data not shown).

Summary

We built three frameworks of subset selection that can automatically select variables based on information criteria and cross-validation. We recommend the second framework for best-subset selection because it does not require a large amount of computer RAM, and because it fixed several bugs and can automatically complete model selection for LMMs.


We used those frameworks for variable selection. Executing the R scripts of the best-subset selection was time-consuming and practically infeasible even with a computer cluster. Rather than searching through all possible subsets, we used forward-stepwise selection to sequentially add variables and developed R scripts (function: “forward.lmer”) for the forward-stepwise selection of LMMs based on the significance level of the Z test or T test, or on BIC as the information criterion. The framework of forward-stepwise selection successfully selected significant variables based on the BIC in the sheep dataset. For the sheep dataset, three selections of predictors had an RMSPE (%) below 2%, which indicates accurate predictions. The BIC and the sizes of the three variable selections were also small, indicating that the frameworks reduced the overfitting problem. However, the forward-stepwise selection showed a drawback: the other selections were not parsimonious. The forward-stepwise selection uses an algorithm to sequentially add variables to a prediction model, and the order in which the variables are added follows a specific path. This path could affect the finding of the optimal model because it could hide a significant variable or add an unnecessary variable to the model.

In summary, the forward-stepwise selection successfully addressed the overfitting problem and performed well on the sheep data. It confirmed that the BIC of methane prediction models could be improved if microbial data are included as variables, and it also confirmed that a large number of MV were strongly significant by the Z test in methane models. The three sheep models had relatively low RMSPE. However, the selection did not perform well on the cow dataset when a large number of potential predictors were offered to the model selection procedure. The forward-stepwise selection keeps or drops predictors discretely, which probably makes the resulting models highly variable and inaccurate, even if a different search path is chosen. These unsatisfactory results motivated us to build models with a method that has greater computational advantages.

Figure 4.1. An example of an underfitted linear model of methane production that includes DMI as the sole variable.

Figure 4.2. An example of a proper linear model of methane production that includes DMI and Ruminococcaceae as variables.

Figure 4.3. An example of an overfitted linear model of methane production that includes DMI and 10 different types of bacteria and feed ingredients as variables.

Figure 4.4. Flowchart of the no-tracker algorithm to automatically complete the best-subset selection based on LMMs.

Figure 4.5. Flowchart of the tracker algorithm to automatically complete the best-subset selection based on LMMs.

Figure 4.6. Flowchart of the algorithm for the forward-stepwise selection.

Figure 4.7. The change of BIC with forward-stepwise iterations.

Table 4.1. Sizes of variable selections by forward-stepwise selection based on Z test and BIC.

| Response | Dataset | Size of predictors in the dataset | Size of variable selection by forward-stepwise based on Z test | Size of variable selection by forward-stepwise based on BIC |
|---|---|---|---|---|
| CH4 (g/kg DMI) | Sheep | 132 | 53 | 6 |
| CH4 (g/d·kg MBW) | Sheep | 132 | 25 | 9 |
| CH4 (g/d·Animal) | Sheep | 132 | 23 | 12 |
| CH4 (g/kg DMI) | Sheep | 310 | 20 | 84 |
| CH4 (g/d·kg MBW) | Sheep | 310 | 18 | 33 |
| CH4 (g/d·Animal) | Sheep | 310 | 43 | 27 |
| CH4 (g/kg DMI) | Cow | 35 | 16 | 22 |
| CH4 (g/d·Animal) | Cow | 35 | 26 | 20 |


Table 4.2. For the sheep dataset, model 1 selected by the forward-stepwise selection.

| Variables | VIF | Coefficients | Std. Error | T value |
|---|---|---|---|---|
| Intercept | | 9.230 | 0.784 | 11.787 |
| Acetate: propionate ratio | 1.109 | 1.104 | 0.191 | 5.784 |
| Ruminococcaceae (bacterial family) | 1.249 | 15.761 | 4.102 | 3.848 |
| YS2 (a candidate bacterial order) | 1.073 | 62.524 | 14.341 | 4.361 |
| The genus Olsenella | 1.212 | -47.740 | 9.861 | -4.852 |
| Coriobacteriaceae (bacterial family) | 1.137 | 109.783 | 31.686 | 3.470 |
| The genus Parabacteroides | 1.031 | -96.270 | 27.622 | -3.494 |
| Cyllamyces 2 (a candidate fungal genus) | 1.001 | -972.234 | 287.934 | -3.388 |

Random effect from animal variance: 1.435, 1.296

Random error variance: 1.434, 1.295

MSPE: 4.521, 1.997

RMSPE: 2.126

RMSPE %: 0.141


Table 4.3. For the sheep dataset, model 2 selected by the forward-stepwise selection based on BIC.

| Variables | VIF | Coefficients | Std. Error | t value |
|---|---|---|---|---|
| Intercept | | 7.030 | 0.748 | 9.401 |
| Acetate: propionate ratio | 1.281 | 1.190 | 0.181 | 6.569 |
| Gender as Male | 1.094 | 1.427 | 0.313 | 4.564 |
| Ruminococcaceae (a bacterial family) | 2.487 | 26.900 | 3.860 | 6.968 |
| Prevotella | 1.358 | 3.165 | 0.992 | 3.192 |
| Z20 (a candidate bacterial order) | 1.201 | -252.273 | 45.093 | -5.613 |
| The genus Eubacterium | 2.245 | -86.053 | 23.208 | -3.711 |
| The genus Parabacteroides | 1.069 | -85.431 | 23.655 | -3.618 |
| The genus Ruminococcus | 1.364 | -44.631 | 13.067 | -3.421 |
| The genus Methanobrevibacter | 1.137 | -2.585 | 0.888 | -2.913 |

Random effect from animal variance: 0.0262, 0.0256

Random error variance: 0.0264, 0.0257

MSPE: 0.022

RMSPE: 0.14832397

RMSPE %: 0.118172871


Table 4.4. For the sheep dataset, model 3 selected by the forward-stepwise selection based on BIC.

| Variables | VIF | Coefficients | Std. Error | t value |
|---|---|---|---|---|
| Intercept | | -0.823 | 0.187 | -4.407 |
| Acetate: propionate ratio | 1.109 | 0.216 | 0.027 | 7.925 |
| DMI | 1.249 | 0.844 | 0.124 | 6.833 |
| BW | 1.073 | 0.020 | 0.004 | 5.522 |
| Z20 (a candidate bacterial order) | 1.212 | -41.281 | 6.888 | -6.027 |
| Ruminococcaceae (a bacterial family) | 1.137 | 3.554 | 0.622 | 5.721 |
| The genus Prevotella | 1.031 | -1.564 | 0.408 | -3.855 |
| The genus Oxalobacter | 1.001 | -132.957 | 47.021 | -2.839 |
| Coriobacteriaceae (a bacterial family) | 1.238 | 13.986 | 4.392 | 3.194 |
| The genus Ruminococcus | 1.554 | -19.076 | 6.592 | -5.077 |
| The genus Eubacterium | 1.344 | -15.390 | 3.804 | -4.050 |

Random effect from animal variance: 1.141, 0.714

Random error variance: 1.442, 1.035

MSPE: 8.303

RMSPE: 2.881

RMSPE % of Response: 0.131


Table 4.5. Taxonomic details of the selected MVs.

| Model Number | Domain | Phylum | Class | Order | Family | Genus |
|---|---|---|---|---|---|---|
| 1, 2, 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Ruminococcaceae | Unknown |
| 1, 2 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Parabacteroides |
| 1, 3 | Bacteria | Actinobacteria | Actinobacteria | Coriobacteriales | Coriobacteriaceae | Unknown |
| 2, 3 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella |
| 2, 3 | Bacteria | Lentisphaerae | Lentisphaerae | Z20 | Unknown | Unknown |
| 2, 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Eubacterium |
| 2, 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Ruminococcaceae | Ruminococcus |
| 1 | Bacteria | Cyanobacteria | 4C0d-2 | YS2 | Unknown | Unknown |
| 1 | Bacteria | Actinobacteria | Actinobacteria | Coriobacteriales | Coriobacteriaceae | Olsenella |
| 1 | Fungi | Neocallimastigales | Neocallimastigaceae | Cyllamyces | Cyllamyces 2 | Unknown |
| 2 | Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales | Methanobacteriaceae | Methanobrevibacter |
| 3 | Bacteria | Proteobacteria | Betaproteobacteria | Burkholderiales | Oxalobacteraceae | Oxalobacter |
| 3 | Bacteria | Firmicutes | Clostridia | Clostridiales | Ruminococcaceae | Ruminococcus |


Chapter 5. Variable Selections by GLMMLASSO and Model Development

Abstract

Although the forward-stepwise selection partially addressed the “overfitting” and “uninterpretable” problems and three models performed well for the sheep data, the other selected models still had poor results. To make parsimonious variable selections when the number of predictors was over 300, and to develop better models with a more computationally advantageous method, we used a machine-learning method: GLMMLASSO (generalized linear mixed models using L1-penalization). The computational advantage of GLMMLASSO is that it adds a penalty that continuously shrinks coefficients and addresses the overfitting problem (Friedman et al., 2001). The variable selection of GLMMLASSO was performed when all potential predictors were available and when only MV that had an RSA greater than 0.01% were available. In the second scenario, high collinearity can be harmful for the GLMMLASSO selection. Thus, before the GLMMLASSO selection we reduced the variance inflation factor (VIF) by discarding a fixed number of variables, which created several variable choices. Responses were set as: g CH4/d per animal (animal-based models), g CH4/kg DMI (DMI-based models), and g CH4/d/kg metabolic bodyweight (MBW; MBW-based models). The optimal penalty of GLMMLASSO was selected based on BIC, and cross-validation (CV) was used to identify the subset of microbial variables (MV) that resulted in the lowest mean squared prediction error (MSPE). When the penalty parameter lambda was set between 0 and 1000, GLMMLASSO converged and selected 8 to 13 variables for each type of response. Across the 54 variable selections in the sheep dataset, one unknown genus of the candidate order Z20 had the largest negative t-value. The challenge of predictions in the cow dataset was described in chapter 3, and one of the animal-based selections had low MSPE and did not contain any dietary treatment. Thus, the effect of the dietary treatments was explained by the combination of animal variables and MVs. For linear models, we tested the performance of models that contained variables selected by GLMMLASSO with CV or by forward-stepwise selection, and of a traditional model that contained only DMI, acetate: propionate ratio, and BW as predictor variables. The sheep models built with variables from the GLMMLASSO selection with CV had lower MSPE than the other models. For the sheep dataset, DMI-based models contained only MVs. The animal-based model for the cow dataset had better interpretability and lower MSPE than the other models. However, it had high VIF, probably caused by high collinearity among the archaeal variables. Animal random effects were low for all the models and similar to the random error. We concluded that GLMMLASSO selection with cross-validation is a useful method to extract a meaningful and significant subset of MVs that can improve the accuracy of methane prediction models. Further study can focus on the variables common to the cow and sheep datasets, interaction terms, and the between-animal variation determined by the animal random effects.

Introduction

In chapter 4, we found that forward-stepwise selection with the Bayesian information criterion (BIC) reduced the number of predictors and made models interpretable. It also confirmed that microbial data or variables could be included in models to predict methane production. However, including MV in methane prediction models was not successful when a large number of predictors was available in the dataset and when the dietary treatments were used as potential predictors of methane in the cow dataset. The main purpose of this study was to use GLMMLASSO with CV based on BIC to solve the overfitting problem and to further explore dietary treatments as potential predictors. GLMMLASSO is an extension of LASSO. LASSO is a shrinkage method that uses an L-1 penalty to control the shrinkage in variable selection. The extension of LASSO to generalized linear mixed models, called “GLMMLASSO”, is a powerful machine-learning algorithm for variable selection in mixed models, including Linear Mixed Models (Schelldorfer et al., 2014). The penalty improves prediction accuracy by shrinking some coefficients and/or setting some coefficients to exactly zero. This shrinkage to 0 yields a relatively small subset of predictors that accounts for the important variables in the prediction model. Regression algorithms fit parameters by minimizing a cost function; for example, the cost function widely used in regression is the residual sum of squares (RSS), J(θ) in equation 5.1, which does not contain any penalty on the parameters θ. Besides minimizing the RSS, the GLMMLASSO selection also shrinks the parameters through a weighted L-1 norm penalty on the parameters, denoted λ‖θ‖ (equation 5.2), which is called “regularization”. In other words, the GLMMLASSO selection allows two steps of model selection. The first is completed smoothly by varying the L-1 penalty: increasing the L-1 penalty decreases the number of predictors. The second step can then be completed using information criteria such as BIC, as done in the forward-stepwise model selection. Thus, GLMMLASSO has a computational advantage over subset selection because it adds an L-1 penalty to the cost function, and the L-1 penalty can continuously shrink the coefficients of predictor variables. In addition, because the algorithm has already selected the significant predictors before the model selection using information criteria, GLMMLASSO often selects models that are more parsimonious than those from subset selection and avoids hiding significant predictors.


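Equation 5.1: Calculation of RSS of LMM:

$\mathrm{RSS} = J(\theta) = \dfrac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

where $y^{(i)}$ is the observation in data $i$.

$m$ is the size of data.

$h_\theta(x^{(i)})$ is the prediction in data $i$.

Equation 5.2: Calculation of J(θ):

$J(\theta) = \dfrac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \lVert \theta \rVert$

$\lambda$ is the weight of the L-1 LASSO penalty.

$\lVert \theta \rVert$ is the L-1 norm of the parameter coefficients.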
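The effect of the L-1 penalty in Equation 5.2 can be seen in the simplest possible case. The Python sketch below assumes a single predictor and no intercept or random effects, for which the minimizer of Equation 5.2 has a closed form (soft-thresholding); it is not the GLMMLASSO algorithm itself, and the function names are hypothetical.

```python
def cost(theta, x, y, lam):
    """Equation 5.2 for one predictor: (1 / 2m) * sum((theta * x_i - y_i)^2) + lam * |theta|."""
    m = len(y)
    rss = sum((theta * xi - yi) ** 2 for xi, yi in zip(x, y))
    return rss / (2 * m) + lam * abs(theta)


def lasso_single(x, y, lam):
    """Closed-form minimizer of cost() for one predictor, obtained by soft-thresholding."""
    m = len(y)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / m  # average cross-product of x and y
    z = sum(xi * xi for xi in x) / m                # average squared value of x
    if rho > lam:
        return (rho - lam) / z
    if rho < -lam:
        return (rho + lam) / z
    return 0.0  # the penalty outweighs the fit gain, so the coefficient is exactly zero
```

With lam = 0 the result is the ordinary least-squares slope. As lam grows the coefficient shrinks continuously toward zero, and once lam reaches the magnitude of the average cross-product it is set to exactly zero, which is how the penalty removes a predictor from the model.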

The optimization of GLMMLASSO can be solved by a gradient ascent algorithm, which maximizes the log-likelihood with a given L-1 penalty (equivalently, minimizes the penalized cost function) to obtain stable parameter estimates (Groll and Tutz, 2014). At each step, the gradient ascent algorithm evaluates the L-1-penalized log-likelihood through a directional second-order Taylor approximation. The core issue is to find the correct step size of the Taylor approximations. GLMMLASSO finds the optimal step size by an “update” routine (Groll and Tutz, 2014). The R scripts of GLMMLASSO with the gradient ascent optimization were developed by Groll and Tutz for high-dimensional datasets that contain a large number of potentially influential explanatory predictors (Groll and Tutz, 2014). The Bayesian information criterion (BIC) is one criterion available to identify the optimal L-1 penalty. The objective of BIC is to find the optimal model that fits the data well while penalizing model complexity. The principle of BIC is to reward a better fit (a lower value of the cost function) while penalizing more complex models (Ihaka and Gentleman, 1996). Adding a parameter to the model can improve the fit but also incurs a penalty for the added complexity. A high degree of multicollinearity between predictor variables can be harmful for the GLMMLASSO selection procedure. For the development of stable models using GLMMLASSO, the multicollinearity measured by the variance inflation factor (VIF) should be kept below 10, a level at which predictors are not considered highly collinear. VIF measures the collinearity between a set of predictors (equation 5.3). If the predictors are not collinear, then VIF attains its minimum value of 1.

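Equation 5.3: Calculation of VIF:

$V(\hat{\beta}_j) = \dfrac{\sigma^2}{\sum_i (x_{ij} - \bar{x}_j)^2} \cdot \dfrac{1}{1 - R_j^2}$

The factor $1/(1 - R_j^2)$ is the VIF of the jth predictor.

$\hat{\beta}_j$ is the estimate of the jth parameter.

$x_{ij}$ is the $i$th row, jth column of data.

$\bar{x}_j$ is the mean of the jth column of data.

$R_j^2$ is the square of the multiple correlation from the regression of the jth column of data on the other columns.

$\sigma^2$ is the error variance.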
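The VIF factor in Equation 5.3 can be computed by regressing one predictor column on the others. The Python sketch below is an illustration with hypothetical function names; it uses the identity 1 / (1 − R_j²) = TSS / RSS for that auxiliary regression and does not handle perfectly collinear columns, for which the residual sum of squares is zero.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x


def vif(columns, j):
    """VIF of column j: 1 / (1 - R_j^2) from regressing column j on the other columns."""
    target = columns[j]
    n = len(target)
    design = [[1.0] * n] + [c for i, c in enumerate(columns) if i != j]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(u * v for u, v in zip(ci, target)) for ci in design]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, design)) for i in range(n)]
    rss = sum((t - f) ** 2 for t, f in zip(target, fitted))
    mean = sum(target) / n
    tss = sum((t - mean) ** 2 for t in target)
    return tss / rss  # equals 1 / (1 - R_j^2)
```

Two uncorrelated columns give a VIF of 1, the minimum; a column that is almost a copy of another gives a large VIF.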


Method and Material:

Variable Selection:

Our procedure of variable selection based on GLMMLASSO included 6 steps, all of which were completed in R. First, the original predictor variables in the training dataset were centered and scaled to have a mean of 0 and a variance of 1, as inputs to GLMMLASSO. Variable selections were performed when all potential predictors were available and when only the MVs that had an RSA > 0.01% were available. When MVs that had a lower RSA were also available, we reduced the VIF of the centered and scaled training data by discarding some variables, in order to prevent collinearity. We built an R function that fitted a linear regression model with all variables to identify variables that had a VIF over 100. It then sequentially discarded the desired number of variables to reduce the VIF. As shown in Figure 5.2, in the first stage of VIF reduction we set the VIF threshold to 30 and the number of discarded variables to 5. If the VIF was not reduced to below 10, in the second stage of VIF reduction the VIF threshold was set to 20 and the number of discarded variables was again 5. Thus, we created several variable choices: those from the first or second stage of VIF reduction and the one in which all potential predictors were available.

After reducing VIF, we initialized GLMMLASSO by calculating the null model’s random effect, random error, and covariance matrix of random effects. The link function was set to the identity function with a Gaussian distribution of the response variable, i.e., specifying the LMM as a special case of a generalized linear mixed model. The animal ID was set as the random effect. The initial values were supplied by “glmmPQL” (Venables and Ripley, 2013), as shown in Figure 5.3. Because each variable choice is distinct, the number of variables discarded to reduce VIF and the number of parameters to be initialized need to be specified for each choice. Thus, we wrote a function to automatically set the initialization parameters of GLMMLASSO.

After the parameters were initialized, GLMMLASSO fitted the LMMs based on the variable choices from the VIF reduction. The penalty parameter lambda, which controls the weight of the L-1 penalty, was varied over an interval from 0 to 1000 in increments of 0.1. The BICs of the models with the varying lambdas were collected in a list, and the optimal lambda was selected as the value that minimized the BIC. The threshold for variable selection was a p-value below 0.05. To reduce computation time, we ran several multithreaded functions in R on the computer cluster at the Ohio Agricultural Research and Development Center (OARDC) via PuTTY (COPLIN et al., 2012). We set three methane production responses based on three measures of methane production: g CH4/d per animal (animal-based models), g CH4/kg DMI (DMI-based models), and g CH4/d/kg metabolic bodyweight (MBW; MBW-based models).

The GLMMLASSO selection created several variable selections. Ten-fold CV was used to separate datasets into training datasets and testing datasets. In particular, because we considered the random effect in our prediction models, the framework automatically selected training data that did not contain any data from the same experiment as the testing data.
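The grid search over lambda described above can be sketched in Python. This is only an outline of the selection rule: `bic_of_lambda` is a hypothetical stand-in for fitting the penalized mixed model at one value of lambda and returning its BIC, and `toy_bic` is a made-up function used purely to exercise the code.

```python
def lambda_grid(upper=1000.0, step=0.1):
    """Grid from 0 to upper inclusive; built from integer counts to avoid float drift."""
    count = int(round(upper / step))
    return [i * step for i in range(count + 1)]


def select_lambda(grid, bic_of_lambda):
    """Return the (lambda, BIC) pair on the grid with the smallest BIC."""
    scored = [(bic_of_lambda(lam), lam) for lam in grid]
    best_bic, best_lam = min(scored)
    return best_lam, best_bic


def toy_bic(lam):
    """Made-up BIC curve with its minimum near lambda = 12.3 (illustration only)."""
    return 800.0 + (lam - 12.3) ** 2


grid = lambda_grid()
best_lam, best_bic = select_lambda(grid, toy_bic)
```

Because each value of lambda is evaluated independently, the loop inside `select_lambda` is the part that can be spread across several workers.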

Models for the animal datasets were developed as linear mixed models (LMM) with the “lmer” function of the “lme4” package, and the random effect was set as animal ID, where each animal had its own random intercept. We assumed that the predictor variables did not interact, and coefficients associated with the same random-effects term were assumed to be correlated, as is the default in the “lmer” function. The lowest MSPE was taken to indicate acceptable goodness-of-fit. We also desired an interpretable model that had both low MSPE and a clear biological explanation. The best selection was extracted based on MSPE and interpretability.


Development of Methane Prediction Models:

Model selection considered five types of candidate models: (1) linear models based on variables selected by GLMMLASSO, (2) linear models based on variables selected by GLMMLASSO with log-transformed RSA of microbial data, (3) linear models based on variables selected by forward-stepwise selection, (4) linear models based on variables selected by forward-stepwise selection with log-transformed RSA of microbial data, and (5) a traditional linear model that contained only DMI, acetate: propionate ratio and BW. The candidate models were also evaluated by MSPE and RMSPE% based on 1000 CV. The final equations were evaluated using all the available data. MSPE and VIF were calculated in the testing dataset, and the equations for MSPE and BIC are given in chapter 4.

Results and Discussion:

Variable Selection:

Sheep dataset:

After data centering and scaling, we fitted a linear regression model with all variables, and our framework identified and discarded several variables to reduce the VIF. For the sheep dataset, we first identified 25 variables with a high VIF (greater than 100). We sequentially discarded 2 to 5 variables to reduce the VIF. Discarding 3 variables reduced the VIF only to below 30, whereas the target was below 10. Discarding 4 variables reduced the VIF to below 10, which resulted in 17 potential variable choices. We also added one choice that contained all variables. Thus, the total number of variable choices was 18. All discarded variables were MVs. Details of the variable choices are given in Table 5.1. The VIF results were the same among the 3 responses because multicollinearity depends only on the predictor variables. Based on those choices, our framework automatically initialized the models. Then, we set 10,000 iterations of the penalty parameter lambda from 0 to 1,000 by an increment of 0.1 and fitted linear mixed models with an L1 penalty by GLMMLASSO in the training datasets.
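The VIF screening above can be illustrated with a minimal Python sketch. It relies on the standard identity that the VIF of predictor j equals the j-th diagonal element of the inverse of the predictor correlation matrix. The helper names and the two example predictors are hypothetical, and the sketch does not reproduce the combination search used by our framework.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of predictor columns."""
    n = len(columns[0])
    centered = [[v - sum(c) / n for v in c] for c in columns]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    p = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Two hypothetical, strongly correlated predictors.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 2.0, 3.0, 5.0]
vifs = vif([x1, x2])
below_threshold = max(vifs) < 10  # the threshold of 10 used in this chapter
```

Here both predictors have a VIF of about 29, so one of them would have to be discarded before the maximum VIF could fall below 10.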

The LASSO converged successfully when lambda was set between 0 and 1,000. The changes in the coefficients of the variables are briefly shown in Figure 5.4 and Figure 5.5. The optimal lambda was selected by minimizing BIC. The threshold for variable selection was a p-value < 0.05. All results of variable selection after log transformation were the same as those before log transformation.

For the animal-based response (CH4 g/d per animal), the 18 variable selections by GLMMLASSO could be classified into 2 groups. The total number of selected MVs was 14, and details are given in Table 5.4. The three MVs common to all selections were one unknown genus of candidate order Z20, Megasphaera, and Oxalobacter. Selection 2 did not contain fungi or protozoa.

For the DMI-based response (CH4 g/kg DMI), the 18 variable selections by GLMMLASSO could be classified into 5 groups. The number of selected MVs was 14, and details are given in Table 5.4. All the selections contained MVs, and they all contained one unknown genus of candidate order Z20. One selection contained 4 variables, which were acetate:propionate ratio and 3 MVs. The other four selections contained only MVs.

For the MBW-based response (CH4 g/d/MBW), the 18 variable selections by GLMMLASSO could be classified into 3 groups. All the selections included gender (male), and male animals had greater methane production and DMI, as shown in Figure 5.6. The total number of selected MVs was 11. Details are given in Table 5.4. The common MVs were Coribacteriaceae, Olsenella, Parabacteroides, candidate order YS2, Eubacterium-L, Ruminococcaceae, Megasphaera, candidate order Z20, Oxalobacter, and candidate genus Cyllamyces 2 (fungi).
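The BIC-based choice of lambda described above can be sketched in Python under strong simplifying assumptions. The sketch fits an ordinary LASSO by coordinate descent and ignores the animal random effect, so it is not the GLMMLASSO algorithm, and its lambda scale is not comparable to the grid used in this study. The BIC is the Gaussian form up to a constant, with the number of nonzero coefficients as the model size. All names and data are hypothetical.

```python
import math

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma (the L1 proximal step)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_fit(X, y, lam, sweeps=200):
    """Coordinate descent for (1/(2n)) * RSS + lam * sum(|beta|).
    X is a list of rows; the columns of X and y are assumed to be centered."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    scale = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    resid = list(y)
    for _ in range(sweeps):
        for j in range(p):
            if scale[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + scale[j] * beta[j]
            new = soft_threshold(rho, lam) / scale[j]
            step = new - beta[j]
            if step != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * step
                beta[j] = new
    return beta

def bic(X, y, beta):
    """n * log(RSS / n) + k * log(n), with k the number of nonzero coefficients."""
    n = len(y)
    rss = sum((y[i] - sum(x * b for x, b in zip(X[i], beta))) ** 2 for i in range(n))
    k = sum(1 for b in beta if b != 0.0)
    return n * math.log(max(rss, 1e-12) / n) + k * math.log(n)

def select_lambda(X, y, grid):
    """Return the lambda on the grid whose fit has the smallest BIC."""
    return min((bic(X, y, lasso_fit(X, y, lam)), lam) for lam in grid)[1]

# Hypothetical centered data: x1 has a strong effect, x2 a very weak one.
x1 = [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
x2 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
noise = [1.0, -1.0, -1.0, 1.0, -1.0, 1.0, 1.0, -1.0]
y = [2.0 * a + 0.05 * b + 0.5 * c for a, b, c in zip(x1, x2, noise)]
X = [[a, b] for a, b in zip(x1, x2)]
best_lambda = select_lambda(X, y, [0.01, 0.1, 1.0, 5.0, 20.0])
best_beta = lasso_fit(X, y, best_lambda)
```

In this toy example the smallest lambda keeps both predictors, but the BIC prefers the next value on the grid, which sets the coefficient of the weak predictor to exactly zero while shrinking the strong one only slightly.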

After variable selection, we built linear mixed models to extract the best subset of MVs for each response. Details of MSPE and BIC are given in Table 5.2. We tested the performance of the models using 1,000 cross-validations with uncentered and unscaled data. For the animal-based response, the mean MSPE of the 2 models were 4.849 and 5.509, and the mean BIC were 933.8 and 938.2. The best selection, with the smallest MSPE, was selection 1, and the size of the best model was 10. The variance of the animal random effect was 2.50, and the random error variance was 2.73. Variables with a larger absolute t-value are more significant. The smallest absolute t-value was 2.30, which shows that all variables were significant and that the null hypothesis of a zero coefficient was rejected for each of them. The MV with the largest absolute t-value was one unknown genus of candidate order Z20, which had a strongly negative relationship with methane production per animal per day. One unknown genus in the Ruminococcaceae family had the strongest positive relationship with methane production.

For the DMI-based response, the mean MSPE of the 5 models were 2.398, 2.392, 2.371, 2.295, and 2.446. The best model was variable selection 6, which contained only 10 MVs, as shown in Table 5.4. Nine of the 10 MVs in the best model were bacteria, and the remaining one was an archaeon. The variance of the animal random effect was small and similar to the random error variance. One unknown genus in candidate order Z20 again had the largest negative t-value, and the unknown genus in the Ruminococcaceae family had the largest positive t-value.


For the MBW-based response, the mean MSPE of the 3 models were 0.0155, 0.0159, and 0.0159. The best selection was selection 8, and its size was 15. Gender (male), DMI, BW, and acetate:propionate ratio were significant among the animal variables. One unknown genus of candidate order Z20 again had the largest negative t-value, and the unknown genus in the Ruminococcaceae family had the largest positive t-value.

Cow dataset:

For the cow dataset, we also tested three responses in the combined dataset. The combined dataset is challenging for two reasons. First, the combined data came from two independent studies on the effects of 8 different dietary treatments on methane mitigation. Second, the experimental animals were different (cows vs. steers). After data centering and scaling, we identified 2 different variable choices, one of which reduced the VIF by discarding one variable. The VIF of the combined dataset with all available data was reduced to below 10 by discarding 1 variable. The discarded variable was Methanobrevibacter, and the maximum VIF without discarding any MV was over 50. The variable choices are listed in Table 5.3. We set 10,000 iterations of the penalty parameter lambda from 0 to 1,000 by an increment of 0.1 and fitted a linear mixed model with an L1 penalty by GLMMLASSO in the training datasets. We calculated the BIC for each lambda and selected the optimal lambda as the minimizer of the BIC. The threshold for variable selection was a p-value < 0.05. GLMMLASSO selection produced parsimonious variable selections, and the set of predictor variables in the GLMMLASSO selection was much smaller than that chosen by forward-stepwise selection. The combined dataset might have more complex local minima; thus, the global minimum was hard to find by forward-stepwise selection.

For the animal-based response, the variable selections by GLMMLASSO could be classified into 2 models, and details are given in Table 5.5. Variable selection number 1 contained only animal variables and MVs. The eight dietary treatments, entered as factor variables, were not significant. The total number of MVs was 8. The size of the first selection was 12. Five of them were animal variables: DMI, total VFA, acetate molar proportion, propionate proportion, and acetate:propionate ratio. The MVs were Bifidobacterium, Prevotella, Clostridium, Ruminococcus, Desulfovibrio, Synergistes, Methanobrevibacter, and Methanosphaera. The random effect of animal was 113.033, and the random error was 112.940. The p-values of Clostridium, Desulfovibrio, and Methanobrevibacter were lower than the p-values of DMI and acetate:propionate ratio. The variable with the largest t-value was DMI, and Methanobrevibacter had the largest t-value among the MVs. This selection is valuable because it can substitute for the eight dietary treatments. For the other two responses, the results of variable selection were also parsimonious, but we did not pursue those selections because they contained dietary treatments as variables, whereas our study sought to explain the effects of MVs and animal data.

Model Selection:

Except for models 11 and 16, log-transformed relative sequence abundances decreased the mean MSPE. The best animal-based and DMI-based models were generated with log-transformed data, possibly because the log transformation increased the linearity of those models and the normality of the microbial data, as described in Chapter 3. However, log-transformed data did not always decrease the MSPE of the animal-based models, which indicates that the monotonicity of the data is complex: neither the original nor the log-transformed microbial data were always monotonically related to the response.
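The log transformation used here is log(relative sequence abundance + 1), with abundance expressed as a proportion, as stated in the footnote to Table 5.11. A minimal Python sketch follows. The natural logarithm is an assumption, because the base is not stated, and the abundance values are hypothetical.

```python
import math

def log_transform(rsa):
    """Return log(RSA + 1) for a relative sequence abundance given as a proportion."""
    if rsa < 0.0:
        raise ValueError("relative sequence abundance cannot be negative")
    return math.log1p(rsa)  # natural log of (1 + rsa); remains defined at rsa = 0

# Hypothetical abundances: an absent taxon, a rare taxon, and a dominant taxon.
abundances = [0.0, 0.001, 0.25]
transformed = [log_transform(a) for a in abundances]
```

Adding 1 before taking the logarithm keeps taxa with zero abundance at zero instead of producing an undefined value, and the transformation preserves the ordering of the abundances.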


We fitted models with the “traditional variables”: DMI, BW, and the acetate:propionate ratio. The traditional models had RMSPE of 13.0%, 12.9%, and 13.1%. For the sheep dataset, all the best models were selected by GLMMLASSO, and they reduced RMSPE by 2.0, 1.5, and 2.5 percentage points, respectively. The largest VIF in each of the 3 models was below 3, which indicates low multicollinearity. For the cow dataset, we tested the performance of the two GLMMLASSO selections. The mean RMSPE% of the 2 selections with the original data were 9.11% and 9.01%. Thus, the MSPE of the two selections were similar, but selection 1 had better interpretability because it contained only animal variables and MVs. Because the dataset combined two studies, we also plotted a normal QQ plot to check the residuals of the linear mixed model. As shown in Figure 5.9, there was no obvious violation of the assumption of normally distributed residuals.

Summary

We developed basic frameworks and R scripts for model selection and model fitting and combined them with model validation. We selected models by minimizing the BIC over a set of models with different L1 penalty parameters. An alternative is to select the model by minimizing the cross-validation error, which may improve prediction accuracy, but this method is time-consuming. We used GLMMLASSO regression with linear mixed models to fit prediction models for the methane production of cows and sheep. GLMMLASSO selection produced parsimonious models in the two datasets for all responses. Generally, the results of GLMMLASSO selection were better than those of forward-stepwise selection and showed the computational advantages of GLMMLASSO selection. Inclusion of MVs in the form of RSA reduced RMSPE% by 2.0 to 2.53 percentage points. t-tests of variable effects indicated the significance of the RSA of MVs in the prediction models. Further studies can focus on exploring the variables common to the cow and sheep datasets, interaction terms, and the between-animal variability described by the animal random effects.

Figure 5.1. Flowchart showing the framework of the GLMMLASSO.

Figure 5.2. Flowchart showing the VIF reduction before GLMMLASSO selection.

Figure 5.3. Flowchart showing initializations before GLMMLASSO selection.

Figure 5.4. Regularization performed by the GLMMLASSO selection in the sheep dataset for methane production per sheep.

Figure 5.5. Observed methane production vs. predicted methane production using models 1 to 5.

*Red line is the theoretically perfect prediction. Blue line is each model’s prediction.

Figure 5.6. Observed methane production vs. predicted methane production using models 6 to 10.

*Red line is the theoretically perfect prediction. Blue line is each model’s prediction.

Figure 5.7. Observed methane production vs. predicted methane production using models 11 to 15.

*Red line is the theoretically perfect prediction. Blue line is each model’s prediction.

Figure 5.8. Observed methane production vs. predicted methane production using models 16 to 22.

*Red line is the theoretically perfect prediction. Blue line is each model’s prediction.

Figure 5.9. QQ plot of the residuals of animal-based models from the cow dataset, according to g methane per kg MBW per day.

Table 5.1. From the sheep dataset, discarding 4 potential variables reduced VIF to below 10 and 17 different combinations of variables resulted.

Number of Discarded Variables | Max VIF | Variable Choices Number
4 | Below 10 | 1 to 17
0 | Over 100 | 18

Table 5.2. Cross-validation of Variables Selections Made by GLMMLASSO of Three Responses in the Sheep Dataset.

Variable Selection Number | MSPE | RMSPE% | BIC | Random Effect Variance | Random Error Variance | Response | Variable Choices Number
1* | 4.84883119 | 0.100358005 | 933.8157 | 2.509345 | 2.732336 | Animal-based | 1 to 17
2 | 5.50854484 | 0.106967521 | 938.2285 | 3.436321 | 2.759862 | Animal-based | 18 (FULL)
3 | 2.39758762 | 0.102549715 | 783.9113 | 1.436005 | 1.237723 | DMI-based | 1, 4 to 6, 16, 17
4 | 2.39151539 | 0.102419772 | 793.5167 | 1.465951 | 1.245492 | DMI-based | 2, 3, 7 to 9,
5 | 2.37140162 | 0.101988163 | 793.2583 | 1.369193 | 1.268671 | DMI-based | 10 to 12
6* | 2.29482066 | 0.100327869 | 797.0942 | 1.467845 | 1.243705 | DMI-based | 13 to 15
7 | 2.44574515 | 0.103574492 | 852.9684 | 1.646857 | 1.490878 | DMI-based | 18
8* | 0.01553768 | 0.099311498 | -225.7005 | 0.006571867 | 0.008979877 | MBW-based | 1 to 12
9 | 0.01595354 | 0.100631739 | -229.9842 | 0.006667373 | 0.009048924 | MBW-based | 13
10 | 0.01595497 | 0.100636249 | -233.8937 | 0.007307834 | 0.008888008 | MBW-based | 14 to 18

*The Selection had the lowest MSPE. Details of each variable were shown in Table 5.4.
**Lower MSPE, RMSPE%, and BIC were preferred.

Table 5.3. From the cow dataset, discarding 1 potential variable reduced VIF to below 10.

Number of Discarded Variables | Max VIF | Variable Choices Number
0 | 55.4 | 19
1 | Below 10 | 20

Table 5.4. Details of the Variables Selected of Three Responses in the Sheep Dataset.

Variable Selection Number | Variables Selected (p-value < 0.05) | Response | Variable Choices Number
1 | DMI, BW, Coribacteriaceae (a bacterial family), YS2 (a candidate bacterial order), Pseudoramibacter, Eubacterium, Ruminococcaceae (a bacterial family), Megasphaera, Z20 (a candidate bacterial order), Oxalobacter, Dasytricha (protozoa), Cyllamyces 2 (fungi) | Animal-based | 1 to 17
2 | DMI, BW, Collinsella, Eggerthella, Caminicella, Ruminococcaceae (a bacterial family), Megasphaera, Z20 (a bacterial order), Oxalobacter, Neisseria | Animal-based | 18 (FULL)
3 | Methanosphaera (methanogens), YS2 (a candidate bacterial order), Catabacteriaceae (a bacterial family), Pseudoramibacter, Blautia, Eubacterium, Ruminococcaceae (a bacterial family), Ruminococcus-R, Megasphaera, Z20 (a candidate bacterial order), Oxalobacter, Neisseriaceae (a bacterial family) | DMI-based | 1, 4 to 6, 16, 17
4 | Methanosphaera (methanogens), YS2 (a candidate bacterial order), Catabacteriaceae (a bacterial family), Pseudoramibacter, Blautia, Eubacterium, Ruminococcaceae (a bacterial family), Ruminococcus-R, Z20 (a candidate bacterial order), Oxalobacter, Neisseriaceae (a bacterial family) | DMI-based | 2, 3, 7 to 9,
5 | Methanobrevibacter (methanogens), YS2 (a candidate bacterial order), Catabacteriaceae (a bacterial family), Pseudoramibacter, Blautia, Eubacterium, Ruminococcaceae (a bacterial family), Ruminococcus-R, Megasphaera, Z20 (a candidate bacterial order), Oxalobacter | DMI-based | 10 to 12
6 | Methanosphaera (methanogens), YS2 (a candidate bacterial order), Catabacteriaceae (a bacterial family), Pseudoramibacter, Eubacterium, Ruminococcaceae (a bacterial family), Ruminococcus-R, Megasphaera, Z20 (a candidate bacterial order), Oxalobacter, Neisseriaceae (a bacterial family) | DMI-based | 13 to 15
7 | A:P ratio, Olsenella, YS2 (a candidate bacterial order), Ruminococcaceae (a bacterial family) | DMI-based | 18
8 | Gender, DMI, BW, RCC and relatives (archaea), Coribacteriaceae (a bacterial family), Olsenella, Parabacteroides, YS2 (a candidate bacterial order), Eubacterium, Ruminococcaceae (a bacterial family), Megasphaera, Z20 (a candidate bacterial order), Oxalobacter, Dasytricha (protozoa), Cyllamyces 2 (fungi) | MBW-based | 1 to 12
9 | Gender, DMI, BW, Coribacteriaceae, Olsenella, Parabacteroides, YS2, Eubacterium, Ruminococcaceae, Megasphaera, Z20, Oxalobacter, Dasytricha (protozoa), Cyllamyces 2 | MBW-based | 13
10 | Gender, DMI, BW, Coribacteriaceae, Olsenella, Parabacteroides, YS2, Eubacterium, Ruminococcaceae, Megasphaera, Z20, Oxalobacter, Cyllamyces 2 | MBW-based | 14 to 17

Table 5.5. Two Variables Selections of Animal-based Response in the Combined Cow Dataset.

Variable Selection Number | Variables Selected (p-value < 0.05) | Response | Variable Choices Number
11 | DMI (kg/day), Total VFA (mmol/L), Acetate Ratio, Propionate Ratio, Acetate Propionate Ratio, Bifidobacterium, Prevotella, Clostridium, Ruminococcus, Desulfovibrio, Synergistes, Methanobrevibacter, Methanosphaera | Animal-based | 19
12 | Additive 1, Additive 2, Additive 3, Additive 4, Additive 5, Additive 6, Additive 7, Additive 8, DMI (kg/day), Total VFA (mmol/L), Acetate Ratio, Propionate Ratio … | Animal-based | 20

Table 5.6. T-value of the Best Selection of Animal-based Response in the Sheep Dataset.

Variable | t value | p-value
(Intercept) | -5.50119 | 1.12E-07
DMI | 7.004823 | 3.52E-11
BW | 4.44579 | 1.44E-05
A:P ratio | 8.407453 | 7.19E-15
Coribacteriaceae (a bacterial family) | 2.715777 | 0.007179
YS2 (a bacterial order) | 2.47709 | 0.014059
Pseudoramibacter | -2.56821 | 0.010937
Eubacterium (a bacterial genus) | -4.07405 | 6.61E-05
Ruminococcaceae (a bacterial family) | 4.288119 | 2.78E-05
Megasphaera (a bacterial genus) | -2.67139 | 0.008164
Z20 (a candidate bacterial order) | -4.73206 | 4.15E-06
Oxalobacter (a bacterial genus) | -2.91872 | 0.003909
Dasytricha (a protozoal genus) | -3.07113 | 0.002423
Cyllamyces 2 (a candidate fungal genus) | -2.30624 | 0.022102

**Higher T-values were more significant.

Table 5.7. T-value of the Best Selection of DMI-based Response in the Sheep Dataset.

Variable | t value | p-value
(Intercept) | 19.18696 | 8.05E-48
Methanosphaera (archaea) | -1.16994 | 0.243369
YS2 | 3.56379 | 0.000454
Catabacteriaceae | 0.437353 | 0.662311
Pseudoramibacter | -2.32913 | 0.020817
Eubacterium | -2.05973 | 0.040675
Ruminococcaceae | 5.511258 | 1.05E-07
Ruminococcus-R (a bacterial genus) | -2.1847 | 0.030033
Z20 | -2.99401 | 0.003089
Oxalobacter | -1.75727 | 0.080351
Neisseriaceae (a bacterial family) | -0.87922 | 0.380301

**Higher T-values were more significant.

Table 5.8. T-value of the Best Selection of MBW-based Response in the Sheep Dataset.

Variable | t value | p-value
(Intercept) | 4.129266 | 5.33E-05
Gender as Male | 2.618438 | 0.009506
DMI | 4.164402 | 4.63E-05
BW | -3.38003 | 0.000871
A:P ratio | 7.535473 | 1.63E-12
RCC and relatives (a group of archaea) | 1.685491 | 0.093446
Coribacteriaceae | 3.597226 | 0.000405
Olsenella | -2.9458 | 0.003601
Parabacteroides | -3.2844 | 0.001206
YS2 | 2.357931 | 0.019337
Eubacterium | -3.39237 | 0.000834
Ruminococcaceae | 4.465094 | 1.33E-05
Megasphaera | -2.92618 | 0.003826
Z20 | -5.89838 | 1.53E-08
Oxalobacter | -2.52233 | 0.012433
Dasytricha | -2.82087 | 0.00527
Cyllamyces 2 | -3.36632 | 0.000913

**Higher T-values were more significant.

Table 5.9. T-value of the Best Selection of Animal-based Response in the Cow Dataset.

Variables | Estimate Coefficient | Standard Error | T-value
(Intercept) | -132.785 | 95.98132 | -1.38345
DMI (kg/day) | 14.61953 | 1.220177 | 11.99127
Total VFA (mmol/L) | 0.121349 | 0.218205 | 0.559662
Acetate Ratio | -0.23811 | 1.837925 | -0.13005
Propionate Ratio | 1.432274 | 2.889 | 0.494335
Acetate Propionate Ratio | 32.79266 | 20.25559 | 1.622946
Bifidobacterium | 1243.677 | 1116.153 | 1.115272
Prevotella | -20.7142 | 57.76627 | -0.36038
Clostridium | -276.978 | 681.6019 | -0.41094
Ruminococcus | 438.6303 | 422.114 | 1.034125
Desulfovibrio | 7698.302 | 4436.473 | 1.751235
Synergistes | 1252.842 | 11625.65 | 0.110978
Methanobrevibacter | 150.7443 | 49.04012 | 3.076043
Methanosphaera | 165.7874 | 131.453 | 1.262825
MSPE | 951.113 | 487.2895 |
RMSPE | 0.0911 | 0.0377 |
BIC | 811.63 | 4.886977 |
Max VIF | 49.320 | |
Random effect from animal variance σb² | 113.033 | 112.940 |
Random error variance σ² | 943.4875 | 942.5699 |

**Higher T-values were more significant.

Table 5.10. Cross-validations of Five Models for Each Response in the Sheep Dataset.

Model Number | MSPE | RMSPE% | Random error | Model (Response) | Variable Selection Categories
1 | 5.799 | 11.0 | 2.268 | Animal-based | GLMMLASSO
2* | 5.579 | 10.8 | 2.292 | Animal-based | GLMMLASSO with Log Transformation
3 | 8.186 | 13.0 | 2.827 | Animal-based | Traditional
4 | 9.614 | 14.1 | 2.561 | Animal-based | Forward-stepwise
5 | 10.614 | 14.8 | 2.311 | Animal-based | Forward-stepwise with Log Transformation
6 | 2.973 | 11.4 | 1.629 | DMI-based | GLMMLASSO
7* | 2.906 | 11.3 | 1.632 | DMI-based | GLMMLASSO with Log Transformation
8 | 3.810 | 12.9 | 1.907 | DMI-based | Traditional
9 | 3.898 | 13.1 | 1.4125 | DMI-based | Forward-stepwise
10 | 3.886 | 13.1 | 1.653 | DMI-based | Forward-stepwise with Log Transformation
11* | 0.0175 | 10.6 | 0.1241 | MBW-based | GLMMLASSO
12 | 0.0178 | 10.6 | 0.1666 | MBW-based | GLMMLASSO with Log Transformation
13 | 0.027 | 13.1 | 0.1617 | MBW-based | Traditional
14 | 0.018 | 11.6 | 0.1254 | MBW-based | Forward-stepwise
15 | 0.023 | 12.0 | 0.1661 | MBW-based | Forward-stepwise with Log Transformation

*The Model Selection had Lowest MSPE. Details of each model were collected in table.
**Lower MSPE, RMSPE% and BIC were Preferred.

Table 5.11. Equations of the Best Model for the Sheep Methane Production.

Response | Equation
Animal-based* | CH4 g per day = −9.716 + 9.544 × DMI + 2.372 × Acetate Propionate Ratio + 134.188 × Coribacteriaceae + 55.913 × YS2 + 29.662 × Ruminococcaceae − 639.252 × Pseudoramibacter − 163.232 × EubacteriumL − 581.573 × Megasphaera − 357.467 × Z20 − 1476.127 × Oxalobacter − 4.994 × Dasytricha − 1041.872 × Cyllamyces 2
DMI-based* | CH4 g per DMI = 14.585 + 14.585 × Methanosphaera + 2.838 × YS2 + 62.278 × Coribacteriaceae + 3.966 × Pseudoramibacter + 23.311 × EubacteriumL + 34.128 × RuminococcusR − 432.115 × Blautia − 76.180 × Ruminococcaceae − 25.677 × Megasphaera − 189.327 × Z20 − 731.127 × Oxalobacter − 477.139 × Neisseriaceae
MBW-based | CH4 g per MBW = 0.439 + 0.093 × Gender as Male − 0.402 × DMI − 0.008 × BW − 2.704 × Olsenella − 8.026 × Parabacteroides − 7.586 × EubacteriumL + 1.785 × Ruminococcaceae − 34.892 × Megasphaera − 26.444 × Z20 − 71.223 × Oxalobacter − 0.257 × Dasytricha − 84.090 × Cyllamyces 2 + 0.126 × Acetate Propionate Ratio + 0.272 × RCC and relatives + 10.033 × Coribacteriaceae + 2.934 × YS2

*Log Transformation = log (relative sequence abundance of Microbes + 1).
**Relative sequence abundance is in unit of proportion.

Table 5.12. Cross-validations of Seven Models for Animal-based Response in the Cow Dataset.

Model Number | Mean Max VIF | Mean RMSPE% of CH4 (g/d·Animal) | Mean BIC | Response | Variable Selection
16* | 49.32 | 9.11 | 826.925 | Animal-based | GLMMLASSO (Variable Selection 1)
17 | 49.32 | 9.15 | 811.63 | Animal-based | GLMMLASSO with Log Transformation
18 | 1.944 | 9.01 | 721.435 | Animal-based | GLMMLASSO (Variable Selection 2)
19 | 1.944 | 9.01 | 799.246 | Animal-based | GLMMLASSO with Log Transformation
20 | 24.448 | 11.08 | 1,027.52 | Animal-based | Traditional
21 | 3.541 | 12.67 | 1,007.62 | Animal-based | Forward-wise
22 | 3.421 | 11.27 | 1,004.65 | Animal-based | Forward-wise with Log Transformation

*Best Model Selection. Details of each model were collected in table.
**Lower MSPE, RMSPE% and BIC were Preferred.

Table 5.13. Equation of Animal-based Model Cow Dataset.

Response | Equation
Animal-based | CH4 g per day = −132.785 + 14.620 × DMI + 0.121 × Total VFA + 1243.677 × Bifidobacterium + 438.630 × Ruminococcus + 7698.302 × Desulfovibrio + 1252.842 × Synergistes + 150.744 × Methanobrevibacter + 165.787 × Methanosphaera − 0.238 × Acetate Propionate Ratio − 20.714 × Prevotella − 276.978 × Clostridium

*Log Transformation = log (relative sequence abundance of microbes + 1).
**Relative sequence abundance is in unit of proportion.

Chapter 6. Discussion and Conclusion

Discussion

We built four frameworks for model selection. Best subset selection can identify and choose the predictor variables for models, but it is very time-consuming if a dataset has more than 40 variables. With the computing power available today, best subset selection is only suitable for datasets that have fewer than 30 variables. As computing power develops, we may be able to use best subset selection on larger datasets in the future. Forward-stepwise selection is suitable for datasets that have no more than 300 variables. GLMMLASSO combined with cross-validation produced better prediction models than forward-stepwise selection. However, the multicollinearity should first be reduced so that the VIF is less than 10. Our model selection was primarily based on minimizing BIC because BIC penalizes model size more heavily than AIC and thus offers more protection against overfitting. An alternative that may improve prediction accuracy is to select models by minimizing cross-validation (CV) errors, but cross-validation can be very time-consuming for linear mixed models. Log transformation of microbial data generally improved prediction accuracy, which indicates better monotonicity in those data. Interestingly, gender was selected as a predictor variable in the MBW-based model selected by GLMMLASSO, which exemplifies that our frameworks can be used to detect whether a variable is significant in large and comprehensive datasets. The variances of the animal random effects in the LMMs were small and similar to the random error variances. Thus, more data are needed to investigate the random effects and to further explore the variation in methane production among animals.
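The difference in computational cost between best subset selection and forward-stepwise selection can be made concrete with a small count of the models that must be fitted. This Python sketch only counts candidate models. It assumes that best subset selection fits every subset of p variables and that forward-stepwise selection runs until all p variables have entered, and it ignores the cost of each individual mixed-model fit. The function names are hypothetical.

```python
def best_subset_fits(p):
    """Number of candidate models when every subset of p variables is fitted."""
    return 2 ** p

def forward_stepwise_fits(p):
    """Upper bound on fits for a full forward path: p + (p - 1) + ... + 1."""
    return p * (p + 1) // 2

subsets_30 = best_subset_fits(30)          # 1,073,741,824 models
subsets_40 = best_subset_fits(40)          # 1,099,511,627,776 models
stepwise_300 = forward_stepwise_fits(300)  # 45,150 models
```

Moving from 30 to 40 variables multiplies the number of best-subset fits by 1,024, whereas a full forward-stepwise path over 300 variables still requires only about 45,000 fits.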

Conclusion

In conclusion, four frameworks were built for variable selection. GLMMLASSO selection was preferred for variable selection and was shown to be useful for detecting significant variables in a comprehensive dataset. GLMMLASSO selection combined with CV improved methane production models compared with forward-stepwise selection in a sheep dataset. Our new models contain multiple taxa of rumen microbes. Future research can investigate the interactions of these microbial taxa with other factors that directly or indirectly contribute to methane production, such as feed intake and digestion, rumen methanogens, and hydrogen-producing and hydrogen-consuming rumen microbes. Measurement of methane production and quantitative analysis of these microbial taxa in animals fed different diets will also further validate the “methane predictor” status of these microbial taxa.


Reference

Baldwin, R.L. 1995. Modeling Ruminant Digestion and Metabolism. Springer Science & Business Media.

Baldwin, R.L., L.J. Koong, and M.J. Ulyatt. 1977. A dynamic model of ruminant digestion for evaluation of factors affecting nutritive value. Agric. Syst. 2:255–288.

Bannink, A., M.W. Van Schijndel, and J. Dijkstra. 2011. A model of enteric fermentation in dairy cows to estimate methane emission for the Dutch National Inventory Report using the IPCC Tier 3 approach. Anim. Feed Sci. Technol. 166:603–618.

Barnesiella. 2015. Bergey’s Man. Syst. Archaea Bact. doi:10.1002/9781118960608.gbm00242.

Bates, D., M. Maechler, B. Bolker, and S. Walker. 2014. lme4: Linear mixed-effects models using Eigen and S4. R Packag. version 1:1–23.

Beauchemin, K.A., M. Kreuzer, F. O’Mara, and T.A. McAllister. 2008. Nutritional management for enteric methane abatement: a review. Aust. J. Exp. Agric. 48:21–27.

Benchaar, C., J. Rivest, C. Pomar, and J. Chiquette. 1998. Prediction of methane production from dairy cows using existing mechanistic models and regression equations. J. Anim. Sci. 76:617–627.

Blaxter, K.L., and J.L. Clapperton. 1965. Prediction of the amount of methane produced by ruminants. Brit. J. Nutr. 19:511–522.

Bratzler, J.W., and E.B. Forbes. 1940. The Estimation of Methane Production by Cattle: One Figure. J. Nutr. 19:611–613.

Caporaso, J.G., J. Kuczynski, J. Stombaugh, K. Bittinger, F.D. Bushman, E.K. Costello, N. Fierer, A.G. Pena, J.K. Goodrich, and J.I. Gordon. 2010. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7:335.

Chapman, R., S. Cook, C. Donough, Y.L. Lim, P.V.V. Ho, K.W. Lo, and T. Oberthür. 2018. Using Bayesian networks to predict future yield functions with data from commercial oil palm plantations: A proof of concept analysis. Comput. Electron. Agric. 151:338–348.

Coplin, D.L., R.D. Frederick, D. Majerczak, and E. Haas. 2012. Ohio Agricultural Research and Development Center. Page 413 in Plant pathogenic bacteria: Proceedings of the Sixth International Conference on Plant Pathogenic Bacteria, Maryland, June 2–7, 1985. Springer Science & Business Media.

Dijkstra, J., H.D.S.C. Neal, D.E. Beever, and J. France. 1992. Simulation of nutrient digestion, absorption and outflow in the rumen: model description. J. Nutr. 122:2239–2256.

Einasto, M., L.J. Liivamägi, E. Saar, J. Einasto, E. Tempel, E. Tago, and V.J. Martínez. 2011. SDSS DR7 superclusters-Principal component analysis. Astron. Astrophys. 535:A36.

Ellis, J.L., A. Bannink, J. France, E. Kebreab, and J. Dijkstra. 2010. Evaluation of enteric methane prediction equations for dairy cows used in whole farm models. Glob. Chang. Biol. 16:3246–3256.

Ellis, J.L., J. Dijkstra, E. Kebreab, A. Bannink, N.E. Odongo, B.W. McBride, and J. France. 2008. Aspects of rumen microbiology central to mechanistic modelling of methane production in cattle. J. Agric. Sci. 146:213–233.

Ellis, J.L., E. Kebreab, N.E. Odongo, B.W. McBride, E.K. Okine, and J. France. 2007. Prediction of methane production from dairy and beef cattle. J. Dairy Sci. 90:3456–3466.

EPA, A. 2011. Inventory of US greenhouse gas emissions and sinks: 1990-2009. Environ. Prot. Agency 2012.

FAO, I. 2016. WFP (2015), The State of Food Insecurity in the World 2015. Meeting the 2015 international hunger targets: taking stock of uneven progress. Food Agric. Organ. Publ. Rome.

France, J., J.H.M. Thornley, and D.E. Beever. 1982. A mathematical model of the rumen. J. Agric. Sci. 99:343–353.

Friedman, J., T. Hastie, and R. Tibshirani. 2001. The Elements of Statistical Learning. Springer series in statistics New York.

Gerber, P.J., H. Steinfeld, B. Henderson, A. Mottet, C. Opio, J. Dijkman, A. Falcucci, and G. Tempio. 2013. Tackling Climate Change through Livestock: A Global Assessment of Emissions and Mitigation Opportunities. Food and Agriculture Organization of the United Nations (FAO).

Groll, A., and G. Tutz. 2014. Variable selection for generalized linear mixed models by L1-penalized estimation. Stat. Comput. 24:137–154.

Hammond, K.J., L.A. Crompton, A. Bannink, J. Dijkstra, D.R. Yáñez-Ruiz, P. O’Kiely, E. Kebreab, M.A. Eugène, Z. Yu, and K.J. Shingfield. 2016. Review of current in vivo measurement techniques for quantifying enteric methane emission from ruminants. Anim. Feed Sci. Technol. 219:13–30.

Hobson, P.N., and C.S. Stewart. 2012. The Rumen Microbial Ecosystem. Springer Science & Business Media.

Holter, J.B., and A.J. Young. 1992. Methane Prediction in Dry and Lactating Holstein Cows. J. Dairy Sci. 75:2165–2175.

Houghton, J.T., L.G. Meira Filho, B. Lim, K. Treanton, and I. Mamaty. 1997. Revised 1996 IPCC guidelines for national greenhouse gas inventories. v. 1: Greenhouse gas inventory reporting instructions.-v. 2: Greenhouse gas inventory workbook.-v. 3: Greenhouse gas inventory reference manual.

Johnson, K.A., and D.E. Johnson. 1995. Methane emissions from cattle. J. Anim. Sci. 73:2483–2492.

Kassambara, A., and F. Mundt. 2016. Factoextra: extract and visualize the results of multivariate data analyses. R Packag. version 1.

Kim, M., T. Park, and Z. Yu. 2017. Metagenomic investigation of gastrointestinal microbiome in cattle. Asian-Australasian J. Anim. Sci. 30:1515.

Kriss, M. 1930. Quantitative relations of the dry matter of the food consumed, the heat production, the gaseous outgo, and the insensible loss in body weight of cattle. J. Agric. Res. 40:283–295.

Liu, Y., and W.B. Whitman. 2008. Metabolic, phylogenetic, and ecological diversity of the methanogenic archaea. Ann. N. Y. Acad. Sci. 1125:171–189.

Lodge-Ivey, S.L., J. Browne-Silva, and M.B. Horvath. 2009. Bacterial diversity and fermentation end products in rumen fluid samples collected via oral lavage or rumen cannula. J. Anim. Sci. 87:2333–2337.

Martin, C., D.P. Morgavi, and M. Doreau. 2010. Methane mitigation in ruminants: from microbe to the farm scale. Animal 4:351–365.

Martinez-Fernandez, G., S.E. Denman, C. Yang, J. Cheung, M. Mitsumori, and C.S. McSweeney. 2016. Methane inhibition alters the microbial community, hydrogen flow, and fermentation response in the rumen of cattle. Front. Microbiol. 7:1122.

McSweeney, C.S., S.E. Denman, A. Wright, and Z. Yu. 2007. Application of recent DNA/RNA-based techniques in rumen ecology. Asian-Australas. J. Anim. Sci. 20:283.

Meale, S.J., F. Chaucheyras-Durand, H. Berends, and M.A. Steele. 2017. From pre-to postweaning: Transformation of the young calf’s gastrointestinal tract. J. Dairy Sci. 100:5984–5995.

Meehl, G.A., T.F. Stocker, W.D. Collins, P. Friedlingstein, T. Gaye, J.M. Gregory, A. Kitoh, R. Knutti, J.M. Murphy, and A. Noda. 2007. Global climate projections.

Mills, J.A.N., J. Dijkstra, A. Bannink, S.B. Cammell, E. Kebreab, and J. France. 2001. A mechanistic model of whole-tract digestion and methanogenesis in the lactating dairy cow: model development, evaluation, and application. J. Anim. Sci. 79:1584–1597.

Mills, J.A.N., E. Kebreab, C.M. Yates, L.A. Crompton, S.B. Cammell, M.S. Dhanoa, R.E. Agnew, and J. France. 2003. Alternative approaches to predicting methane emissions from dairy cows. J. Anim. Sci. 81:3141–3150.

Moe, P.W., and H.F. Tyrrell. 1979. Methane production in dairy cows. J. Dairy Sci. 62:1583–1586.

Moraes, L.E., A.B. Strathe, J.G. Fadel, D.P. Casper, and E. Kebreab. 2014. Prediction of

enteric methane emissions from cattle. Glob. Chang. Biol. 20:2140–2148.

Moraes, L.E., J.E. Wilen, P.H. Robinson, and J.G. Fadel. 2012. A linear programming

model to optimize diets in environmental policy scenarios. J. Dairy Sci. 95:1267–

1282.

Morgavi, D.P., E. Forano, C. Martin, and C.J. Newbold. 2010. Microbial ecosystem and

methanogenesis in ruminants. Animal 4:1024–1036.

Moss, A.R., J.-P. Jouany, and J. Newbold. 2000. Methane production by ruminants: its

contribution to global warming. Pages 231–253 in Annales de zootechnie. EDP

Sciences.

Nieuwenhuis, R. 2018. Forward.Lmer: Basic Stepwise Function for Mixed Effects in R.

Accessed. http://www.rensenieuwenhuis.nl/r-sessions-32/.

Niu, M., E. Kebreab, A.N. Hristov, J. Oh, C. Arndt, A. Bannink, A.R. Bayat, A.F. Brito,

T. Boland, and D. Casper. 2018. Prediction of enteric methane production, yield and

intensity in dairy cattle using an intercontinental database. Glob. Chang. Biol.

Pachauri, R.K., M.R. Allen, V.R. Barros, J. Broome, W. Cramer, R. Christ, J.A. Church,

L. Clarke, Q. Dahe, and P. Dasgupta. 2014. Climate Change 2014: Synthesis Report.

Contribution of Working Groups I, II and III to the Fifth Assessment Report of the

Intergovernmental Panel on Climate Change. IPCC.

Ramin, M., and P. Huhtanen. 2013. Development of equations for predicting methane

emissions from ruminants. J. Dairy Sci. 96:2476–2493.

Rogelj, J., M. Den Elzen, N. Höhne, T. Fransen, H. Fekete, H. Winkler, R. Schaeffer, F.

Sha, K. Riahi, and M. Meinshausen. 2016. Paris Agreement climate proposals need

a boost to keep warming well below 2 °C. Nature 534:631.

Schwarz, G. 1978. Estimating the dimension of a model. Ann. Stat. 6:461–464.

Shlens, J. 2014. A tutorial on principal component analysis. arXiv Prepr.

arXiv:1404.1100.

Van Soest, P.J. 1994. Nutritional Ecology of the Ruminant. Cornell University Press.

Storm, I.M.L.D., A.L.F. Hellwing, N.I. Nielsen, and J. Madsen. 2012. Methods for

measuring and estimating methane emission from ruminants. Animals 2:160–183.

Swift, R.W., J.W. Bratzler, W.H. James, A.D. Tillman, and D.C. Meek. 1948. The

effect of dietary fat on utilization of the energy and protein of rations by

sheep. J. Anim. Sci. 7:475–485.

Venables, W.N., and B.D. Ripley. 2013. Modern Applied Statistics with S-PLUS.

Springer Science & Business Media.

Verbeke, G. 1997. Linear mixed models for longitudinal data. Springer.

Wilkerson, V.A., D.P. Casper, and D.R. Mertens. 1995. The prediction of methane

production of Holstein cows by several equations. J. Dairy Sci. 78:2402–2414.

Woo, P.C.Y., S.K.P. Lau, J.L.L. Teng, H. Tse, and K.-Y. Yuen. 2008. Then and now: use

of 16S rDNA gene sequencing for bacterial identification and discovery of novel

bacteria in clinical microbiology laboratories. Clin. Microbiol. Infect. 14:908–934.

Yu, Z., and M. Morrison. 2004. Improved extraction of PCR-quality community DNA

from digesta and fecal samples. Biotechniques 36:808–813.

Yu, Z., M. Yu, and M. Morrison. 2006. Improved serial analysis of V1 ribosomal

sequence tags (SARST‐V1) provides a rapid, comprehensive, sequence‐based

characterization of bacterial diversity and community composition. Environ.

Microbiol. 8:603–611.

Zimmerman, P.R. 1993. System for measuring metabolic gas emissions from animals.
