bioRxiv preprint doi: https://doi.org/10.1101/640334; this version posted October 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Complex ecological phenotypes on phylogenetic trees: a Markov process model for comparative
analysis of multivariate count data

Michael C. Grundler1,* and Daniel L. Rabosky1

1Museum of Zoology and Department of Ecology and Evolutionary Biology, University of
Michigan, Ann Arbor, MI USA 48109

*Corresponding author: [email protected]

ABSTRACT
The evolutionary dynamics of complex ecological traits – including multistate
representations of diet, habitat, and behavior – remain poorly understood. Reconstructing the
tempo, mode, and historical sequence of transitions involving such traits poses many challenges
for comparative biologists, owing to their multidimensional nature. Continuous-time Markov
chains (CTMC) are commonly used to model ecological niche evolution on phylogenetic trees
but are limited by the assumption that taxa are monomorphic and that states are univariate
categorical variables. A necessary first step in the analysis of many complex traits is therefore to
categorize species into a pre-determined number of univariate ecological states, but this
procedure can lead to distortion and loss of information. This approach also confounds
interpretation of state assignments with effects of sampling variation because it does not directly
incorporate empirical observations for individual species into the statistical inference model. In
this study, we develop a Dirichlet-multinomial framework to model resource use evolution on
phylogenetic trees. Our approach is expressly designed to model ecological traits that are
multidimensional and to account for uncertainty in state assignments of terminal taxa arising
from effects of sampling variation. The method uses multivariate count data for individual
species to simultaneously infer the number of ecological states, the proportional utilization of
different resources by different states, and the phylogenetic distribution of ecological states
among living species and their ancestors. The method is general and may be applied to any data
expressible as a set of observational counts from different categories.

Keywords: Ecological niche evolution; hidden Markov model; macroevolution; comparative
methods; Dirichlet-multinomial
Most species in the natural world make use of multiple, categorically distinct types of
ecological resources. Many butterfly species use multiple host plants, for example (Ehrlich &
Raven 1964; Robinson 1999). Insectivorous warblers in temperate North
America use multiple distinct microhabitats and foraging behaviors (MacArthur 1958), as do
honeyeaters in mesic and arid Australia (Miller et al. 2017). The evolution of novel patterns of
resource use can impact phenotypic evolution (Martin & Wainwright 2011; Davis et al. 2016),
diversification (Mitter et al. 1988; Givnish et al. 2014), community assembly (Losos et al. 2003;
Gillespie 2004), and ecosystem function (Harmon et al. 2009; Bassar et al. 2010). Consequently,
there has been substantial interest in understanding how ecological traits related to resource use
evolve and in exploring their impacts on other evolutionary and ecological phenomena (Vrba
1987; Futuyma & Moreno 1988; Forister et al. 2012; Price et al. 2012; Burin et al. 2016).

Making inferences about the evolutionary dynamics of resource use, however, first
requires summarizing the complex patterns of variation observed among taxa into traits that can
be modeled on phylogenetic trees. Typically, this is done by explicit or implicit projection of a
multidimensional ecological phenotype into a greatly simplified univariate categorical space (e.g.
“carnivore”, “omnivore”, “herbivore”). It is widely recognized that the real-world complexities
of resource use are not adequately described by a set of categorical variables (Hardy & Linder
2005; Hardy 2006). Nonetheless, it is also true that major differences in resource use can
sometimes be summed up in a small set of ecological states, a point made by Mitter et al. (1988)
in their study of phytophagy and insect diversification. For this reason, continuous-time Markov
chain (CTMC) models, which require classifying species into a set of character states, have
become commonplace in macroevolutionary studies of ecological trait evolution (Kelley &
Farrell 1998; Nosil 2002; Price et al. 2012; Hardy & Otto 2014; Cantalapiedra et al. 2014; Burin
et al. 2016). CTMC models describe a stochastic process for evolutionary transitions among a set
of character states and are used to infer ancestral states and evolutionary rates, and to perform
model-based hypothesis tests (O’Meara 2012). Many complex ecological traits, however, will be
poorly characterized by a set of univariate categorical states. Consequently, species classified in
a single state can nonetheless exhibit substantial differences in patterns of resource use, creating
challenges for interpreting evolutionary transitions among character states as well as for
understanding links between character state evolution and diversification.

An additional limitation of continuous-time Markov chains for modeling resource use
evolution emerges from the fact that species are classified into ecological states without regard
for the quality and quantity of information available to perform the classification exercise. As an
example, species with few ecological observations might be classified as specialists for a
particular resource, when their apparent specialization is strictly a function of the small number
of ecological observations available for the taxon. More generally, by failing to use a statistical
model for making resource state assignments, we neglect a major source of uncertainty in our
data: the uneven and incomplete knowledge of resource use across different taxa. This
uncertainty, in turn, has substantial implications for how we project patterns of resource use onto
a set of resource states. By failing to account for the uneven and finite sample sizes characteristic of
empirical data on resource use, we cannot be certain whether state assignments reflect true similarities
or differences in resource use or are merely the expected outcome of sampling variation.

Consider the simple example in Figure 1, consisting of four species and three resources.
Panels (a) and (b) illustrate the true resource states and their phylogenetic distribution across a
set of four species and their ancestors. Here, an ancestral specialist evolved a generalist diet via a
single transition (panel b), such that there are two extant species with the ancestral specialist diet
(species X and Y) and two with the derived generalist diet (species P and Q). Panel (c) illustrates
the observation process: empirical observations on diet are collected for each taxon, which are
influenced by sampling variation. In panel (d), these empirical observations are used to classify
each species into one of two diet states, based on the sampled relative importance of the three
food resource categories (e.g., panel c). These relative importance estimates are based on uneven,
and in some cases quite small, sample sizes, consistent with many empirical datasets (Vitt &
Vangilder 1983; Shine 1994; Alencar et al. 2013). In panel (e) we imagine repeating the state
assignment process on independent datasets while holding the sample sizes fixed to those in
panel (c), which reveals that both the initial state assignments and the number of states from (d)
are highly sensitive to real-world levels of sampling variation. This has obvious implications for
downstream macroevolutionary analyses. There is a serious risk that incorrect classification, and
therefore spurious diet state variation, will emerge from sampling variation alone. In the analyses
that underlie Figure 1, we find that more than 70 percent of tip state classifications do not match
the true pattern of resource use.
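The sensitivity of state assignments to sampling variation can be illustrated with a small simulation. The sketch below is not the analysis underlying Figure 1: the two diet distributions, the per-species sample sizes, and the 70 percent dominance rule used to call a species a specialist are all assumptions chosen for illustration. It repeatedly draws multinomial diet samples for four hypothetical species and records how often a naive classification rule disagrees with the true state.

```python
import random
from collections import Counter

# Assumed "true" diets: proportional use of three prey categories.
SPECIALIST = (0.8, 0.1, 0.1)
GENERALIST = (1 / 3, 1 / 3, 1 / 3)

# Hypothetical tips: (name, true state, number of diet observations).
TIPS = [("X", "specialist", 4), ("Y", "specialist", 25),
        ("P", "generalist", 3), ("Q", "generalist", 12)]


def sample_counts(proportions, n, rng):
    """Draw n observations from a multinomial with the given proportions."""
    draws = rng.choices(range(len(proportions)), weights=proportions, k=n)
    tally = Counter(draws)
    return [tally.get(j, 0) for j in range(len(proportions))]


def classify(counts, threshold=0.7):
    """Naive rule (an assumption): 'specialist' if one category
    makes up at least `threshold` of the sampled observations."""
    return "specialist" if max(counts) / sum(counts) >= threshold else "generalist"


def misclassification_rates(n_reps=10000, seed=1):
    """Fraction of replicate datasets in which each tip is misclassified."""
    rng = random.Random(seed)
    errors = Counter()
    for _ in range(n_reps):
        for name, state, n in TIPS:
            props = SPECIALIST if state == "specialist" else GENERALIST
            if classify(sample_counts(props, n, rng)) != state:
                errors[name] += 1
    return {name: errors[name] / n_reps for name, _, _ in TIPS}


rates = misclassification_rates()
for name, rate in rates.items():
    print(name, round(rate, 3))
```

Under these assumed settings, the generalist with three observations is misclassified much more often than the generalist with twelve, because all three observations fall in a single category about one time in nine.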

Is this a problem in practice? This issue is difficult to assess because few studies provide
information about the sample sizes that underlie state assignments. In most cases, ecological
states are simply asserted as known. It is also important to emphasize that the specific problem in
Figure 1 is an outcome of a more general problem: standard CTMC models have a limited ability
to model complex ecological phenotypes because of the assumption that states in the model are
univariate categorical variables (Hardy & Linder 2005; Hardy 2006). While it is true that CTMC
models operate on a countable state space, it is not true that the states of the system must be
univariate categorical variables. In this paper we treat states as multivariate probability
distributions to develop a model for studying the evolutionary dynamics of ecological resource
use on phylogenetic trees that avoids the need to classify taxa into univariate categorical states.

Our approach is explicitly designed to model resource traits that are multidimensional
and to account for uncertainty in ecological state assignments of terminal taxa arising from
effects of sampling variation. Antecedents to our approach can be found in the use of hidden
Markov models in macroevolution (Felsenstein & Churchill 1996; Marazzi et al. 2012; Beaulieu
et al. 2013; Beaulieu & O’Meara 2016; Caetano et al. 2018), which assume a deterministic
many-to-one mapping from a set of hidden states to a set of observable states. By contrast, we
assume that each hidden state is a multinomial distribution and that observed data are sampled
outcomes from these distributions (see panels (a) through (c) of Fig. 1). The number of states in
the model and the states themselves are not directly observed and are estimated from the data.
Using simulations and an empirical dataset of snake diets, we show how the method can use
observational counts to simultaneously infer the number of resource states, the proportional
utilization of resources by different states, and the phylogenetic distribution of ecological states
among living species and their ancestors. The method is general and applicable to any data
expressible as a set of observational counts from different resource categories.

MATERIALS & METHODS
General overview
We assume there is a discrete set of resource niches, which we refer to as “resource
states” or, more simply, as “states”. We use the term resource in a broad sense, with the
understanding that it may refer to behavior, habitat, prey, predators, etc. Although the set of
resource states is finite, we imagine that a state assigns a taxon a continuously valued parameter
that determines its proportional utilization of different resources. Taxa that share a resource state
therefore exhibit similar patterns of resource utilization. The data for each taxon consist of a set
of counts, with each count recording the number of times a specific taxon was observed to use a
specific resource. We refer to different resources as “resource categories”. The set of resource
states and their associated distributions over resource categories are therefore not directly
observed but must be estimated from empirical count data. We assume that resource states
evolve along a phylogenetic tree according to a CTMC process and that empirical counts are
observations drawn from a multinomial distribution associated with each resource state. Our
approach is similar to phylogenetic threshold models that combine a probability model for the
evolution of an unobserved variable and a probability model for sampling the observed data
conditioned on the set of unobserved variables (Felsenstein 2012; Revell 2014). The general
approach of treating states as probability distributions is due to Baum and Petrie (1966).
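The two layers described above (a CTMC for the unobserved state and a multinomial for the observed counts) can be sketched in a few lines. This is a simplified illustration and not our implementation: it assumes two hypothetical states with fixed, known proportions, equal transition rates between states, and a single branch rather than a full tree, and it does not integrate over the proportions as a Dirichlet-multinomial model would. The names `STATES`, `evolve_state`, `sample_counts`, and `multinomial_loglik` are introduced here for illustration only.

```python
import math
import random

# Hypothetical resource states: each is a multinomial distribution
# over J = 3 resource categories (values are assumptions).
STATES = {
    1: (0.8, 0.1, 0.1),        # e.g. a specialist on category 1
    2: (1 / 3, 1 / 3, 1 / 3),  # e.g. a generalist
}


def evolve_state(state, branch_length, rate, rng):
    """Simulate a CTMC along one branch, assuming the same
    transition rate between every pair of states."""
    n_other = len(STATES) - 1
    t = rng.expovariate(rate * n_other)      # waiting time to first change
    while t < branch_length:
        state = rng.choice([s for s in STATES if s != state])
        t += rng.expovariate(rate * n_other)  # waiting time to next change
    return state


def sample_counts(state, n_obs, rng):
    """Draw n_obs resource-use observations from the state's multinomial."""
    props = STATES[state]
    counts = [0] * len(props)
    for j in rng.choices(range(len(props)), weights=props, k=n_obs):
        counts[j] += 1
    return counts


def multinomial_loglik(counts, props):
    """Log probability of the counts under a multinomial with proportions props."""
    out = math.lgamma(sum(counts) + 1)
    for d, p in zip(counts, props):
        out -= math.lgamma(d + 1)
        if d > 0:  # skip empty categories to avoid log(0) * 0
            out += d * math.log(p)
    return out


rng = random.Random(42)
tip_state = evolve_state(1, branch_length=1.0, rate=0.5, rng=rng)
counts = sample_counts(tip_state, n_obs=15, rng=rng)
logliks = {k: multinomial_loglik(counts, p) for k, p in STATES.items()}
print(tip_state, counts, logliks)
```

Comparing the entries of `logliks` shows how observed counts carry information about which unobserved state generated them, which is the quantity an inference procedure would combine with the CTMC prior over states.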

Model description
Let $D = \{d_1, \ldots, d_n\}$ denote the data observed for a set of $n$ related taxa. Each datum
$d_i = \{d_{i1}, \ldots, d_{iJ}\}$ records the number of times the $i$-th taxon was observed to use each of $J$
possible resource categories. For example, in MacArthur's (1958) study on foraging behaviors in
wood-warblers, he made discrete observations of foraging orientation (radial, tangential, vertical)
for five species. For Blackburnian warblers, observations (counts) for each of these behaviors
were 11, 1, and 3; the resource vector for this species is thus $d = \{11, 1, 3\}$, for a total of 15
observations across all three “resource categories”. We assume that each taxon belongs to one of
$K$ possible states, and we let $X = \{x_1, \ldots, x_n\}$, where each $x_i \in \{1, \ldots, K\}$ denotes the state of a
particular taxon. States are unobserved but are assumed to represent distinct resource niches so
that $X$ partitions $D$ into subsets of taxa with homogeneous patterns of resource utilization. Our
target of inference is the posterior distribution of resource state assignments
$P(X \mid D) \propto P(D \mid X)\,P(X)$. To that end we define a probability model for the data as