
An evaluation of latent Dirichlet allocation in the context of plant-pollinator networks

by
Liam Callaghan

A Thesis Presented to
The University of Guelph

In partial fulfilment of requirements for the degree of Master of Science in
Mathematics and Statistics

Guelph, Ontario, Canada
© Liam Callaghan, December, 2012

ABSTRACT

An evaluation of latent Dirichlet allocation in the context of plant-pollinator networks

Liam Callaghan
University of Guelph, 2012

Advisors:
Dr. A. Ali
Dr. G. Umphrey

There may be several mechanisms that drive observed interactions between plants and pollinators in an ecosystem, many of which may involve trait matching or trait complementarity. Hence, a model of insect species activity on plant species should be represented as a mixture of these linkage rules. Unfortunately, ecologists do not always know how many, or even which, traits are the main contributors to the observed interactions. This thesis proposes the latent Dirichlet allocation (LDA) model from artificial intelligence for modelling the observed interactions in an ecosystem as a finite mixture of (latent) interaction groups, in which plant and pollinator pairs that share common linkage rules are placed in the same interaction group. Several model selection criteria are explored for estimating how many interaction groups best describe the observed interactions. This thesis also introduces a new model selection score called “penalized perplexity”. The performance of the model selection criteria, and of LDA in general, is evaluated through a comprehensive simulation study that considers networks of various sizes along with varying levels of nesting and numbers of interaction groups. Results of the simulation study suggest that LDA works well on networks with mild-to-no nesting, but loses accuracy with increased nestedness. Further, penalized perplexity tended to outperform the other model selection criteria in identifying the correct number of interaction groups used to simulate the data. Finally, LDA was demonstrated on a real network, the results of which provided insights into the functional roles of pollinator species in the study region.
Keywords: pollination network, latent Dirichlet allocation, linkage rules, perplexity, model selection, BIC, AIC, DIC.

Acknowledgments

I would like to thank my advisor, Dr. Ayesha Ali, for patiently helping me with my research at the University of Guelph. I am grateful for the learning opportunities offered by the conferences and workshops I have attended, and of course for the financial aid for them, which was provided by my advisor through the NSERC-CANPOLIN Canadian Pollination Initiative and Dr. Hermann Eberl. In addition to my advisor, I would like to thank Dr. Gary Umphrey, not only for being on my advisory committee but also for providing his advice and insight and for being a major part of my learning experience at the University of Guelph.
I am thankful to Luisa Carvalheiro for providing the Avon Gorge dataset as well as feedback on my analysis. I also thank Peter Kevan and Tom Woodcock for their support, expertise on pollination, and constructive comments.
Furthermore, I would like to express my gratitude towards my family and friends, whose support made it possible for me to complete my graduate studies.

-Liam

Table of Contents

List of Figures

List of Tables

1 Introduction

2 Pollination Networks
    2.1 Definition of a Pollination network
    2.2 Network terms and structure
    2.3 Methods used to identify compartments
        2.3.1 Trophic similarity
        2.3.2 Simulated annealing algorithm (SA)

3 Methodology
    3.1 Latent Dirichlet allocation
    3.2 Kullback-Leibler (KL) divergence and label switching
    3.3 Model Selection
        3.3.1 Perplexity
        3.3.2 Akaike Information Criterion (AIC)
        3.3.3 Bayesian Information Criterion (BIC)
        3.3.4 Deviance Information Criterion (DIC)
        3.3.5 Information Criterion (IC)
        3.3.6 Penalized Perplexity

4 Simulation Study
    4.1 Study design
    4.2 Data Generation and Model Fitting
    4.3 Statistics
    4.4 Results
        4.4.1 Parameter estimation Statistics
    4.5 Discussion

5 Data Analysis
    5.1 Description of the Avon Gorge Data
    5.2 Results
    5.3 Discussion

6 Conclusions
    6.1 Future Work

A Appendix
    A.1 Simulation study results
        A.1.1 Scenario 1
        A.1.2 Scenarios 2 to 4
        A.1.3 Scenarios 5 to 8
        A.1.4 Scenario 9
        A.1.5 Scenarios 13 to 16
        A.1.6 Scenario 17
        A.1.7 Scenarios 18 to 20
        A.1.8 Scenarios 21 to 24
    A.2 Avon Gorge dataset results
        A.2.1 Avon Gorge data results for analysis 1 using penalized perplexity
        A.2.2 Avon Gorge data results for analysis 1 using IC model selection criterion
    A.3 The lda package in R
    A.4 The bipartite package in R

List of Figures

2.1 A weighted bipartite graph representing observed interactions within an ecosystem. Circles represent pollinator species; squares represent plant species.

3.1 A graphical representation of the LDA model applied to the $a^{\text{th}}$ pollinator species with $n_a$ observed counts on $M$ plant species. $Z$ and $\theta^a$ are $K$-vectors, $Y^a$ and $\beta_z$ are $M$-vectors, and $\alpha$ and $\eta_Z$ are scalars, for $Z = 1, \ldots, K$ and $a = 1, \ldots, N$.

4.1 Visualization of a mildly nested visitation web with 20 visitor species (rows) and 9 plant species (columns). Darker cells represent higher frequencies of interactions between the corresponding plant-visitor pairs.
4.2 Stacked bar plots for the identified interaction groups in scenario 10. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.


5.1 Presence/absence visualization of Avon Gorge data with rare visits excluded and single visits excluded (N = 85, M = 53).
5.2 Presence/absence visualization of Avon Gorge data with rare visits included, but plants/visitors with single counts removed for analysis 3 (N = 85, M = 53).
5.3 Estimated visitation distribution by interaction group, averaged over 83 runs for $\hat{K} = 2$. Refer to Table 5.3 for plant species names.

A.1 Stacked bar plots for the identified interaction groups in scenario 1.


A.22 Presence/absence visualization of Avon Gorge data with rare visits included, but plants/visitors with single counts removed for analysis 1 (N = 89, M = 54).
A.23 Estimated visitation distribution by interaction group, averaged over 84 runs for $\hat{K} = 2$ in analysis 1 using penalized perplexity for model selection. Refer to Table 5.3 for plant species names.
A.24 Presence/absence visualization of Avon Gorge data with rare visits included, but plants/visitors with single counts removed for analysis 1 (N = 89, M = 54).
A.25 Estimated visitation distribution by interaction group, averaged over 100 runs. Refer to Table 5.3 for plant species names.

List of Tables

3.1 Notation for the LDA.
4.1 Dimensions and the number of interaction groups used to generate the data for the 24 different scenarios of the simulation study. No nesting corresponds to a compartmental model.
4.2 The test to accept η for a specified level of nesting.
4.3 Number of samples that chose the correct number of groups K out of 500 samples.
4.4 The discordance ratio calculated as (PP incorrect, PY correct)/(PP correct, PY incorrect) for the penalized perplexity and perplexity model selection criteria and (PP incorrect, AIC correct)/(PP correct, AIC incorrect) for the penalized perplexity and AIC model selection criteria for each scenario of 500 runs. The proportion of the 500 runs choosing an incorrect k for each scenario is also listed for each of the two criteria.
4.5 The number of groups identified for the scenarios with N = 42, M = 14 and K = 3 with penalized perplexity (PP) used as the model selection criterion.
4.6 Top row: The bias and relative bias for $\hat{\beta}$ for the scenarios with N = 42, M = 14 and K = 3 using penalized perplexity for model selection. Bottom row: The true β parameter used to generate the data.
4.7 The average relative bias for $\hat{\theta}$ for the scenarios with N = 42, M = 14 and K = 3 using the penalized perplexity for model selection.
4.8 The average bias for $\hat{\theta}$ for the scenarios with N = 42, M = 14 and K = 3 using the penalized perplexity for model selection.
4.9 The coefficient of variation (CV) for $\hat{\beta}$ for the scenarios with N = 42, M = 14 and K = 3 using the penalized perplexity for model selection.
4.10 The average coefficient of variation (CV) for $\hat{\theta}$ for the scenarios with N = 42, M = 14 and K = 3 using the penalized perplexity for model selection.
4.11 The average standard deviation for $\hat{\theta}$ for the scenarios with N = 42, M = 14 and K = 3 using the penalized perplexity for model selection.

5.1 Summary of counts in Avon Gorge data.
5.2 The number of interaction groups associated with the model chosen most often for each score. The number of times this model is selected out of the 100 runs is shown in brackets.
5.3 Estimated plant visitation distributions for each interaction group $\beta_k$, averaged over runs for $\hat{K} = 2$ (83) using analysis 3 of LDA with a Gibbs sampler and two interaction groups.
5.4 Estimated group membership distributions for each visitor species $\theta^a$, averaged over 83 independent runs of LDA where $\hat{K} = 2$ with a Gibbs sampler.
5.5 Estimated group membership distributions for each visitor species $\theta^a$, averaged over 83 independent runs of LDA where $\hat{K} = 2$ with a Gibbs sampler.

A.1 The number of groups identified for scenario 1 with N = 20, M = 9 and K = 2 with penalized perplexity (PP) used as the model selection criterion.
A.2 Top row: The bias and relative bias for $\hat{\beta}$ for scenario 1 with N = 20, M = 9 and K = 2 using penalized perplexity for model selection. Bottom row: The true β parameter used to generate the data.
A.3 The average relative bias for $\hat{\theta}$ for the scenarios with N = 20, M = 9
A.4 The average bias for $\hat{\theta}$ for the scenarios with N = 20, M = 9 and K
A.5 The coefficient of variation (CV) and SD for $\hat{\beta}$ for the scenarios with N = 20, M = 9 and K = 2 using the penalized perplexity for model selection.
A.6 The average coefficient of variation (CV) for $\hat{\theta}$ for the scenarios with
A.7 The average standard deviation for $\hat{\theta}$ for the scenarios with N = 20, M = 9 and K = 2 using the penalized perplexity for model selection.
A.8 The number of groups identified for the scenarios with N = 20, M = 9 and K = 3 with penalized perplexity (PP) used as the model selection criterion.
A.9 Top row: The bias and relative bias for $\hat{\beta}$ for the scenarios with N = 20, M = 9 and K = 3 using penalized perplexity for model selection. Bottom row: The true β parameter used to generate the data.
A.10 The average relative bias for $\hat{\theta}$ for the scenarios with N = 20, M = 9
A.11 The average bias for $\hat{\theta}$ for the scenarios with N = 20, M = 9 and K
A.12 The coefficient of variation (CV) and SD for $\hat{\beta}$ for the scenarios with
A.13 The average coefficient of variation (CV) for $\hat{\theta}$ for the scenarios with N = 20, M = 9 and K = 3 using the penalized perplexity for model selection.
A.14 The average standard deviation (SD) for $\hat{\theta}$ for the scenarios with N = 20, M = 9 and K = 3 using the penalized perplexity for model selection.
A.15 The number of groups identified for the scenarios with N = 20, M = 9 and K = 4 with penalized perplexity (PP) used as the model selection criterion.
A.16 Top row: The bias and relative bias for $\hat{\beta}$ for the scenarios with N =