An Evaluation of Latent Dirichlet Allocation in the Context of Plant-Pollinator Networks, by Liam Callaghan
An evaluation of latent Dirichlet allocation in the context of plant-pollinator networks

by Liam Callaghan

A Thesis Presented to The University of Guelph

In partial fulfilment of requirements for the degree of Master of Science in Mathematics and Statistics

Guelph, Ontario, Canada

© Liam Callaghan, December, 2012

ABSTRACT

An evaluation of latent Dirichlet allocation in the context of plant-pollinator networks

Liam Callaghan, University of Guelph, 2012
Advisors: Dr. A. Ali, Dr. G. Umphrey

There may be several mechanisms that drive observed interactions between plants and pollinators in an ecosystem, many of which may involve trait matching or trait complementarity. Hence a model of insect species activity on plant species should be represented as a mixture of these linkage rules. Unfortunately, ecologists do not always know how many, or even which, traits are the main contributors to the observed interactions. This thesis proposes the Latent Dirichlet Allocation (LDA) model from artificial intelligence for modelling the observed interactions in an ecosystem as a finite mixture of (latent) interaction groups in which plant and pollinator pairs that share common linkage rules are placed in the same interaction group. Several model selection criteria are explored for estimating how many interaction groups best describe the observed interactions. This thesis also introduces a new model selection score called "penalized perplexity". The performance of the model selection criteria, and of LDA in general, is evaluated through a comprehensive simulation study that considers networks of various sizes along with varying levels of nesting and numbers of interaction groups. Results of the simulation study suggest that LDA works well on networks with mild-to-no nesting, but loses accuracy with increased nestedness.
Further, the penalized perplexity tended to outperform the other model selection criteria in identifying the correct number of interaction groups used to simulate the data. Finally, LDA was demonstrated on a real network, the results of which provided insights into the functional roles of pollinator species in the study region.

Keywords: pollination network, latent Dirichlet allocation, linkage rules, perplexity, model selection, BIC, AIC, DIC.

Acknowledgments

I would like to thank my advisor Dr. Ayesha Ali for patiently helping me with my research at the University of Guelph. I am grateful for the learning opportunities through the conferences and workshops I have attended, and of course for the financial aid provided by my advisor through the NSERC-CANPOLIN Canadian Pollination Initiative and Dr. Hermann Eberl. In addition to my advisor, I would like to thank Dr. Gary Umphrey, not only for being on my advisory committee but also for providing his advice and insight while being a major part of my learning experience at the University of Guelph.

I am thankful to Luisa Carvalheiro for providing the Avon Gorge dataset as well as feedback on my analysis. I also thank Peter Kevan and Tom Woodcock for their support, expertise on pollination, and constructive comments.

Furthermore, I would like to express my gratitude towards my family and friends, whose support made it possible for me to complete my graduate studies.

-Liam

Table of Contents

List of Figures
List of Tables
1 Introduction
2 Pollination Networks
  2.1 Definition of a Pollination network
  2.2 Network terms and structure
  2.3 Methods used to identify compartments
    2.3.1 Trophic similarity
    2.3.2 Simulated annealing algorithm (SA)
3 Methodology
  3.1 Latent Dirichlet allocation
  3.2 Kullback-Leibler (KL) divergence and label switching
  3.3 Model Selection
    3.3.1 Perplexity
    3.3.2 Akaike Information Criterion (AIC)
    3.3.3 Bayesian Information Criterion (BIC)
    3.3.4 Deviance Information Criterion (DIC)
    3.3.5 Information Criterion (IC)
    3.3.6 Penalized Perplexity
4 Simulation Study
  4.1 Study design
  4.2 Data Generation and Model Fitting
  4.3 Statistics
  4.4 Results
    4.4.1 Parameter estimation Statistics
  4.5 Discussion
5 Data Analysis
  5.1 Description of the Avon Gorge Data
  5.2 Results
  5.3 Discussion
6 Conclusions
  6.1 Future Work
A Appendix
  A.1 Simulation study results
    A.1.1 Scenario 1
    A.1.2 Scenarios 2 to 4
    A.1.3 Scenarios 5 to 8
    A.1.4 Scenario 9
    A.1.5 Scenarios 13 to 16
    A.1.6 Scenario 17
    A.1.7 Scenarios 18 to 20
    A.1.8 Scenarios 21 to 24
  A.2 Avon Gorge dataset results
    A.2.1 Avon Gorge data results for analysis 1 using penalized perplexity
    A.2.2 Avon Gorge data results for analysis 1 using IC model selection criterion
  A.3 The lda package in R
  A.4 The bipartite package in R

List of Figures

2.1 A weighted bipartite graph representing observed interactions within an ecosystem. Circles represent pollinator species; squares represent plant species.
3.1 A graphical representation of the LDA model applied to the $a$th pollinator species with $n_a$ observed counts on $M$ plant species. $Z^a$ and $\theta_a$ are $K$-vectors, $Y$ and $\beta_z$ are $M$-vectors, and $\alpha$ and $\eta_Z$ are scalars for $Z = 1, \ldots, K$ and $a = 1, \ldots, N$.
4.1 Visualization of a mildly nested visitation web with 20 visitor species (rows) and 9 plant species (columns). Darker cells represent higher frequencies of interactions between the corresponding plant-visitor pairs.
4.2 Stacked bar plots for the identified interaction groups in scenario 10. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
4.3 Stacked bar plots for the identified interaction groups in scenario 11.
The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
4.4 Stacked bar plots for the identified interaction groups in scenario 12. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
5.1 Presence/absence visualization of Avon Gorge data with rare visits excluded and single visits excluded (N = 85, M = 53).
5.2 Presence/absence visualization of Avon Gorge data with rare visits included, but plants/visitors with single counts removed for analysis 3 (N = 85, M = 53).
5.3 Estimated visitation distribution by interaction group, averaged over 83 runs for $\hat{K} = 2$. Refer to Table 5.3 for plant species names.
A.1 Stacked bar plots for the identified interaction groups in scenario 1. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.2 Stacked bar plots for the identified interaction groups in scenario 2. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.3 Stacked bar plots for the identified interaction groups in scenario 3. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.4 Stacked bar plots for the identified interaction groups in scenario 4. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.5 Stacked bar plots for the identified interaction groups in scenario 5. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.6 Stacked bar plots for the identified interaction groups in scenario 6. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.7 Stacked bar plots for the identified interaction groups in scenario 7.
The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.8 Stacked bar plots for the identified interaction groups in scenario 8. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.9 Stacked bar plots for the identified interaction groups in scenario 9. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.10 Stacked bar plots for the identified interaction groups in scenario 13. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.11 Stacked bar plots for the identified interaction groups in scenario 14. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.12 Stacked bar plots for the identified interaction groups in scenario 15. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.13 Stacked bar plots for the identified interaction groups in scenario 16. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.14 Stacked bar plots for the identified interaction groups in scenario 17. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.15 Stacked bar plots for the identified interaction groups in scenario 18. The top plots are for the runs with $\hat{K} = K$ and the bottom row is for $\hat{K} \neq K$.
A.16 Stacked bar plots for the identified interaction groups in scenario 19.