An Evaluation of Latent Dirichlet Allocation in the Context of Plant-Pollinator Networks by Liam Callaghan a Thesis Presented To

An evaluation of latent Dirichlet allocation in the context of plant-pollinator networks

by
Liam Callaghan

A Thesis Presented to The University of Guelph
In partial fulfilment of requirements for the degree of Master of Science in Mathematics and Statistics

Guelph, Ontario, Canada
© Liam Callaghan, December, 2012

ABSTRACT

Liam Callaghan
University of Guelph, 2012
Advisors: Dr. A. Ali, Dr. G. Umphrey

There may be several mechanisms that drive observed interactions between plants and pollinators in an ecosystem, many of which may involve trait matching or trait complementarity. Hence, a model of insect species activity on plant species should be represented as a mixture of these linkage rules. Unfortunately, ecologists do not always know how many, or even which, traits are the main contributors to the observed interactions. This thesis proposes the latent Dirichlet allocation (LDA) model from artificial intelligence for modelling the observed interactions in an ecosystem as a finite mixture of (latent) interaction groups, in which plant and pollinator pairs that share common linkage rules are placed in the same interaction group. Several model selection criteria are explored for estimating how many interaction groups best describe the observed interactions. This thesis also introduces a new model selection score called "penalized perplexity". The performance of the model selection criteria, and of LDA in general, is evaluated through a comprehensive simulation study that considers networks of various sizes along with varying levels of nesting and numbers of interaction groups. Results of the simulation study suggest that LDA works well on networks with mild-to-no nesting, but loses accuracy with increased nestedness. Further, the penalized perplexity tended to outperform the other model selection criteria in identifying the correct number of interaction groups used to simulate the data. Finally, LDA was demonstrated on a real network, the results of which provided insights into the functional roles of pollinator species in the study region.

Keywords: pollination network, latent Dirichlet allocation, linkage rules, perplexity, model selection, BIC, AIC, DIC.

Acknowledgments

I would like to thank my advisor Dr. Ayesha Ali for patiently helping me with my research at the University of Guelph. I am grateful for the learning opportunities through the conferences and workshops I have attended, and of course for the financial aid, which was provided by my advisor through the NSERC-CANPOLIN Canadian Pollination Initiative and Dr. Hermann Eberl. In addition to my advisor, I would like to thank Dr. Gary Umphrey, not only for being on my advisory committee but also for providing his advice and insight while being a major part of my learning experience at the University of Guelph. I am thankful to Luisa Carvalheiro for providing the Avon Gorge dataset as well as feedback for my analysis. I also thank Peter Kevan and Tom Woodcock for their support, expertise on pollination, and constructive comments. Furthermore, I would like to express my gratitude towards my family and friends, whose support made it possible for me to complete my graduate studies.

-Liam

Table of Contents

List of Figures
List of Tables
1 Introduction
2 Pollination Networks
2.1 Definition of a Pollination network
2.2 Network terms and structure
2.3 Methods used to identify compartments
2.3.1 Trophic similarity
2.3.2 Simulated annealing algorithm (SA)
3 Methodology
3.1 Latent Dirichlet allocation
3.2 Kullback-Leibler (KL) divergence and label switching
3.3 Model Selection
3.3.1 Perplexity
3.3.2 Akaike Information Criterion (AIC)
3.3.3 Bayesian Information Criterion (BIC)
3.3.4 Deviance Information Criterion (DIC)
3.3.5 Information Criterion (IC)
3.3.6 Penalized Perplexity
4 Simulation Study
4.1 Study design
4.2 Data Generation and Model Fitting
4.3 Statistics
4.4 Results
4.4.1 Parameter estimation Statistics
4.5 Discussion
5 Data Analysis
5.1 Description of the Avon Gorge Data
5.2 Results
5.3 Discussion
6 Conclusions
6.1 Future Work
A Appendix
A.1 Simulation study results
A.1.1 Scenario 1
A.1.2 Scenarios 2 to 4
A.1.3 Scenarios 5 to 8
A.1.4 Scenario 9
A.1.5 Scenarios 13 to 16
A.1.6 Scenario 17
A.1.7 Scenarios 18 to 20
A.1.8 Scenarios 21 to 24
A.2 Avon Gorge dataset results
A.2.1 Avon Gorge data results for analysis 1 using penalized perplexity
A.2.2 Avon Gorge data results for analysis 1 using IC model selection criterion
A.3 The lda package in R
A.4 The bipartite package in R

List of Figures

2.1 A weighted bipartite graph representing observed interactions within an ecosystem. Circles represent pollinator species; squares represent plant species.
3.1 A graphical representation of the LDA model applied to the a-th pollinator species with n_a observed counts on M plant species.
Z and θ_a are K-vectors, Y_a and β_z are M-vectors, and α and η_z are scalars, for z = 1, ..., K and a = 1, ..., N.
4.1 Visualization of a mildly nested visitation web with 20 visitor species (rows) and 9 plant species (columns). Darker cells represent higher frequencies of interactions between the corresponding plant-visitor pairs.
4.2 Stacked bar plots for the identified interaction groups in scenario 10. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
4.3 Stacked bar plots for the identified interaction groups in scenario 11. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
4.4 Stacked bar plots for the identified interaction groups in scenario 12. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
5.1 Presence/absence visualization of Avon Gorge data with rare visits excluded and single visits excluded (N = 85, M = 53).
5.2 Presence/absence visualization of Avon Gorge data with rare visits included, but plants/visitors with single counts removed for analysis 3 (N = 85, M = 53).
5.3 Estimated visitation distribution by interaction group, averaged over 83 runs for K̂ = 2. Refer to Table 5.3 for plant species names.
A.1 Stacked bar plots for the identified interaction groups in scenario 1. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.2 Stacked bar plots for the identified interaction groups in scenario 2. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.3 Stacked bar plots for the identified interaction groups in scenario 3. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.4 Stacked bar plots for the identified interaction groups in scenario 4. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.5 Stacked bar plots for the identified interaction groups in scenario 5. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.6 Stacked bar plots for the identified interaction groups in scenario 6. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.7 Stacked bar plots for the identified interaction groups in scenario 7. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.8 Stacked bar plots for the identified interaction groups in scenario 8. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.9 Stacked bar plots for the identified interaction groups in scenario 9. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.10 Stacked bar plots for the identified interaction groups in scenario 13. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.11 Stacked bar plots for the identified interaction groups in scenario 14. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.12 Stacked bar plots for the identified interaction groups in scenario 15. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.13 Stacked bar plots for the identified interaction groups in scenario 16. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.14 Stacked bar plots for the identified interaction groups in scenario 17. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.15 Stacked bar plots for the identified interaction groups in scenario 18. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.16 Stacked bar plots for the identified interaction groups in scenario 19. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.17 Stacked bar plots for the identified interaction groups in scenario 20. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.18 Stacked bar plots for the identified interaction groups in scenario 21. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.19 Stacked bar plots for the identified interaction groups in scenario 22. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.20 Stacked bar plots for the identified interaction groups in scenario 23. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.21 Stacked bar plots for the identified interaction groups in scenario 24. The top plots are for the runs with K̂ = K and the bottom row is for K̂ ≠ K.
A.22 Presence/absence visualization of Avon Gorge data with rare visits included, but plants/visitors with single counts removed for analysis 1 (N = 89, M = 54).
A.23 Estimated visitation distribution by interaction group, averaged over 84 runs for K̂ = 2 in analysis 1 using penalized perplexity for model selection. Refer to Table 5.3 for plant species names.
A.24 Presence/absence visualization of Avon Gorge data with rare visits included, but plants/visitors with single counts removed for analysis 1 (N = 89, M = 54).
A.25 Estimated visitation distribution by interaction group, averaged over 100 runs. Refer to Table 5.3 for plant species names.

List of Tables

3.1 Notation for the LDA.
4.1 Dimensions and the number of interaction groups used to generate the data for the 24 different scenarios of the simulation study. No nesting corresponds to a compartmental model.
4.2 The test to accept η for a specified level of nesting.
4.3 Number of samples that chose the correct number of groups K out of 500 samples.
4.4 The discordance ratio calculated as (PP incorrect, PY correct)/(PP correct, PY incorrect) for the penalized perplexity and perplexity model selection criteria and (PP incorrect, AIC correct)/(PP correct, AIC incorrect) for the penalized perplexity and AIC model selection criteria for each scenario of 500 runs. The proportion of the 500 runs choosing an incorrect k for each scenario is also listed for each of the two criteria.
4.5 The number of groups identified for the scenarios with N = 42, M = 14 and K = 3 with penalized perplexity (PP) used as the model selection criterion.
4.6 Top row: The bias and relative bias for β̂ for the scenarios with N = 42, M = 14 and K = 3 using penalized perplexity for model selection. Bottom row: The true β parameter used to generate the data.
4.7 The average relative bias for θ̂ for the scenarios with N = 42, M = 14 and K = 3 using the penalized perplexity for model selection.
4.8 The average bias for θ̂ for the scenarios with N = 42, M = 14 and K = 3 using the penalized perplexity for model selection.
4.9 The coefficient of variation (CV) for β̂ for the scenarios with N = 42, M = 14 and K = 3 using the penalized perplexity for model selection.
4.10 The average coefficient of variation (CV) for θ̂ for the scenarios with N = 42, M = 14 and K = 3 using the penalized perplexity for model selection.
4.11 The average standard deviation for θ̂ for the scenarios with N = 42, M = 14 and K = 3 using the penalized perplexity for model selection.
5.1 Summary of counts in Avon Gorge data.
5.2 The number of interaction groups associated with the model chosen most often for each score. The number of times this model is selected out of the 100 runs is shown in brackets.
5.3 Estimated plant visitation distributions for each interaction group β_k, averaged over runs for K̂ = 2 (83) using analysis 3 of LDA with a Gibbs sampler and two interaction groups.
5.4 Estimated group membership distributions for each visitor species θ_a, averaged over 83 independent runs of LDA where K̂ = 2 with a Gibbs sampler.
5.5 Estimated group membership distributions for each visitor species θ_a, averaged over 83 independent runs of LDA where K̂ = 2 with a Gibbs sampler.
A.1 The number of groups identified for scenario 1 with N = 20, M = 9 and K = 2 with penalized perplexity (PP) used as the model selection criterion.
A.2 Top row: The bias and relative bias for β̂ for scenario 1 with N = 20, M = 9 and K = 2 using penalized perplexity for model selection. Bottom row: The true β parameter used to generate the data.
A.3 The average relative bias for θ̂ for the scenarios with N = 20, M = 9 and K = 2 using the penalized perplexity for model selection.
A.4 The average bias for θ̂ for the scenarios with N = 20, M = 9 and K = 2 using the penalized perplexity for model selection.
A.5 The coefficient of variation (CV) and SD for β̂ for the scenarios with N = 20, M = 9 and K = 2 using the penalized perplexity for model selection.
A.6 The average coefficient of variation (CV) for θ̂ for the scenarios with N = 20, M = 9 and K = 2 using the penalized perplexity for model selection.
A.7 The average standard deviation for θ̂ for the scenarios with N = 20, M = 9 and K = 2 using the penalized perplexity for model selection.
A.8 The number of groups identified for the scenarios with N = 20, M = 9 and K = 3 with penalized perplexity (PP) used as the model selection criterion.
A.9 Top row: The bias and relative bias for β̂ for the scenarios with N = 20, M = 9 and K = 3 using penalized perplexity for model selection. Bottom row: The true β parameter used to generate the data.
A.10 The average relative bias for θ̂ for the scenarios with N = 20, M = 9 and K = 3 using the penalized perplexity for model selection.
A.11 The average bias for θ̂ for the scenarios with N = 20, M = 9 and K = 3 using the penalized perplexity for model selection.
A.12 The coefficient of variation (CV) and SD for β̂ for the scenarios with N = 20, M = 9 and K = 3 using the penalized perplexity for model selection.
A.13 The average coefficient of variation (CV) for θ̂ for the scenarios with N = 20, M = 9 and K = 3 using the penalized perplexity for model selection.
A.14 The average standard deviation (SD) for θ̂ for the scenarios with N = 20, M = 9 and K = 3 using the penalized perplexity for model selection.
A.15 The number of groups identified for the scenarios with N = 20, M = 9 and K = 4 with penalized perplexity (PP) used as the model selection criterion.
A.16 Top row: The bias and relative bias for β̂ for the scenarios with N =
