Free and Open Source Software for Geospatial (FOSS4G) Conference Proceedings

Volume 14 Portland, Oregon, USA Article 4

2014

Adding Phylogenies to QGIS and Lifemapper for Evolutionary Studies of Species Diversity

Jeffery A. Cavner, University of Kansas (USA)

Aimee M. Stewart

Charles J. Grady

James H. Beach

Follow this and additional works at: https://scholarworks.umass.edu/foss4g Part of the Geography Commons

Recommended Citation Cavner, Jeffery A.; Stewart, Aimee M.; Grady, Charles J.; and Beach, James H. (2014) "Adding Phylogenies to QGIS and Lifemapper for Evolutionary Studies of Species Diversity," Free and Open Source Software for Geospatial (FOSS4G) Conference Proceedings: Vol. 14 , Article 4. DOI: https://doi.org/10.7275/R5T72FN2 Available at: https://scholarworks.umass.edu/foss4g/vol14/iss1/4

This Paper is brought to you for free and open access by ScholarWorks@UMass Amherst. It has been accepted for inclusion in Free and Open Source Software for Geospatial (FOSS4G) Conference Proceedings by an authorized editor of ScholarWorks@UMass Amherst. For more information, please contact [email protected]. Adding Phylogenies to QGIS and Lifemapper

Adding Phylogenies to QGIS and Lifemapper for Evolutionary Studies of Species Diversity

by Jeffery A. Cavner, Aimee M. Stewart, Charles J. Grady, James H. Beach

University of Kansas (USA). [email protected]

Abstract

Phylogenetic data from the “Tree of Life” have explicit spatial and temporal components when paired with species distribution and ecological data for testing contributions to biological community assembly at different geographic scales of species interaction. Important questions in biology about the degree of niche suitability, and whether the history of a community’s assembly for an area can affect whether the species in a community are more or less phylogenetically related, can be answered using several different spatially filtered measures of phylogenetic diversity. Phylogenetic analyses that support the description of ecological processes are usually carried out in a handful of software libraries that are narrowly focused on a single set of tasks. Very few applications scale to large datasets, and most do not have an explicit spatial component without relying on external visualization packages. This prompted us to explore bringing phylogenetic data into an open-source GIS environment. The Lifemapper Macroecology/Range & Diversity QGIS plug-in is a custom plug-in which we use to calculate and map biodiversity indices that describe range-diversity relationships derived from large multi-species datasets. We describe extensions to that plug-in which expand the Lifemapper set of ecological tools to link phylogenies to spatially derived ’diversity field’ statistics that describe the phylogenetic composition of natural communities.

Keywords: QGIS, WPS, Distributed Computing, Biogeography, Range and Diversity, Lifemapper, Macroecology, Phylogenetics.

1. Background

Community phylogenetics, the focus on how species relatedness and species traits are associated with how evolution extends into ecological processes and spatial patterns, and biogeography or meta-community ecology, largely focused on the spatial regulation of species distributions, should assay the spatial variation of phylogenies by mapping phylogenetic community values across space and time at different scales using advances in GIS techniques. One such approach would be to bring phylogenetic data into a GIS environment. We have begun to develop such an approach as an addition to the Lifemapper project (www.lifemapper.org) in a Lifemapper Range & Diversity (LmRAD) QGIS plug-in (Cavner et al. 2014) that provides phylogenetic visualization and analysis tools for spatially linked range-diversity relationships derived from presence-absence matrices (PAMs). We also developed the tool hoping to expand it to include historical biogeography meta-community analyses and community assembly analyses focused on phylogenetic-diversity area relationships, where analysis across geographic scale leads some of the most important questions in biodiversity.

The LmRAD QGIS plug-in creates, maps and analyzes presence-absence matrices (PAMs), one of the core data structures for macroecological research. It links the resulting data to phylogenetic and spatial views of a set of range-diversity statistics derived from the PAM. The PAM is a 2-dimensional Boolean matrix constructed from a spatially defined grid of regular polygons, where the presence or absence of each of hundreds or thousands of species is recorded for each cell. One axis of the matrix represents species and the orthogonal axis represents geographic localities described by the regular polygons. Each geographic site is coded for the presence (1) or absence (0) of each species. The PAM summarizes the two fundamental units of biogeography: the distributional range of a species (both its position and size; range size simply equals the total along the species axis across sites) and the species diversity of sites, i.e. the number of different species in each site as summarized by the site axis totals.

Several mathematical and biological relationships hold across the PAM that link spatially derived statistics with species-based statistics. Of interest for phylogenetic relationships are the species-based statistics calculated from the PAM that measure the “diversity field” of a species (Arita et al. 2008). The diversity field is the set of diversity values of sites in which a species occurs. For example, the

OSGEO Journal Volume 14 Page 19 of 48

diversity field volume, i.e. the summation of those species diversity values within a species’ range divided by the range size of the species, allows us to calculate the average species diversity within the range of that species. We represent that volume as a proportion of the total number of species in the study area. Including the total area of the study area allows us to illustrate the proportion of the sites in which two species co-occur. The average association of a species with all of the species in the study area allows us to illustrate that there is an inverse relationship between the proportional range of a species and the difference between the mean proportional diversity within its range and the average proportional diversity in the study area (Arita et al. 2008). The mathematical reciprocal of the average proportional diversity of the study area is a well-studied measure of species turnover called Whittaker’s beta diversity. It is a measure of the ratio between the overall diversity of the study area and the average local diversity (Arita et al. 2008). There are closely associated beta measures of diversity for several different types of diversity. Different approaches to species diversity, such as phylogenetic diversity (the degree of relatedness of species in a community based on their evolutionary history), abundance and ecosystem function measures of diversity, can all be decomposed into measures of local and regional diversity ratios that are highly dependent on scale.

Analyzing the diversity field within the range of a species is equivalent to studying its covariance with all the species in a study, i.e. the degree of association of species within their ranges. We plot this association in QGIS through the plug-in in a “range-diversity” plot. Curves on the plot for species follow a line defined by the inverse relationship between the range of a species and the difference between the two diversity statistics. When the species are plotted in this way, species with equal degrees of association with one another arrange themselves along lines of isocovariance. The Lifemapper plug-in allows the user to “brush” data points along those curves in the interactive range-diversity plot, which selects the individual species in the linked data space for the phylogenetic tree. In this way the spatially derived statistics for diversity from the PAM can be compared to the degree of phylogenetic relatedness within species communities.

The plug-in accomplishes this by using QGIS as a WPS client to Lifemapper web services (Stewart et al. 2014) and by using JavaScript-based visualization technologies for large phylogenetic trees within the plug-in. Macroecology algorithms are exposed as Open Geospatial Consortium (OGC) Web Processing Services (WPS) (Open Geospatial Consortium, Inc. 2007b) so that larger distributed computing environments can be brought to bear on large datasets. The Lifemapper web services are organized as two modules, LmSDM and LmRAD. The LmSDM module uses RESTful and OGC specifications to build species distribution models based on the predicted niche for a species using climate and species occurrence data. The LmRAD (Range and Diversity) module is a multi-species platform for PAM-based range and diversity calculations. Both modules can be accessed through the plug-in, and outputs from LmSDM can be piped into LmRAD as species inputs to PAMs.

This paper focuses on the range and diversity capabilities of the plug-in, on how the spatial component of phylogenetic data recently added to the plug-in can be used with the biodiversity indices calculated from the PAM, and on areas where phylogenetic data can be used to explore other types of diversity measures for species communities. We begin by outlining use cases, the common threads that connect them, and how we have begun to address them, with a focus on new interface capabilities for phylogenetic data and linked data spaces. Next we describe how the Lifemapper plug-in and its supporting web services were designed to take advantage of a client-server architecture in order to use geographic processing standards on large datasets. This is followed by a comparison of related software, with a focus on phylogenetic algorithms and scripts with a spatial component. We end by discussing findings and future directions for the Lifemapper plug-in.

2. Use Cases and Capabilities

2.1 Range and Diversity Plots and Maps with Phylogenetic Trees

Phylogenetic-based ecology is a growing field. Its practice at both small scales and larger biogeographic scales (it goes under several names: phylogeography, ecophylogenetics, or phylogenetic community ecology) shares two obvious constraints on incorporating phylogenetic data into ecology research. First, many ecophylogenetic methods are not available as open-source software packages and are therefore not extensible or customizable. Second, the tools are scattered across specialty software, each with its own learning curve and unique data formats (Kembel et al. 2010). When


Figure 1: Lifemapper Range-Diversity Plug-in in QGIS with Range-Diversity Plot and Tree View.

we extended Lifemapper to include the multi-species range and diversity experiments on PAMs in the LmRAD module, we encountered a third constraint. Untangling species associations at large scales remains an important research question (Arita et al. 2008), and these scales require large taxa with numerous species in single experiments, e.g. approximately 3000+ species of birds in South America. Each of these species is represented by some type of geographical map which has to be intersected with large numbers of polygons, resulting in PAM matrices of several million elements. These models must also be permuted, i.e. randomized to perform null model hypothesis testing, resulting in a large number of unwieldy data structures. Additionally these studies need to be done at several different spatial scales and across time. Totaling all of these dimensions, one can readily see that this presents serious computational challenges. These factors informed the design of LmRAD as a module of the existing Lifemapper computational platform, which uses co-located distributed high-throughput compute resources to calculate large multi-part ecological modeling jobs, and a thick client front-end in QGIS to those services.

Important biological relationships are expressed in the PAM as associations between species diversity in communities and the range size of those species. Two important correlations are between the species diversity of sites and the mean range sizes of species occurring in the site, and between the range sizes of species and the mean species diversity within those ranges. The set of range statistics for species within a site is described by a ’dispersion field’. The analogue to that measure for species is the ’diversity field’, which quantifies the species diversity of sites in which a species occurs (Arita et al. 2008). Range-diversity plots are produced in the Lifemapper plug-in that summarize these fields as indexes of site similarity and the degree of association of species, adding to our knowledge of species communities. Data points in the plots are constrained by the association of species and site similarity and by the proportional fill of the matrix, i.e. the ratio of presences of species to their absence. The plug-in allows a researcher to build several models across scale and to experiment with fill, extent and resolution. The dispersion field and diversity field measures in the range-diversity plots are interactive and allow multiple pane selection. Visualization and data exploration are presented in both geographic and phylogenetic data spaces. The dispersion field statistics are viewed in an interactive "by-sites" range-diversity plot and are linked geographically to a map of the range statistics attached to the input grid for the PAM so that they can be overlaid with other geographic data. For species associations within communities the data derived from

the PAM are depicted in an interactive “by-species” range-diversity plot for the diversity field and are linked to a dendrogram that represents their phylogenetic relationship. All of the data spaces allow for ‘brushing’ of datasets by species or location across tree data space and geographic data space. Selecting species in the tree viewer selects those species in the “by-species” range-diversity plot. In this way the trees can act as a data exploration tool against the diversity indices derived from the PAM, providing insight into the phylogenetic composition of communities where species co-occur (see Figure 1).

2.2 Across Space and Time: Scale Considerations

The indices currently calculated through the plug-in, including ecology staples such as beta diversity, along with measures of nestedness (the degree to which diversity loss occurs by species, leaving isolated “islands” of diversity), are all affected by scale. The degree to which these indices are affected by scale and the mechanisms involved are important research questions (Arita et al. 2008, Lira-Noriega et al. 2007). Most analyses of scaling effects on diversity have been based on coarse input grids. For example, Hawkins et al. (2003) based a diversity study comparing the effect of scale on 85 datasets with resolutions ranging from 10^3 to 10^5 km^2 (Lira-Noriega et al. 2007). Lira-Noriega et al. performed a study with finer PAM resolutions starting at 11.4 km^2 and incrementally climbing to 2.93 x 10^3 km^2 for an area of ~138,200 km^2. The Lifemapper plug-in has been used to construct PAMs for much larger areas, ~24,709,000 km^2, with slightly larger cell resolutions of 100 km^2, but with the recent additions of data parallelization and portable instances of Lifemapper we expect to be able to produce PAMs with cell resolutions lower than 1/32° for the globe. We can also currently test scale-related hypotheses about range size and diversity, such as predictions that for the same kind of organism, organized by taxa and their ability to disperse across the landscape, stronger negative correlations between range size and diversity should exist the greater the scale. With the incorporation of phylogenetic tree data into the plug-in, several questions that relate to spatial scale can also be asked of phylogenetic-diversity area relationships and of the extent to which speciation and adaptation contribute to community assembly.

Because biogeographers are increasingly interested in methods in phylogeography and community assembly, research questions addressed by species richness based diversity measures, phylogenetic diversity and functional diversity need to benefit from relative findings and work together to complement one another (Cianciarus 2011). A common thread connecting different concepts of diversity is the set of questions about the evolutionary and biogeographical history of a species, how temporal and spatial scales affect the evolutionary relatedness of species in a habitat, and the degree to which those assemblages are consistent with environmental filtering or competitive interaction (Emerson and Gillespie 2008). The species composition of natural communities is tied to questions of range contraction and local extirpation of species in relation to niche processes like climate change. The Lifemapper/QGIS plug-in allows the user to build PAMs that describe range and diversity relationships across time in relation to climate change by using predicted eco-niches based on climate scenarios, derived from LmSDM, as inputs to future PAMs.

Phylogenetic data have both spatial and temporal components. Patterns of co-occurrence of species in a spatially defined community are affected over different time and spatial scales by the similarity and distance of other habitats, the degree to which niches are filled with current inhabitants, and the relative time available for colonization or adaptation (Emerson and Gillespie 2008). Patterns of community structure and co-occurrence of species can be summarized by two related statistics derived from phylogenies for a geographic area: phylogenetic clustering and phylogenetic over-dispersion/evenness. Phylogenetic clustering occurs when co-occurring species are more closely related than can be expected by chance. Phylogenetic over-dispersion/evenness occurs when co-occurring species are more distantly related than can be expected by chance. With the tree viewer these phenomena are easily discernible for small trees when the species that co-occur within a community are selected. Both of these measures will need to be quantified for larger trees, and both must be tested against null models generated from the tree and its spatial components. Lifemapper currently implements some very efficient bit-wise operations for randomizing null models from the PAM. To permute the tree data, we will in the future build out the architecture for encoding the tree topology from large phylogenies into matrices that will use similar methods for randomization.

Clade-based analyses of traits related to niche occupancy help us to understand the relative importance of environmental filtering. Using cross scale

comparisons in the plug-in with the phylogenetic trees could help to tease out effects of both temporal and spatial scale. Larger extents within an LmRAD experiment should show phylogenetic clustering due to environmental filters, and local areas, which will naturally contain subsets of the same taxa used in the experiment, should show local over-dispersion due to competitive interaction. For temporal scale, range and diversity measures from time-stepped PAMs, achievable with the recent acquisition of paleontological climate layers and the future climate scenario data currently in the plug-in, should allow us, with the use of the trees, to look at colonization dynamics, at how over-dispersion and cladogenesis become more important over time for isolated niches, and at how species new to a habitat over large time frames, e.g. island migration, show shared common traits pre-adapted to a habitat (Emerson and Gillespie 2008).

3. Design and Architecture

3.1 Lifemapper Distributed Computational Services

The Lifemapper Range and Diversity (LmRAD) module is an analysis suite that extends the current Lifemapper (www.lifemapper.org) platform, allowing us to leverage the computational power of distributed computing environments to execute the range-diversity analyses as distributed algorithms. The algorithms are exposed as Open Geospatial Consortium Web Processing Services (WPS) (Open Geospatial Consortium, Inc. 2007b), and as RESTful web services for simple data retrieval and viewing. The Lifemapper infrastructure is composed of a central management component, LmDbServer, which manages data and analysis operations with a “data pipeline” written in Python (www.python.org) and a PostgreSQL/PostGIS database; multiple instances of LmCompute that can be co-located across institutions, currently deployed at compute clusters at University of Kansas, University of Florida, and San Diego Supercomputer Center; a continuously updated species model and species occurrence set archive based on museum data for species from the Global Biodiversity Information Facility (GBIF); and LmWebServer, which manages all communications between the components and client applications (see Figure 2). LmRAD specifically is a distributed multi-species modeling module within this system with custom algorithms for working with presence-absence data, including matrix definition, construction, calculation, randomization for null models and preparation of visualization outputs, trees and maps.

As a job-based infrastructure, LmRAD and LmSDM algorithms are environmentally agnostic and are portable across compute environments through instances of LmCompute that are deployable in several types of distributed compute environments. LmCompute is a pluggable, configurable, open source client that abstracts the details of the compute job away from the physical system. LmWebServer contains a Job Server tier that feeds jobs to any compute environment that can sponsor an instance of LmCompute. LmCompute is also generalizable: since it only interacts with the physical system through a mediator designed along the mediator and facade design patterns (Gamma et al. 1994), the compute plug-in expects just a few stock functions. A “request job” method call might just as easily get a local XML job definition or pull a job from the Lifemapper Job Server. An instance of LmCompute can use a job response to instantiate a Job Runner object and retrieve inputs to the methods requested. Each of these computational tasks or groups of related tasks is a compute plug-in based on the template method and strategy design patterns (Gamma et al. 1994). The compute plug-in is wrapped in a “runner” class that, depending on its run method, can execute an external application or run custom algorithms like the LmRAD algorithms. A compute plug-in receives its jobs through a job controller that acts as a hub for producing job outputs. Using the factory method pattern and command pattern (Gamma et al. 1994), the controller sits in front of a compute environment, requests data inputs for a job, and determines through Python “duck typing” which compute plug-in is appropriate for the computation. The pipeline and LmDbServer are responsible for presenting jobs to the Job Server on LmWebServer and moving jobs through the system. At different stages in an LmRAD experiment, dependencies and statuses are updated by LmCompute, which posts back to the Job Server during the process. LmRAD PAM operations specifically have been parallelized across processors on any compute environment that receives a PAM job. Data products for large PAMs at high resolutions (10 km) with upwards of 800 species can be constructed and analyzed in this way with reasonable response times. Results from the experiment are then posted back to the Job Server from the compute environment and are written to the database and file system shared by the LmDbServer and LmWebServer.

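
The PAM statistics defined in the Background (range size, site richness, Whittaker's beta and the mean proportional diversity field of a species) can be made concrete with a small worked example. The sketch below is an illustration under stated assumptions, not the LmRAD implementation: the function names and the toy matrix are hypothetical, rows are assumed to be sites and columns species, and plain Python lists are used only to keep the example dependency-free.

```python
def range_sizes(pam):
    """Number of sites occupied by each species (column totals)."""
    return [sum(column) for column in zip(*pam)]


def site_richness(pam):
    """Number of species present at each site (row totals)."""
    return [sum(row) for row in pam]


def whittaker_beta(pam):
    """Total species count divided by mean site richness, i.e. the
    reciprocal of the average proportional diversity of the study area."""
    n_species = len(pam[0])
    richness = site_richness(pam)
    return n_species / (sum(richness) / len(richness))


def mean_proportional_diversity_field(pam):
    """For each species: mean richness of the sites it occupies,
    expressed as a proportion of the total number of species."""
    n_species = len(pam[0])
    richness = site_richness(pam)
    fields = []
    for j, size in enumerate(range_sizes(pam)):
        # Diversity field volume: summed richness over the species' range.
        volume = sum(richness[i] for i, row in enumerate(pam) if row[j])
        fields.append(volume / size / n_species if size else float("nan"))
    return fields


# Toy PAM (assumed layout): 4 sites (rows) by 3 species (columns).
PAM = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
]
```

For this toy matrix the second species occupies the two richest sites, so its mean proportional diversity field (5/6) is higher than that of the other two species (2/3 each).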


Figure 2: Lifemapper Components.

Data parallelization across multi-core architectures in each of the environments hosting LmCompute helps to speed the PAM matrix construction, which uses a combination of Rtree (https://pypi.python.org/pypi/Rtree/) and matplotlib’s nxutils and GDAL (GDAL 2013) for vector and raster based intersections, respectively. Calculations on the matrices use NumPy (Jones 2001) built with the Basic Linear Algebra Subprograms (BLAS). Permutations on the PAM matrices use methods that are specific to binary matrices, where row and column totals can both be kept intact while changing the mix of species in sites and the range size of each species. Data parallelization is not suited to these computations since the entire matrix needs to be taken into account. But since several hundred permutations may be required per experiment, the current job-based parallelization across compute nodes works well for permuting several hundred matrices. Another method in LmRAD for permuting the matrix is well suited to both types of parallelization. It uses a dye dispersion algorithm, which is a 2-dimensional geometric-constraints model that assumes range continuity (Jetz and Rahbek 2001). Since range allocations are reassembled individually for each species, those data can be split across cores on a single machine or across nodes.

3.2 QGIS Plug-in, WPS Client, JavaScript, Plots, Visualization

The computational constraints of operating against large matrices in current desktop software informed the design of LmRAD as a client-server architecture using web services to off-load the heavy lifting for PAM operations to remote compute environments, with the use of a thick client inside a feature-rich open source GIS environment. The Lifemapper plug-in for QGIS allows QGIS to operate as a web service client to the LmSDM services and as a WPS client to the LmRAD analyses. Users can edit and submit data, parameterize inputs and request computations, with the added feature of being able to pull down statistical results and geospatial outputs and to work with phylogenetic tree data. The plug-in does this by interacting with a multi-platform Python client library that abstracts the communication layer away from the user. The Lm client library can also be decoupled from the client so that developers can use it to program a variety of standards-based clients. The added benefit of using the library within the QGIS plug-in is that all LmRAD functionality for dealing with PAM operations is wrapped in easy-to-use, point-and-click operations, with results automatically downloaded to a managed workspace and presented in QGIS.

Viewing the phylogenetic trees required that a highly interactive and lightweight interface be built in the plug-in without library dependencies. Rather than deal with heavy Qt (http://qt-project.org/) solutions for graphics, we decided to leverage recent advances in web-based standards for visualization of phylogenetic trees in the QGIS plug-in using a document-driven JavaScript framework. Tree data from the phylogenetic community can take any one of several forms: phyloXML (www..org), Newick (http://bit.ly/1n6ELcZ), Nexus (Maddison et al. 1997), and NeXML (http://nexml.org). All of these formats are easily translated into JSON, which maps into Python dictionaries and works well with web standards based solutions for visualizations


based on JavaScript. Additionally, tree providers like Open Tree of Life (http://blog.opentreeoflife.org/) are developing NexSON, a BadgerFish-convention JSON translation of NeXML, as a data document for transport from web-services that provide trees served from graph databases. Data like these are perfect for producing a scene graph, can be made available from web-services, are easily transported back and forth from LmCompute for analysis, and can be used directly in a document driven visualization framework.

The tree viewer presents the phylogenetic data as interactive SVG built dynamically from incrementally loaded JSON data. This is made possible with the JavaScript library D3.js (Data Driven Documents) (http://d3js.org/). D3 allows the JSON document to be dynamically bound to the Document Object Model so that data-driven transformations can be applied to the document with smooth transitions and fluid interaction. The data are directly mapped to visual elements in the DOM without an internal or intermediate representation or abstraction of the DOM. The document is the scene graph. This allows for much better performance since the focus is on transformation of the document (Bostock et al. 2011). Selections against the DOM are declarative in a functional programming style, with predicates from the W3C Selectors API similar to jQuery, allowing CSS properties to be specified as functions. Incoming data can create new nodes in the DOM, and outgoing data can remove nodes, using Enter and Exit selections. This is especially useful when navigating large trees, since the large number of nodes and edges in large phylogenies has in the past been a hurdle for visualizing tree data in a way that stays responsive for users accustomed to fast response times.

The D3 based interactive tree is rendered in the plug-in through a Qt dialog using QtWebKit. Communication between the tree and the rest of the plug-in is effected by the QtWebKit Bridge. The bridge allows the JavaScript and PyQt objects to communicate with one another. The tree viewer is linked to the interactive range-diversity plots in matplotlib (Hunter 2007) by simple PyQt signals and slots. A similar method connects the range-diversity plots for site-based statistics to the maps in QGIS based on the PAM. Using JavaScript in PyQt dialogs for QGIS allowed us to achieve fluid visual representations of trees for large clades, e.g. one tree used in testing is the entire phylogeny for the Phylum Mollusca with over 85,000 nodes.

4. Comparison of Approaches

Many phylogenetic analysis software implementations exist, too many to recount here; most are implemented as R scripts or as free, but not necessarily open, C++ software. Very few integrated systems exist that address biogeography, species communities, ecological niche and phylogeny. With the growth in phylogenetic data, web-based solutions for viewing trees are popular, but those concentrate on data already analyzed for specific taxa and tend to illustrate simple clade-area relationships. Challenges in analyzing and exploring large phylogenies exist on both the computation side and the visualization side. We mention some very powerful approaches that contain a spatial component in relation to phylogenetic analysis and compare them to our tool, which aims at bringing phylogenetic data into a GIS based tool that is sustainable and extensible, using an analysis that until now has not been systematized, based on PAMs and their inherent range-diversity relationships.

GeoSSE (Geographic State Speciation and Extinction, Goldberg et al. 2011) is a geographic range/phylogeny model. GeoSSE is an extension of the BiSSE (binary state speciation-extinction) model that allows tests for relationships between speciation or extinction and geographic range. GeoSSE is a method for analyzing the reciprocal influence of character traits and speciation/extinction, where character states are defined by spatial distributions. Transitions between states are parametrized in terms of range expansion through dispersal and range contraction through local extirpation. The model has the liability of requiring fairly large phylogenies, with one or two hundred species at the leaf nodes as a minimum. The increasing availability of larger trees should make this less of a problem in the future, but larger trees may also require computational solutions addressed by a distributed or parallel implementation.

Picante (Kembel et al. 2010) is a comprehensive R package for calculating phylogenetic diversity of ecological communities. It contains functions for both local, or alpha, phylogenetic diversity and beta phylogenetic diversity. Local community diversity indexes include Faith's phylogenetic diversity (PD) (Faith 1992), taxonomic distinctness indexes, mean pairwise phylogenetic distance (MPD) and mean nearest taxon distance (MNTD) within communities. Clustering and evenness are represented by several measures calculated in Picante. Beta phylogenetic diversity is also addressed with MPD and MNTD between communities, the Sørensen index and the UniFrac phylogenetic distance metric. Picante also has robust null model capability, performing numerous permutation procedures. Ecological correlation is also included with species-environmental regressions. Picante would be an extremely powerful addition to a workflow involving large matrices using parallel methods in R or a framework like Lifemapper. Picante's methods are staples and starting points for numerous different analyses that could be performed in QGIS, benefiting from an explicit spatial component, especially in regard to its ecological links to phylogenetic statistics.

Landis, Matzke, Moore and Huelsenbeck (2013) recognize that the main constraint on using models to describe the geographic evolution of species ranges as processes of dispersal and extinction is the computational limit on the number of areas that can be specified. Where Lifemapper chooses to leverage distributed computational resources to solve similar scale problems for large numbers of sites, the Landis et al. method uses a Bayesian approach for inferring biogeographic history that allows more realistic problems involving large numbers of geographic sites. It is implemented in BayArea, a free C++ command-line program that uses PAMs and phylogenetic data in the Newick format as inputs. Its outputs can be visualized as tree/map animations in an external JavaScript web service for filtering phylogenetic reconstructions and mapping them.

Biodiverse (Laffan, Lubarsky, Rosauer 2010), an open-source project similar to the Lifemapper plug-in, provides linked visualization across different data spaces. Biodiverse links species distributions in geographic, phylogenetic, taxonomic and matrix space. One advantage of Biodiverse, similar to Lifemapper, is that scale comparisons are achieved through a window analysis for endemism, phylogenetic diversity, and beta diversity. By varying the size of the windows one can start to understand the effects of scale on those statistics. Currently the Lifemapper plug-in uses a multi-grid approach where several subsets at different cell resolutions can be built out within the same experiment, allowing comparisons across scale for the range and diversity statistics, including beta diversity.

5. Future Directions and Conclusion

5.1 Incorporation of R for ad-hoc phylogenetic diversity-area measures against a PAM archive

The Lifemapper Project is exploring mapping its algorithms into a MapReduce paradigm using an Apache Hadoop-based Architecture (HBA) and software-defined systems (SDS) and Multiple-Domain Distribution/Replication (MDD) of Lifemapper itself as part of a push for investment in sustainable biodiversity cyberinfrastructure. Allowing Lifemapper to live at other institutions through MDD will allow platform owners to define the types of analyses supported by Lifemapper, meeting an ever growing need for more flexible and ad-hoc algorithm deployment. Researchers in the areas of bioinformatics that Lifemapper supports live in a world dominated by R scripting. Parallelizing R for Hadoop, using one of several well established methods for this, like R+Hadoop or RHIPE, may allow us to calculate larger jobs in a finer grained manner, allowing code reuse, and uncoupling analyses from siloed stacks in Python on LmCompute.

A useful application of this would be the calculation of phylogenetic diversity, over-dispersion/evenness and clustering for user defined subsets of a PAM archive or Global PAM (GPAM). With the GPAM, PAM construction could be pipelined and a continuously updated PAM archive for all the world's terrestrial species from GBIF could be subsetted, both taxonomically and spatially, by a user for on-demand data needs. Phylogenetic trees would have to be resolved from tree provider services, now coming on-line, for the species in the PAM, and Lifemapper services could enable those data through a phylo-to-matrix module that would abstract the phylogenetic topology into a series of matrices and provide permutations of the phylogenetic data for hypothesis testing. These products would have several over-linking uses across different types of analyses. Such a PAM archive and its computational architecture for distributing matrix math across compute resources could also support the quantitative evaluation of the joint effects of historic biogeographic events to test whether different species are more or less constrained by past biogeographic events. A meta-community analysis like this is outlined by Leibold et al. 2010, where the degree of contingent historical constraint is compared to
environmental suitability across a phylogeny using correlation matrices derived from several types of data. The authors of this method point to the need for addressing issues of range shifts and phylogenetic adaptation in meta-communities across several clades, requiring extensive phylogenetic information (Leibold et al. 2010). Adding more robust phylogenetic based analyses to models in Lifemapper in combination with the niche models in its archive would be a valuable resource for such an analysis.

5.2 Conclusion

We have summarized an on-going effort to incorporate phylogenetic data into a flexible computational platform for multi-species range and diversity modeling in order to bring a more complete history of the diversity patterns of species' communities into focus. Concentrating on range-diversity relationships and a species 'diversity field' derived from calculations on large matrices, presented to a thick GIS client in QGIS as web-services, allowed us to build a set of robust tools that leveraged open software and exposed those analyses to a larger audience, enabling transformative new science. The addition of phylogenetic data to the range-diversity plots and maps allows a user to explore community assembly of species habitats and answer questions about dispersal, competition and adaptation to the environment.

With the explosion of data across all areas of ecology and especially in the phylogenetic community, the need for scalable software solutions for dealing with computationally intensive calculations on large datasets is increasingly clear. Common to most of the methods discussed for analyzing phylogenies is the wish to combine them with environmental data and species range data. Macroecology and biogeography are becoming more cross-disciplinary and are incorporating more methods from community phylogenetics. As this happens phylogenetic datasets will need to reach across more of the tree of life. Spatially they will become biogeographical in scale, requiring that researchers have access to computational resources not easily accessible to non-computer specialists. A set of phylogenetic community ecology algorithms that leverage those resources through a suite of web services with a thick client should be designed for maximum flexibility, allowing code reuse, and should be definable by the end user, freeing the researcher to concentrate on formulating and testing hypotheses in order to be able to describe the earth's diversity and answer important questions about the fate of our planet's health. Lifemapper is a computational platform that answers some of these challenges: it has implemented a suite of range-diversity statistics never before formalized in relation to phylogenetic data, with a unique interface which scales to large phylogenetic trees, embedded within a rich spatial GIS environment.

Acknowledgements: Authors were supported by NSF/BIO/AVAToL Award #1208472. We are grateful to our colleagues and collaborators, Jorge Soberon, Andres Lira-Noreiga and Rafe Brown.

References

Arita, H. T., Christen, J. A., Rodriguez, P., & Soberón, J. (2008). 'Species diversity and distribution in presence-absence matrices: mathematical relationships and biological implications.' The American Naturalist, 172(4), 519-532.
Bostock, M., Ogievetsky, V., & Heer, J. (2011). 'D³ data-driven documents.' Visualization and Computer Graphics, IEEE Transactions on, 17(12), 2301-2309.
Cavner, J. A., Beach, J. H., Stewart, A. M., & Grady, C. J. (2014). Lifemapper Macroecology Range and Diversity Tools v. 2.0.1 [QGIS plugin, Computer Software]. Lawrence, KS: University of Kansas Biodiversity Institute. Available from http://plugins.qgis.org/plugins/lifemapperTools/
Cavner, J. A., Stewart, A. M., Grady, C. J., & Beach, J. H. (2012). 'An innovative Web Processing Services based GIS architecture for global biogeographic analyses of species distributions.' OSGeo Journal, 10(1), 11.
Cianciaruso, M. V. (2011). 'update: Beyond taxonomical space: large-scale ecology meets functional and phylogenetic diversity.' Frontiers of Biogeography, 3(3).
Emerson, B. C., & Gillespie, R. G. (2008). 'Phylogenetic analysis of community assembly and structure over space and time.' Trends in Ecology & Evolution, 23(11), 619-630.
Faith, D. P. (1992). 'Conservation evaluation and phylogenetic diversity.' Biological Conservation, 61(1), 1-10.
Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1994). Design patterns: elements of reusable object-oriented software. Pearson Education.
GDAL (2013). Geospatial Data Abstraction Library: version 1.10.1. Open Source Geospatial Foundation. http://gdal.osgeo.org
Goldberg, E. E., Lancaster, L. T., & Ree, R. H. (2011). 'Phylogenetic inference of reciprocal effects between geographic range evolution and diversification.' Systematic Biology, 60(4), 451-465.
Jetz, W., & Rahbek, C. (2001). 'Geometric constraints explain much of the species richness pattern in African birds.' Proc. Nat. Acad. Sci. USA, 98, 5661-5666.
Jones, E. (2001). SciPy: Open Source Scientific Tools for Python. http://www.scipy.org/
Kembel, S. W., Cowan, P. D., Helmus, M. R., Cornwell, W. K., Morlon, H., Ackerly, D. D., ... & Webb, C. O. (2010). 'Picante: R tools for integrating phylogenies and ecology.' Bioinformatics, 26(11), 1463-1464.
Laffan, S. W., Lubarsky, E., & Rosauer, D. F. (2010). 'Biodiverse, a tool for the spatial analysis of biological and related diversity.' Ecography, 33(4), 643-647.
Landis, M. J., Matzke, N. J., Moore, B. R., & Huelsenbeck, J. P. (2013). 'Bayesian analysis of biogeography when the number of areas is large.' Systematic Biology, 62(6), 789-804.
Leibold, M. A., Economo, E. P., & Peres-Neto, P. (2010). 'Metacommunity phylogenetics: separating the roles of environmental filters and historical biogeography.' Ecology Letters, 13(10), 1290-1299.
Maddison, D. R., Swofford, D. L., & Maddison, W. P. (1997). 'NEXUS: an extensible file format for systematic information.' Systematic Biology, 46(4), 590-621.
OGC (2007b). OpenGIS Web Processing Service, Version 1.0.0. Wayland, MA: OGC Document No 05-007r7.
Scheiner, S. M. (2012). 'A metric of biodiversity that integrates abundance, phylogeny, and function.' Oikos, 121(8), 1191-1202.
Stewart, A. M., Beach, J. H., Grady, C. J., & Cavner, J. A. (2014). Lifemapper [Computational platform services for species distribution modeling and continental-scale biodiversity pattern analyses]. Web: www.lifemapper.org
