<<

Full of beans: a study on the alignment of two flowering classification systems

Yi-Yun Cheng and Bertram Ludäscher

School of Information Sciences, University of Illinois at Urbana-Champaign, USA {yiyunyc2,ludaesch}@illinois.edu

Abstract. Advancements in technologies such as DNA analysis have given rise to new ways in organizing organisms in classification systems. In this paper, we examine the feasibility of aligning two classification systems for flowering plants using a logic-based, Region Connection Calculus (RCC-5) ap- proach. The older “Cronquist system” (1981) classifies plants using their mor- phological features, while the more recent Angiosperm Phylogeny Group IV (APG IV) (2016) system classifies based on many new methods including ge- nome-level analysis. In our approach, we align pairwise concepts X and Y from two taxonomies using five basic set relations: congruence (X=Y), inclusion (X>Y), inverse inclusion (X

Keywords: alignment, KOS alignment, interoperability

1 Introduction

With the advent of large-scale technologies and datasets, it has become increasingly difficult to organize information using a stable unitary classification scheme over time. An ideal work classification system, as noted in [1], should be neither too fine-grained, nor too esoteric, to stand the test of time. However, a real-life knowledge organization system (KOS) oftentimes has trade-offs in its stability and granularity, especially when new developments have been made technologically. For example, the use of DNA anal- ysis has provided new data signals that in turn have changed the way biologists classify organisms—traditionally, they may classify organisms based on similarities of surface- level features, while classifying based on similarities in micro-level DNA analysis has become prevalent now. Therefore, the interoperability among different KOSs over time 2 addressing the same topic has become more and more important in this current era with rapid developments and innovations. In the biodiversity communities, taxonomies, a of KOS under the classification schemes [2], has always been one of the main focuses of research; especially in the field of . As such, similar arguments about maintaining a ‘unitary classifi- cation’ over time were made in [3]. The authors [3] stated that it is common for - omists to contradict each other’s or even their own previous taxonomies. To this end, rather than having a permanent anchor for a specific KOS (taxonomy or classification scheme), a better approach, perhaps, is to embrace the fact that KOS are dynamic, time- specific, and responsive to both empirical signals and human classification interests. Thus, a more principled solution in dealing with the interoperability issues among KOS is perhaps to compare and align the classifications and at the same time presenting the disagreements among them [3][4]. In this paper, we propose the use of a logic-based approach to compare and recon- cile two major classification systems in the field of systematics, with the aim of demonstrating the feasibility of integrating two classifications that result in numerous possible solutions. Further, we demonstrate the computational power that can aid us in aligning KOS which could not have been possible when working with alignments man- ually. Specifically, we consider a use case of the flowering plants classification systems by (1) (1981) [5], and (2) Angiosperm Phylogeny Group IV [6] in to map plant families (concepts) mentioned in both classifications with one of the five Region Connection Calculus relations (RCC-5) using an open source, logic-based tool named Euler/X. The flowering plants we have included in this study are not only some of the most common families that we see in our everyday lives, such as the sun- ( sec. APG IV1), but also those flowering plants by a biologist’s definition such as the beans family (Fabaceae sec. APG IV). We hope that this work will further shed light on the possible alignments of the classifications in the infor- mation science community and bring a novel approach for aligning KOS in the future.

2 Two Classifications

In line with the traditions of Bentham and Hooker, Takhtajan, and Bessy, the Cronquist system [5] is an approach in classifying and identifying flowering plants based on phy- letics (classifying resemblances based on and morphological similarity - sim- ilar characters - of the plants). The Cronquist system divides the whole flowering plant world into two , and , with approximately 300 families included in the former, and 60 families in the latter [5]; this system is said to be the most “fully developed phyletic system” of flowering plant classification systems2 by far [7][8]. However, rapid breakthroughs in DNA studies and technologies have given rise to a more recent camp of approaches in classifying plants based on . Early

1 Taxonomic Concept Labels (TCLs): name sec. source 2 In biodiversity classification systems, the sequence of the concepts is not taken into account. 3 cladistic analysis or phylogenetic systematics, established by Willi Hennig, has put sys- tematics to the task of finding shared, derived character states among any three groups of organisms to find their common ancestors, or [9]. Modern phylogenetics con- sists of the use of “both morphological and molecular data and modern methods of data analyses to study evolutionary relationships among organisms” [7]. The Angiosperm Phylogeny Group system (APG IV) [6], is one classification in this camp that has be- come the de facto standard for the classification of plants of the modern era. The most noticeable differences between the APG IV system and the other plant classification systems are in the higher-level ranks. Instead of using classes or phyla, the APG uses clades (e.g. rosid , asterid clade), or even other higher-level ranks (e.g. monocots, ) [6]. Though the APG IV system has become the most recent major classifi- cation systems, the Cronquist system, established almost 40 ago, still remains highly influential to this date for its completeness and comprehensiveness, and many legacy papers with key plant data still used Cronquist’s classifications.

3 Reasoning about taxonomies and Euler/X

Taxonomies are one type of KOS with hierarchical structures, similar to classification schemes [2]. Taxonomy alignment refers to the mapping and reconciliation of two or more taxonomies. In our approach, we employ a logic-based approach called Region Connection Calculus (RCC-5) to align concepts across taxonomies using five possible relationships: congruence (X=Y), inclusion (X>Y), inverse inclusion (X

= T1.a T2.x T1.b T1.b T2.y T2.z T1.a T1.a Nodes Nodes T2.x T2.x T1.c congruent 3 T1.c congruent 3 T1.b T1.c T2.y T2.z T2.z T2.y Edges Edges Input Taxonomies PW1 is_a (input)PW22 is_a (input) 2 Nodes Nodes

T2.z T1.c congruent T2.z 1 T1.b congruent 1 T1.a T1.a T2 2 T2 T2.y 2 T2.x T2.x T1 2 T1 2 T1.b T2.y T1.c T2.y Edges EdgesT1.c overlaps (inferred) 1 T1.aoverlaps (inferred) 1 PW3 PW5 T2.x Nodes is_a Nodes(inferred) 2 is_a (inferred)Nodes 2 T2.z congruent 1 is_a (input) 2 is_a (input) 2 T2.y T1.c congruent T2.y 1 T1.b congruent 1 T2 2 T1.a T1.a T2 2 T2 T1.b 2 T1 2 T2.x T2.x T1 2 T1 2 T1.b T2.z T1.c T2.z Edges Edges Edges is_a (input) 4 PW4 is_a (inferred)PW62 overlapsPW7 (inferred) 1 overlaps (inferred) 4 is_a (input) 2 is_a (inferred) 2 overlaps (inferred) 1 is_a (input) 2 Fig. 1. A simple taxonomy alignment problem with T1, T2 and 7 possible worlds 4

Given two taxonomies T1 and T2, and a set of articulations (relationships between concept X in T1 and concept Y in T2, defined in RCC-5 relations), the Taxonomy Align- ment Problem (TAP) is to derive a merged taxonomy T3. A TAP may have zero, one, or many solutions – inconsistent, unique, or ambiguous solutions respectively. In pre- vious work [4], we have addressed the challenges of vocabulary confusion and interop- erability between similar but different taxonomies by reconciling the taxonomic - greement via unique Euler/X solutions, i.e., where there is only one “possible world” that includes all logically inferred inter-taxonomy relationships. In this paper, we fur- ther demonstrate features of Euler/X, i.e., its capability to naturally represent ambiguity via multiple possible worlds. A possible world is a consistent solution where there are no contradictions on the ways concepts were aligned, and that each concept in the two taxonomies is “sorted out” with exactly one of the five RCC relations. To show a simple TAP example and the multiple possible worlds that exist, consider two taxonomies T1 and T2, each with two children (Figure 1). Assume that we only know about these that the highest-level nodes are congruent to each other (T1.a == T 2.x); how T1.b, T1.c relate to T2.y, T2.z is unknown. With these underspecified articulations in our TAP, ambigui- ties arise. Therefore, we end up with seven possible worlds where PW1 and PW2 depict how the concepts can be aligned congruently, PW3 to PW6 depict similar but subtle differences on how T1.b, T1.c is included in or includes T2.y, T2.z, while PW7 depicts T1.b, T1.c, T2.y, T2.z all overlapping each other (red dashed lines).

4 Method

It is well-known that the basic taxonomic ranks in biological classification include king- dom, , , order, family, , and . In this research, we are mainly focusing on the alignment of the family-level flowering plants. Among the 295,383 flowering plants species [10] within the Magnoliophyta phylum sec. Cronquist, or the Angiosperms sec. APG IV, we are considering only the 40 most common families or out of a total of 416 families [6]. These families include Magnoli- aceae, , , Cataceae, , Fabaceae, , just to name a few. Both the Cronquist systems and the APG system have these 40 families or some modifications to the names of these families in their classifications. Each family serves as a concept in our alignment. In our first alignment study, if the family in both systems shares the exact same name, we assume (possible incorrectly) that they are congruent to each other. If there are similar but different names, we will leave the concepts un- mapped at first. To be more specific, if concept X in the Cronquist system is exactly the same as concept Y in the APG system, we will mark them as [C.X {=} APG.Y]. See Figure 2 for our initial input taxonomies. The following six articulations are the ones that are uncertain to us because they have different names:

[C.Caesalpiniaceae ? APG.] [C.Mimosaceae ? APG.] [C.Fabaceae ? APG.] [C. ? APG.Sapindaceae] [C.Sapindaceae ? APG.Sapindaceae] [C.Hippocastanaceae ? APG.Sapindaceae] 5

APG.Angiosperms APG.Eudicots APG. APG.RosidClade APG.AsteridClade APG.CoreEudicots APG.BasalEudicots APG.CaryophyllidClade APG. APG. APG. APG. APG. APG. APG. APG.Monocots APG. APG. APG. APG. APG. APG. APG. APG. APG. APG. APG. APG. APG. APG. APG. APG. APG. APG. APG. APG. APG.Fabaceae APG. APG.Cactaceae APG. APG. APG. APG.Betulaceae APG.Asteraceae APG. APG. APG. APG. APG. APG. APG. APG. APG. APG.Papaveraceae APG. APG.Ranunculaceae APG.Asclepiadaceae APG. APG. APG. ======APG. APG.Rosaceae = = = C.Oleaceae C.Violaceae C.Ericaceae C.Moraceae C.Fagaceae C.Cactaceae C.Malvaceae C.Lamiaceae C.Salicaceae C.Betulaceae C.Asteraceae APG. C.Solanaceae APG. APG. APG. C.Brassicaceae APG.Faboideae APG. C.Magnoliaceae C.Polygonaceae C.Caprifoliaceae C.Portulacaceae C.Cucurbitaceae C.Papaveraceae C.Euphorbiaceae APG. C.Ranunculaceae APG. C.Asclepiadaceae APG.Sapindaceae APG.Mimosoideae C.Caryophyllaceae C.Hamamelidaceae C.Scrophulariaceae APG.Caesalpinioideae ? ? ? ? ? ? ======C.Apiaceae C.Rosaceae C. C.Ericales C.Fagales C.Araceae C. C.Malvales C.Liliaceae C.Poaceae C.Lamiales C.Salicales C.Iridaceae C.Asterales C.Solanales C.Fabaceae C.Arecaceae C.Aceraceae C.Dipsacales C. C.Magnoliales C.Gentianales C.Polygonales C.Onagraceae C.Papaverales C.Mimosaceae C.Orchidaceae C.Sapindaceae C.Ranunculales C.Caryophyllales C. C.Scrophulariales C.Caesalpiniaceae C.Hippocastanaceae C.Apiales C.Rosales C.Euphorbiales C. C.Liliales C.Fabales C.Myrtales C.Arecales C. C.Cyperales C. C.Sapindales C.Magnoliidae C.Hamamelidae C.Caryophyllidae C. C. C. C.Commelinidae C.Liliopsida C.Magnoliopsida

C.Magnoliophyta 81 73 80 72 41 C Nodes APG Edges

Fig. 2. Initial input taxonomies with six unknown relations, ais_a(C) ligning the two classification sys- is_a(APG) articulations tems (green=Cronquist; yellow=APG), with the Fabaceae and the Sapinadaceae relationship unknown on both sides (red dotted lines) 6

Due to the above six uncertain relationships, we expect that the articulations be- tween the concepts of the two classifications in our initial input taxonomies are under- specified and will result in numerous possible worlds. Furthermore, the bold assump- tion to mark names that are spelled the same in both taxonomies as congruent is often not enough, as noted in [12]. This is the reason we conducted a second round of align- ment to seek out consultation from a domain expert to verify our ‘congruent’ align- ments as well as to sort out the underspecified articulations. The possible worlds we have produced in the first stage thus became the minimum viable product for us to com- municate with the expert and let him/her grasp all possible solutions for the alignment problem. The domain expert in our research asserted that the congruent alignments are indeed correct and that the modified articulations for the Fabaceae and the Sapindaceae families are as follows:

[C.Caesalpiniaceae {=} APG.Caesalpinioideae] [C.Mimosaceae {=} APG.Mimosoideae] [C.Fabaceae {=} APG.Faboideae] [C.Aceraceae {<} APG.Sapindaceae] [C.Sapindaceae {<} APG.Sapindaceae] [C.Hippocastanaceae {<} APG.Sapindaceae]

5 Results

The first round of alignment between the two classification systems resulted in 555 possible worlds—meaning that we have 555 different ways for reconciling these two classifications (see Figure 3 for a few examples). The alignment resulted in so many solutions because the articulations for the Fabaceae family (the beans family) and the Sapindaceae family (the maple family) were left ambiguous and underspecified. In the Cronquist system, the three families placed under the order Fabales are:

Caesalpiniaceae sec. Cronquist Mimosaceae sec. Cronquist Fabaceae sec. Cronquist While in the APG IV system, the order Fabales only consists of one family Fabaceae, and under the Fabaceae there are three sub-families:

Caesalpinioideae sec. APG IV Mimosoideae sec. APG IV Faboideae sec. APG IV First, the family/ names between the two systems are not exactly the same, the spelling is different in the suffix ( -ceae vs. -deae); furthermore, the ‘umbrella’ fam- ily Fabaceae in the APG system is spelled the same as one of the three bean families in the Cronquist system (also Fabaceae). Similarly, we raise doubts on whether the Sap- indaceae in the Cronquist system is entirely equivalent to the Sapindaceae sec. APG IV; though they share the exact same name, the Aceraceae sec. Cronquist, and Hippo- castanaceae sec. Cronquist, also in the order of Sapindales, were totally void in the APG system. Therefore, in our initial attempt we could not firmly state what the articulations among these families were, echoing the claim that “names are not enough” in [11]. 7

APG.Liliaceae APG.Liliaceae APG.Poaceae APG.Liliales APG.Poaceae APG.Liliales APG.Poales C.Liliaceae APG.Poales C.Liliaceae PW2 C.Commelinidae C.Commelinidae C.Cyperales C.Liliales PW1 C.Cyperales C.Liliales C.Poaceae C.Iridaceae C.Poaceae C.Iridaceae APG.Iridaceae APG.Iridaceae C.Liliidae APG.Asparagales C.Liliidae APG.Asparagales APG.Orchidaceae APG.Orchidaceae C.Liliopsida C.Orchidaceae C.Liliopsida C.Orchidaceae C.Orchidales C.Orchidales APG.Monocots APG.Alismatales APG.Solanaceae APG.Monocots APG.Alismatales APG.Solanaceae APG.Araceae APG.Solanales C.Arecidae C.Arecidae APG.Araceae APG.Solanales C.Araceae C.Solanaceae C.Araceae C.Solanaceae C.Arales C.Solanales APG.Lamiaceae C.Arales C.Solanales APG.Lamiaceae APG.Papaveraceae C.Lamiaceae Nodes APG.Papaveraceae C.Lamiaceae Nodes C.Papaveraceae C.Lamiales C.Papaveraceae C.Lamiales APG.Arecaceae APG.Arecaceae C.Papaverales APG.Lamiales C.Oleaceae congruent 37 C.Papaverales APG.Lamiales C.Oleaceae congruent 37 APG.Arecales APG.Oleaceae APG.Arecales APG.Oleaceae C.Arecaceae C 20 C.Arecaceae C 20 C.Arecales APG.Ranunculaceae C.Scrophulariales C.Arecales APG.Ranunculaceae C.Scrophulariales APG.Asteraceae APG 14 APG.Asteraceae APG 14 C.Ranunculaceae C.Scrophulariaceae C.Ranunculaceae C.Scrophulariaceae C.Ranunculales APG.Asterales APG.Scrophulariaceae C.Ranunculales APG.Asterales APG.Scrophulariaceae APG.Magnoliaceae C.Asteraceae Edges APG.Magnoliaceae C.Asteraceae Edges APG.Magnoliales C.Asterales APG.Magnoliales C.Asterales APG.Magnoliids is_a (inferred) 18 APG.Magnoliids is_a (inferred) 18 C.Asteridae C.Asteridae C.Magnoliaceae overlaps (inferred) 23 C.Magnoliaceae C.Magnoliophyta C.Magnoliales APG.Caprifoliaceae C.Magnoliales APG.Caprifoliaceae overlaps (inferred) 24 C.Magnoliopsida C.Magnoliidae C.Magnoliophyta C.Magnoliopsida C.Magnoliidae APG.Angiosperms APG.Dipsacales is_a (input) 68 APG.Angiosperms APG.Dipsacales is_a (input) 68 APG.Ericaceae C.Caprifoliaceae APG.Ericaceae C.Caprifoliaceae APG.BasalEudicots APG.Ericales C.Dipsacales APG.BasalEudicots APG.Ericales C.Dipsacales APG.Ranunculales C.Ericaceae APG.Ranunculales C.Ericaceae C.Ericales C.Ericales APG.Asclepiadaceae APG.Asclepiadaceae APG.Gentianales APG.Gentianales APG.Apiaceae C.Asclepiadaceae C.Caryophyllaceae APG.Apiaceae C.Asclepiadaceae C.Caryophyllaceae APG.AsteridClade C.Gentianales APG.Caryophyllaceae APG.AsteridClade APG.Apiales APG.Apiales C.Gentianales APG.Caryophyllaceae C.Apiaceae C.Apiaceae C.Apiales C.Apiales C.Caryophyllales C.Portulacaceae C.Caryophyllales C.Portulacaceae APG.Portulacaceae APG.Portulacaceae APG.Caryophyllales APG.Caryophyllales APG.CaryophyllidClade APG.CaryophyllidClade APG.Polygonaceae C.Cactaceae APG.Polygonaceae C.Cactaceae C.Polygonaceae APG.Cactaceae C.Polygonaceae APG.Cactaceae C.Polygonales C.Polygonales C.Violales C.Violales APG.Eudicots C.Caryophyllidae APG.Eudicots C.Caryophyllidae APG.Cucurbitaceae APG.Cucurbitaceae APG.Cucurbitales APG.Cucurbitales C.Cucurbitaceae C.Cucurbitaceae APG.Malvaceae APG.Malvaceae APG.Malvales C.Violaceae APG.Malvales C.Violaceae C.Malvaceae APG.Violaceae C.Malvaceae APG.Violaceae C.Malvales C.Malvales APG.Salicaceae APG.Salicaceae APG.Brassicaceae APG.Brassicaceae C.Salicaceae C.Salicaceae APG.Brassicales C.Salicales APG.Brassicales C.Salicales C.Brassicaceae C.Brassicaceae C.Capparales C.Caesalpiniaceae C.Capparales C.Mimosaceae

APG.Malpighiales C.Fabaceae APG.Mimosoideae APG.Malpighiales C.Fabaceae C.Caesalpiniaceae APG.RosidClade APG.RosidClade C.Fabales APG.Fabaceae C.Fabales APG.Fabaceae APG.Fabales APG.Faboideae APG.Fabales APG.Mimosoideae APG.Myrtales APG.Myrtales APG.Onagraceae APG.Caesalpinioideae APG.Onagraceae APG.Faboideae C.Rosidae APG.Euphorbiaceae C.Rosidae APG.Euphorbiaceae C.Myrtales C.Myrtales C.Onagraceae C.Euphorbiaceae C.Onagraceae C.Euphorbiaceae C.Euphorbiales C.Euphorbiales C.Mimosaceae APG.Caesalpinioideae

APG.Rosaceae APG.Rosaceae C.Rosaceae C.Rosaceae C.Rosales C.Hippocastanaceae C.Rosales C.Hippocastanaceae APG.Rosales APG.Rosales C.Sapindales C.Aceraceae C.Sapindales C.Aceraceae APG.Sapindaceae APG.Sapindaceae APG.Sapindales C.Fagaceae APG.Sapindales C.Fagaceae APG.Fagaceae C.Sapindaceae APG.Fagaceae C.Sapindaceae C.Fagales C.Fagales APG.Fagales C.Betulaceae APG.Fagales C.Betulaceae C.Hamamelidae APG.Betulaceae C.Hamamelidae APG.Betulaceae

APG.Moraceae APG.Moraceae C.Moraceae C.Moraceae APG.CoreEudicots C.Urticales APG.CoreEudicots C.Urticales APG.Hamamelidaceae APG.Hamamelidaceae APG.Saxifragales APG.Saxifragales C.Hamamelidaceae C.Hamamelidaceae C.Hamamelidales C.Hamamelidales

APG.Liliaceae APG.Poaceae APG.Liliales APG.Liliaceae APG.Poales C.Liliaceae APG.Poaceae APG.Liliales C.Commelinidae APG.Poales C.Liliaceae C.Cyperales C.Liliales C.Commelinidae C.Poaceae C.Iridaceae C.Cyperales C.Liliales APG.Iridaceae C.Poaceae C.Iridaceae PW3 APG.Iridaceae PW4 APG.Asparagales C.Liliidae APG.Orchidaceae C.Liliidae APG.Asparagales C.Orchidaceae APG.Orchidaceae C.Liliopsida APG.Monocots APG.Alismatales C.Orchidales APG.Solanaceae C.Liliopsida C.Orchidaceae APG.Araceae APG.Solanales APG.Monocots APG.Alismatales C.Orchidales APG.Solanaceae C.Arecidae C.Araceae C.Solanaceae C.Arecidae APG.Araceae APG.Solanales C.Arales C.Solanales APG.Lamiaceae C.Araceae C.Solanaceae APG.Papaveraceae C.Lamiaceae Nodes C.Arales C.Solanales APG.Lamiaceae C.Papaveraceae C.Lamiales APG.Papaveraceae C.Lamiaceae Nodes APG.Arecaceae C.Papaverales APG.Lamiales C.Oleaceae congruent 37 C.Papaveraceae C.Lamiales APG.Arecales APG.Oleaceae APG.Arecaceae C.Papaverales APG.Lamiales C.Oleaceae congruent 37 C.Arecaceae C 20 APG.Arecales APG.Oleaceae C.Arecales APG.Ranunculaceae C.Scrophulariales C.Arecaceae C 20 APG.Asteraceae APG 14 C.Arecales C.Scrophulariales C.Ranunculaceae C.Scrophulariaceae APG.Ranunculaceae C.Ranunculales APG.Asterales APG.Scrophulariaceae C.Ranunculaceae APG.Asteraceae C.Scrophulariaceae APG 14 APG.Magnoliaceae C.Asteraceae Edges C.Ranunculales APG.Asterales APG.Scrophulariaceae Edges APG.Magnoliales C.Asterales APG.Magnoliaceae C.Asteraceae APG.Magnoliids is_a (inferred) 18 APG.Magnoliales C.Asterales C.Magnoliaceae C.Asteridae APG.Magnoliids is_a (inferred) 18 overlaps (inferred) 22 C.Asteridae C.Magnoliophyta C.Magnoliales APG.Caprifoliaceae C.Magnoliaceae C.Magnoliopsida C.Magnoliidae APG.Dipsacales C.Magnoliales APG.Caprifoliaceae overlaps (inferred)APG.Angiosperms23 is_a (input) 68 C.Magnoliophyta C.Magnoliopsida C.Magnoliidae APG.Ericaceae C.Caprifoliaceae APG.Angiosperms APG.Dipsacales is_a (input) 68 C.Caprifoliaceae APG.BasalEudicots APG.Ericales C.Dipsacales APG.Ericaceae APG.Ranunculales APG.BasalEudicots APG.Ericales C.Dipsacales C.Ericaceae C.Ericales APG.Ranunculales C.Ericaceae APG.Asclepiadaceae C.Ericales APG.Asclepiadaceae APG.Gentianales APG.Gentianales APG.Apiaceae C.Asclepiadaceae C.Caryophyllaceae APG.AsteridClade APG.Apiaceae C.Asclepiadaceae C.Caryophyllaceae APG.Apiales C.Gentianales APG.Caryophyllaceae APG.AsteridClade APG.Apiales C.Gentianales APG.Caryophyllaceae C.Apiaceae C.Apiales C.Apiaceae C.Portulacaceae C.Apiales C.Caryophyllales C.Portulacaceae APG.Portulacaceae C.Caryophyllales APG.Portulacaceae APG.Caryophyllales APG.Caryophyllales APG.CaryophyllidClade APG.CaryophyllidClade APG.Polygonaceae C.Cactaceae APG.Polygonaceae C.Cactaceae C.Polygonaceae APG.Cactaceae C.Polygonales C.Polygonaceae APG.Cactaceae C.Violales C.Polygonales C.Violales APG.Eudicots C.Caryophyllidae APG.Cucurbitaceae APG.Eudicots C.Caryophyllidae APG.Cucurbitaceae APG.Cucurbitales APG.Cucurbitales C.Cucurbitaceae C.Cucurbitaceae APG.Malvaceae APG.Malvaceae APG.Malvales C.Violaceae APG.Malvales C.Violaceae C.Malvaceae APG.Violaceae C.Malvaceae APG.Violaceae C.Malvales C.Malvales APG.Salicaceae APG.Salicaceae APG.Brassicaceae APG.Brassicaceae C.Salicaceae C.Salicaceae APG.Brassicales C.Salicales APG.Brassicales C.Salicales C.Brassicaceae C.Brassicaceae C.Capparales APG.Mimosoideae C.Capparales C.Caesalpiniaceae

APG.Malpighiales C.Fabaceae APG.Faboideae APG.Malpighiales C.Fabaceae APG.Faboideae APG.RosidClade APG.RosidClade C.Fabales APG.Fabaceae C.Fabales APG.Fabaceae APG.Fabales APG.Caesalpinioideae APG.Fabales APG.Mimosoideae APG.Myrtales APG.Myrtales APG.Onagraceae C.Mimosaceae APG.Onagraceae APG.Caesalpinioideae C.Rosidae APG.Euphorbiaceae C.Rosidae APG.Euphorbiaceae C.Myrtales C.Myrtales C.Onagraceae C.Euphorbiaceae C.Onagraceae C.Euphorbiaceae C.Euphorbiales C.Euphorbiales C.Caesalpiniaceae C.Mimosaceae

APG.Rosaceae APG.Rosaceae C.Rosaceae C.Rosaceae C.Hippocastanaceae C.Rosales C.Hippocastanaceae C.Rosales

APG.Rosales APG.Rosales C.Sapindales C.Aceraceae C.Sapindales C.Aceraceae

APG.Sapindaceae APG.Sapindaceae APG.Sapindales APG.Sapindales C.Fagaceae C.Fagaceae C.Sapindaceae APG.Fagaceae C.Sapindaceae APG.Fagaceae C.Fagales C.Fagales APG.Fagales C.Betulaceae APG.Fagales C.Betulaceae C.Hamamelidae APG.Betulaceae C.Hamamelidae APG.Betulaceae

APG.Moraceae APG.Moraceae C.Moraceae C.Moraceae APG.CoreEudicots C.Urticales APG.CoreEudicots C.Urticales APG.Hamamelidaceae APG.Hamamelidaceae APG.Saxifragales APG.Saxifragales C.Hamamelidaceae C.Hamamelidaceae C.Hamamelidales C.Hamamelidales

APG.Liliaceae APG.Liliaceae APG.Poaceae APG.Liliales APG.Poaceae APG.Poales APG.Liliales C.Liliaceae APG.Poales C.Liliaceae C.Commelinidae C.Commelinidae C.Cyperales C.Liliales C.Cyperales C.Liliales C.Poaceae C.Iridaceae C.Poaceae C.Iridaceae PW5 APG.Iridaceae PW6 APG.Iridaceae APG.Asparagales C.Liliidae C.Liliidae APG.Asparagales APG.Orchidaceae APG.Orchidaceae C.Liliopsida C.Orchidaceae C.Orchidaceae C.Orchidales C.Liliopsida APG.Monocots APG.Alismatales APG.Solanaceae APG.Monocots APG.Alismatales C.Orchidales APG.Solanaceae C.Arecidae APG.Araceae APG.Solanales APG.Araceae APG.Solanales C.Araceae C.Solanaceae C.Arecidae C.Araceae C.Solanaceae C.Arales C.Solanales APG.Lamiaceae C.Arales C.Solanales APG.Lamiaceae APG.Papaveraceae C.Lamiaceae Nodes APG.Papaveraceae C.Lamiaceae Nodes C.Papaveraceae C.Lamiales C.Papaveraceae C.Lamiales APG.Arecaceae C.Papaverales C.Oleaceae APG.Lamiales congruent 37 APG.Arecaceae C.Papaverales C.Oleaceae congruent 37 APG.Arecales APG.Oleaceae APG.Arecales APG.Lamiales C.Arecaceae C 20 APG.Oleaceae C.Scrophulariales C.Arecaceae C 20 C.Arecales APG.Ranunculaceae C.Arecales C.Scrophulariales APG.Asteraceae APG 14 APG.Ranunculaceae C.Ranunculaceae C.Scrophulariaceae C.Ranunculaceae APG.Asteraceae C.Scrophulariaceae APG 14 C.Ranunculales APG.Asterales APG.Scrophulariaceae Edges C.Ranunculales APG.Asterales APG.Scrophulariaceae APG.Magnoliaceae C.Asteraceae APG.Magnoliaceae C.Asteraceae Edges APG.Magnoliales C.Asterales is_a (inferred) 18 APG.Magnoliales C.Asterales APG.Magnoliids APG.Magnoliids is_a (inferred) 18 C.Asteridae C.Asteridae C.Magnoliaceae overlaps (inferred) 23 C.Magnoliaceae C.Magnoliophyta C.Magnoliales APG.Caprifoliaceae C.Magnoliales APG.Caprifoliaceae overlaps (inferred) 22 C.Magnoliopsida C.Magnoliidae APG.Dipsacales C.Magnoliophyta C.Magnoliopsida C.Magnoliidae APG.Angiosperms is_a APG.Angiosperms(input) 68 APG.Dipsacales is_a (input) 68 APG.Ericaceae C.Caprifoliaceae APG.Ericaceae C.Caprifoliaceae APG.BasalEudicots APG.Ericales C.Dipsacales APG.BasalEudicots APG.Ericales C.Dipsacales APG.Ranunculales C.Ericaceae APG.Ranunculales C.Ericaceae C.Ericales C.Ericales APG.Asclepiadaceae APG.Asclepiadaceae APG.Gentianales APG.Gentianales APG.Apiaceae C.Asclepiadaceae C.Caryophyllaceae C.Asclepiadaceae C.Caryophyllaceae APG.AsteridClade APG.Apiaceae APG.Apiales C.Gentianales APG.Caryophyllaceae APG.AsteridClade APG.Apiales C.Gentianales APG.Caryophyllaceae C.Apiaceae C.Apiaceae C.Apiales C.Apiales C.Caryophyllales C.Portulacaceae C.Portulacaceae APG.Portulacaceae C.Caryophyllales APG.Portulacaceae APG.Caryophyllales APG.Caryophyllales APG.CaryophyllidClade APG.CaryophyllidClade APG.Polygonaceae C.Cactaceae APG.Polygonaceae C.Cactaceae C.Polygonaceae APG.Cactaceae C.Polygonaceae APG.Cactaceae C.Polygonales C.Polygonales C.Violales C.Violales APG.Eudicots C.Caryophyllidae APG.Eudicots C.Caryophyllidae APG.Cucurbitaceae APG.Cucurbitaceae APG.Cucurbitales APG.Cucurbitales C.Cucurbitaceae C.Cucurbitaceae APG.Malvaceae APG.Malvaceae APG.Malvales C.Violaceae APG.Malvales C.Violaceae C.Malvaceae APG.Violaceae C.Malvaceae APG.Violaceae C.Malvales C.Malvales

APG.Salicaceae APG.Salicaceae APG.Brassicaceae C.Salicaceae APG.Brassicaceae APG.Brassicales C.Salicaceae C.Salicales APG.Brassicales C.Salicales C.Brassicaceae C.Brassicaceae C.Capparales C.Mimosaceae C.Capparales C.Caesalpiniaceae

APG.Malpighiales C.Fabaceae C.Caesalpiniaceae APG.Malpighiales C.Fabaceae APG.Mimosoideae APG.RosidClade APG.RosidClade APG.Fabaceae C.Fabales APG.Mimosoideae C.Fabales APG.Fabaceae APG.Fabales APG.Fabales APG.Faboideae

APG.Myrtales APG.Myrtales APG.Faboideae C.Rosidae APG.Onagraceae APG.Onagraceae APG.Caesalpinioideae C.Myrtales APG.Euphorbiaceae C.Rosidae C.Myrtales APG.Euphorbiaceae C.Onagraceae C.Euphorbiaceae C.Onagraceae C.Euphorbiaceae C.Euphorbiales C.Euphorbiales APG.Caesalpinioideae C.Mimosaceae

APG.Rosaceae APG.Rosaceae C.Rosaceae C.Hippocastanaceae C.Rosaceae C.Hippocastanaceae C.Rosales C.Rosales

APG.Rosales APG.Rosales C.Sapindales C.Aceraceae C.Sapindales C.Aceraceae

APG.Sapindaceae APG.Sapindaceae APG.Sapindales C.Fagaceae APG.Sapindales C.Sapindaceae C.Fagaceae C.Sapindaceae APG.Fagaceae APG.Fagaceae C.Fagales C.Fagales APG.Fagales C.Betulaceae APG.Fagales C.Betulaceae C.Hamamelidae APG.Betulaceae C.Hamamelidae APG.Betulaceae

APG.Moraceae APG.Moraceae C.Moraceae C.Moraceae APG.CoreEudicots C.Urticales APG.CoreEudicots C.Urticales APG.Hamamelidaceae APG.Hamamelidaceae APG.Saxifragales APG.Saxifragales C.Hamamelidaceae C.Hamamelidaceae C.Hamamelidales C.Hamamelidales Fig. 3. Six of the 555 possible worlds to model the merged view of the two classification, with the main differences marked in the red boxes. = APG.Poaceae C.Cyperales C.Poaceae APG.Poales C.Commelinidae = C.Arales C.Araceae APG.Araceae APG.Alismatales

= C.Arecidae C.Arecales C.Arecaceae APG.Arecaceae APG.Arecales C.Liliopsida APG.Monocots = APG.Liliaceae APG.Liliales C.Liliaceae

C.Liliidae C.Liliales C.Iridaceae = APG.Iridaceae APG.Asparagales

C.Orchidales = C.Orchidaceae APG.Orchidaceae

= C.Magnoliaceae APG.Magnoliaceae C.Magnoliales APG.Magnoliales = C.Papaverales C.Papaveraceae APG.Papaveraceae C.Magnoliidae C.Ranunculales C.Ranunculaceae = APG.Ranunculaceae APG.Ranunculales

= C.Lamiaceae APG.Lamiaceae APG.BasalEudicots C.Lamiales C.Scrophulariaceae = APG.Scrophulariaceae APG.Magnoliids APG.Lamiales C.Scrophulariales C.Oleaceae = APG.Oleaceae

= C.Gentianales C.Asclepiadaceae APG.Asclepiadaceae APG.Gentianales APG.Angiosperms = C.Asteridae C.Dipsacales C.Caprifoliaceae APG.Caprifoliaceae APG.Dipsacales

C.Solanales C.Solanaceae = APG.Solanaceae APG.Solanales APG.AsteridClade C.Asterales C.Asteraceae = APG.Asteraceae APG.Asterales

APG.Apiales = APG.Apiaceae C.Apiaceae APG.Ericales C.Apiales = APG.Ericaceae C.Ericaceae APG.Eudicots

APG.CaryophyllidClade = 8 C.Ericales C.Cactaceae APG.Cactaceae

C.Caryophyllaceae = APG.Caryophyllaceae APG.Caryophyllales C.Caryophyllales = Nodes After consulting the domain expert on our secondC.Portulacaceae round of alignment, we APG.Portulacaceaewere able

C 81 to refine our alignments betweenC.Polygonales the two taxonomies and make the relationships among C.Polygonaceae = APG.Polygonaceae APG 73 the beans families explicit. Despite the slight differences in the suffix, the three families Edges C.Magnoliophyta C.Magnoliopsida C.Caryophyllidae C.Salicales C.Salicaceae = is_a (C) 80 under Cronquist indeed were congruent to the three sub-families within the APG sys- C.Violales APG.Salicaceae is_a (APG) 72 tem (Figure 4), and that the Sapindaceae sec. CronquistC.Violaceae, Aceraceae sec.= Cronquist, and articulations 41 Hippocastanaceae sec. CronquistC.Capparales are all combined as one as Sapindaceae sec. APG.ViolaceaeAPG IV = APG.Malpighiales C.Malvales C.Cucurbitaceae now. APG.Cucurbitaceae C.Brassicaceae = APG.Cucurbitales C.Euphorbiales APG.Brassicaceae C.Euphorbiaceae = APG.Euphorbiaceae APG.Brassicales C.Mimosaceae = C.Malvaceae =

C.Rosidae C.Fabales C.Fabaceae APG.Malvaceae APG.Malvales = APG.Mimosoideae APG.RosidClade C.Caesalpiniaceae APG.Fabales = APG.Faboideae APG.Fabaceae

C.Hippocastanaceae < APG.Caesalpinioideae APG.Sapindales C.Sapindales < C.Aceraceae APG.Sapindaceae < APG.Myrtales C.Myrtales C.Sapindaceae APG.Onagraceae = APG.CoreEudicots Fig. 4. Refinement of the inputC.Onagraceae alignment to align the Fabaceae and the Sapinadeae family = APG.Betulaceae APG.Fagales C.Betulaceae = APG.Fagaceae C.Fagaceae Given this refinement in theC.Fagales relationship among the beans and the maple families,

instead of having 555 possible worlds, we were able to reduce= theAPG.Rosaceae ambiguities and thus APG.Rosales the number of possibleC.Rosales worlds to only one C.Rosaceaeunique possible world (Figure 5), where we = APG.Moraceae C.Hamamelidae C.Moraceae can see the congruence in families,C.Urticales the inferred relationship among the higher-level APG.Saxifragales = APG.Hamamelidaceae nodes, and the two original classifications.C.Hamamelidales All the C.Hamamelidaceaeresults can be retrieved from our

Github repository (https://github.com/yiyunyc2/NKOS18)= .

6 Conclusion

New developments, whether in technologies or in paradigms, always challenge the views of past major knowledge organization systems. In the biodiversity domain, mo- lecular data when examined under careful sampling, can provide valuable pointers to the classification of plants. However, to quote [12]: “... molecular characters are subject to evolutionary convergence, parallelism, and reversal; therefore, molecular methods are not a panacea. Molecular evidence should be used with, not in place of, morpholog- ical evidence.” Though the APG system has shown more substantial importance in re- cent years due to the advancement in micro-level analysis of the molecular data, the Cronquist system still maintains its esteemed role for its comprehensiveness and pre- ciseness in morphologically classifying the flowering plants. This paper serves as an exploratory research on the comparison and alignment of two KOS, specifically classification schemes. Our approach suggests that classifica- tions can coexist with each other while disambiguating the names among concepts in a merged possible world. Our approach also demonstrates the capability of computation- ally solving complex logic-based alignments for cases where, e.g., due to underspeci- fied relations in the input KOS alignments, manual efforts would likely fail to yield all 555 different ways to merge and reconcile the two KOS. 9

C.Fagaceae APG.Fagaceae

APG.CoreEudicots APG.Hamamelidaceae C.Betulaceae APG.Saxifragales APG.Betulaceae C.Hamamelidaceae C.Hamamelidales C.Fabaceae APG.Faboideae C.Fagales APG.Fagales C.Caesalpiniaceae APG.Caesalpinioideae APG.Fabaceae C.Hamamelidae APG.Fabales C.Mimosaceae C.Fabales APG.Mimosoideae

APG.Moraceae C.Moraceae C.Urticales APG.Rosales APG.Rosaceae C.Rosaceae C.Rosales

APG.Sapindaceae APG.Sapindales C.Hippocastanaceae C.Sapindales

C.Rosidae C.Aceraceae

APG.Myrtales APG.Onagraceae C.Sapindaceae C.Myrtales C.Onagraceae APG.RosidClade APG.Euphorbiaceae APG.Malpighiales C.Euphorbiaceae C.Euphorbiales

APG.Malvaceae APG.Malvales C.Violaceae C.Malvaceae APG.Violaceae C.Malvales

APG.Cucurbitaceae C.Violales APG.Cucurbitales C.Cucurbitaceae

APG.Apiaceae APG.Apiales C.Apiaceae C.Apiales APG.Salicaceae C.Salicaceae C.Salicales APG.Brassicaceae C.Portulacaceae APG.Brassicales APG.Portulacaceae C.Brassicaceae C.Capparales APG.Eudicots C.Caryophyllidae C.Cactaceae C.Caryophyllales APG.Cactaceae

APG.Polygonaceae C.Caryophyllaceae APG.Caryophyllales C.Polygonaceae APG.Caryophyllaceae Nodes APG.AsteridClade APG.CaryophyllidClade C.Polygonales C.Oleaceae C 15 APG.Oleaceae APG.Ericaceae congruent 42 APG.Ericales APG.Lamiales C.Scrophulariales C.Ericaceae C.Scrophulariaceae APG 9 C.Ericales APG.Scrophulariaceae APG.Lamiaceae Edges APG.Asteraceae C.Lamiaceae APG.Asterales C.Lamiales is_a (inferred) 12 C.Asteridae C.Asteraceae C.Asterales overlaps (inferred) 12 is_a (input) 69 APG.BasalEudicots APG.Ranunculaceae C.Ranunculaceae APG.Caprifoliaceae C.Magnoliopsida C.Magnoliidae APG.Ranunculales APG.Dipsacales C.Ranunculales C.Caprifoliaceae C.Dipsacales APG.Magnoliaceae APG.Magnoliales APG.Papaveraceae APG.Magnoliids C.Papaveraceae APG.Asclepiadaceae APG.Poaceae C.Magnoliaceae C.Papaverales APG.Gentianales APG.Poales C.Magnoliales C.Asclepiadaceae C.Commelinidae C.Gentianales C.Cyperales APG.Orchidaceae C.Orchidaceae C.Magnoliophyta C.Liliopsida C.Poaceae APG.Asparagales APG.Angiosperms APG.Monocots C.Orchidales APG.Solanaceae APG.Solanales C.Liliidae C.Solanaceae C.Iridaceae C.Solanales C.Liliales APG.Iridaceae

APG.Arecaceae APG.Liliaceae APG.Arecales APG.Liliales C.Arecidae C.Arecaceae C.Liliaceae C.Arecales

APG.Alismatales APG.Araceae C.Araceae C.Arales

Fig. 5. The final one possible world merged solution for reconciling the two flowering plants classification systems. (green rectangular box=Cronquist; yellow “note” box=APG; grey round box=congruent; red concrete line=inferred parent-child relationship by Euler/X; red dotted line=inferred overlapping relationship from Euler/X) 10

Somewhat ironically, the limitation in this study also lies in the strong inferential power—in which when a parent node only has one child, the RCC reasoner in Euler/X will collapse the concepts and merge them as the same node. For example, Euler/X derived that the family Ericaceae and the order Ericales are exactly the same and merged Cronquist.Ericaceae, APG.Ericaceae, Cronquist.Ericales, and APG.Ericales as congruent. However, we argue that this limitation is due to the fact that we have only chosen some 40 major flower families instead of all 416 families. We could add missing children or artificial children here, or include the full 416 angiosperms families, but this is beyond the scope of our study. For the purposes of demonstrating the logic-based taxonomy alignment approach, our smaller use case is sufficient and more tractable . It is also worth noting that domain expert opinions are still needed to differentiate and lead us to the single solution we are looking for. We foresee our logic-based ap- proach for aligning KOS as an essential preliminary processing steps and a minimum viable product before bringing the KOS interoperability alignment problems to the do- main experts. KOS alignment problems are usually complicated with a slow learning curve; domain experts, in our case, plant systematists, may not fully comprehend at first the reason we need to align different KOS. If we approach them directly with two clas- sifications and ask them one by one what relationships between each concept are, they will probably feel befuddled by the situation. Once we have the Euler/X-generated pos- sible worlds in the first round of alignments, whether a few, tens, hundreds, or even thousands of PWs, the alignment problem will become concrete to the domain experts and consulting them for validation of the articulations would be much easier. This study, we also believe, has further implications on making our classification systems more “full of beans” (here we take on the positive connotation, meaning full of energy), meaning that it may open doors to enable semantic interoperability, and enrich diversity in classification systems when we work with KOS alignments using the logic- based RCC-5 approach. We believe that in the future we can implement this approach for semantic interoperability issues among classifications in the information science community, or even other higher-level KOS such as ontologies.

7 Acknowledgement

The first author wishes to thank Dr. Stephen R. Downie for the wonderful introduc- tion to plant systematics. This research is the outcome of the course project of IB335 and the Independent Study taught by Dr. Downie. The authors also thank Dr. Nico M. Franz and Ms. Ly Dinh for support and kind feedback on this research.

References

1. Bowker, G. C., & Star, S. L. (2000). Sorting things out: Classification and its consequences. MIT press. 2. Hodge, G. (2000). Systems of Knowledge Organization for Digital Libraries: Beyond Tra- ditional Authority Files. Digital Library Federation, Council on Library and Information Resources, 1755 Massachusetts Ave., NW, Suite 500, Washington, DC 20036. 11

3. Franz, N. M., Peet, R. K., & Weakley, A. S. (2008). On the use of taxonomic concepts in support of biodiversity research and taxonomy. Systematics Association Special Volume, 76, 63. 4. Cheng, Y.-Y., Franz, N., Scheider, J., Yu, S., Rodenhasen, T., Ludäscher, B. (2017). Agree- ing to disagree: reconciling conficting taxonomic views using a logic-based approach. In Proceedings of the ASIS&T 2017 Annual Meeting, Washington, D.C., USA, October 27- November 1st. IDEALS: http://hdl.handle.net/2142/97907 5. Cronquist, A. (1981). An integrated system of classification of flowering plants. Columbia University Press. 6. Byng, J. W., Chase, M. W., Christenhusz, M. J., Fay, M. F., Judd, W. S., Mabberley, D. J., ... & Briggs, B. (2016). An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG IV. Botanical Journal of the Linnean Society, 181(1), 1-20. 7. Downie, S. R. (2018). Lecture on Historical Systematics. Personal Collection of S. R Downie, University of Illinois at Urbana-Champaign, Champaign IL. 8. Judd, W. S., Campbell, C. S., Kellogg, E. A., & Stevens, P. F. (2016). Plant systematics. A phylogenetic approach. Sinauer Associates, Sunderland, Mass., USA, 464, 3-4. 3rd edition 9. Vane-Wright, R.I. 2013. Taxonomy, Methods of. In Levin S.A. (ed.), Encyclopedia of Bio- diversity (2nd edn) 7: 97–111. Waltham, MA: Academic Press. 10. Christenhusz, M. J., & Byng, J. W. (2016). The number of known plants species in the world and its annual increase. , 261(3), 201-217. 11. Franz, N. M., Chen, M., Kianmajd, P., Yu, S., Bowers, S., Weakley, A. S., & Ludäscher, B. (2016). Names are not good enough: Reasoning over taxonomic change in the complex 1. Semantic Web, 7(6), 645-667. 12. Takhtadzhian, A. L. (1997). Diversity and classification of flowering plants. Columbia Uni- versity Press.