From Textual Checklist to Information Systems


Corresponding author: Stefano Martellos, Dept. of Life Sciences, University of Trieste, Via L. Giorgieri 10, I-34100 Trieste; e-mail: [email protected]; tel: +39 (0)40 558 3889; fax: +39 (0)40 57 88 55

From textual checklist to information systems: the case study of ITALIC

Stefano Martellos

Abstract

Keywords biodiversity informatics, checklists, database, lichens

Preferred reviewers: de Felici, Holetschek, Guntsch, Attorre; non-preferred: Dave Roberts

Introduction

Checklists are a fundamental tool for accessing the information produced during centuries of biological research. They summarise part of the biological diversity of a given area, and provide the basis for specimen revision, critical re-appraisal of poorly known taxa, and further exploration of under-investigated areas. For this reason, they are an endless work in progress, and can be continuously updated with new information. Checklists are structured in different formats, from lists of names to detailed reports on the distribution and ecology of the listed taxa, and are normally published as printed books or as special issues of scientific journals. As a consequence, they cannot be updated without printing a new edition, or publishing notes and/or updates. Printed checklists lack some of the most interesting features of online databases, which 1) are easily accessible through the web, 2) are updated by a continuous flow of new data, and 3) organise data in an effective way, e.g. by returning complex elaborations rather than lists of taxa only.

The possibility of converting textual checklists into digital formats has been explored since the beginning of the internet age. One approach consists in publishing on the Web the original files or, when the original files do not exist or are missing, a digitised version of the printed book (two interesting examples: Wetter & al., 2001; Meades & al., 2000). This approach is also used effectively in important international repositories, e.g. the BHL (Biodiversity Heritage Library, http://www.biodiversitylibrary.org/), which are fundamental in improving access to scientific literature. However, this approach cannot raise the performance of a digital checklist to the level of a database, which requires the conversion of the text into structured data.
Normally, checklists have a more or less regular structure, in which taxa and their data (distribution in the area, ecology, etc.) are organised in “textual records”. These records can be delimited by symbols, strings or formatting marks, such as carriage returns, which are normally present in texts. Further delimiters can be found inside the “records”, and can be used to atomise them into “fields”. This atomisation process can produce structured data files, which can be used in different scenarios: 1) exported to existing facilities, e.g. the GBIF (Global Biodiversity Information Facility, http://www.gbif.org); 2) used in collaborative virtual research environments (e.g. the Scratchpads, http://scratchpads.eu/); 3) organised in information systems developed ad hoc. An interesting example of the first scenario is given by Remsen & al. (2012), who converted a textual checklist of ca. 4100 vascular plants into structured data following the Darwin Core Archive (DwC-A) standard, which was recently developed by the GBIF and the Biodiversity Information Standards (TDWG). The data were then published online through the GBIF. In this paper, the whole process of conversion of the Checklist of Italian Lichens (Nimis, 1993) into an information system (Nimis & Martellos, 2002) is discussed.

The Checklist of Italian Lichens

The checklist (Nimis, 1993) was originally written with Microsoft Word for DOS 5.0 on a personal computer running the DOS operating system, then converted into a Microsoft Word for Windows format. It is organised in sections, each containing the information on a taxon. A section (fig. 1) is divided into several paragraphs, each devoted to different information:

1. taxon name, followed by the author(s);
2. reference to the first paper reporting the taxon and, when present, its basionym, followed by the reference to its publication;
3. synonyms, listed alphabetically;
4. distribution in the country, divided into three areas (Northern Italy, N; Central Italy, C; Southern Italy, S). For each area there is a further subdivision into administrative regions (Table 1 – division of the three districts into administrative regions). For each administrative region, all references to scientific papers reporting the presence of the taxon are listed;
5. ecology, expressed as a sequence of ecological indicator values (Table 2 – ecological indicators and their values), and a note providing further comments on the presence and distribution of the taxon in the country.

+Hypocenomyce scalaris (Ach.) M.Choisy
Bull. Mens. Soc. Linn. Lyon, 22: 103, 1953 - Lichen scalaris [Lilj. ex] Ach., Utkast. Sv. Fl.: 422, 1792.
Syn.: Biatora ostreata (Hoffm.) Th.Fr., Lecidea ostreata (Hoffm.) Schaer., Lecidea scalaris (Ach.) Ach., Psora ostreata Hoffm., Psora scalaris (Ach.) Hook.
N - VG, Frl, Ven (Nascimbene & Caniglia 1997, 2000b, 2002c, 2003c, Caniglia & al. 1999, Nascimbene & al. 2006e), TAA (Lecid. Exs. 262: Hertel 1992b, Nascimbene & Caniglia 2000b, 2002c, Caniglia & al. 2002, Nascimbene 2005b, 2006b, Nascimbene & al. 2005, 2006, 2006e), Lomb (Alessio & al. 1995, Valcuvia & al. 2003, Nascimbene & al. 2006e), Piem (Caniglia & al. 1992, Isocrono & al. 2004, 2007), VA (Piervittori & Isocrono 1999, Valcuvia & al. 2000b), Emil (Nimis & al. 1996, Tretiach & al. 2008). C - Tosc (Benesperi 2007), Umbr (Ravera 1998, Ravera & al. 2006), Marc (Nimis & Tretiach 1999), Laz (Massari & Ravera 2003), Abr (Nimis & Tretiach 1999), Sar (Zedda 1995, Zedda & Sipman 2001). S - Bas (Potenza 2006), Cal (Puntillo 1996), Si.
Sq/ Ch/ A.s/ Epiph-Lign/ 1-2, 3-5, 3-4, 1/ Alt: 2-4/ A: a, A1: vc, B: a, C: rr, D: vr, E: a, F: vr, G: a, H: a/ PF: 1-2/ Note: a temperate to boreal-montane, circumpolar lichen, found on acid bark, esp. of conifers, and on lignum, incl. charred wood, much more common in the north than in the mountains of the south.

The first version of the checklist was printed in 1993, and the original file was continuously updated by the author. In 1997 it was decided to try to convert the text file into the first version of ITALIC, the Information System on Italian Lichens. The process involved the conversion of the text into structured data, their storage in an Oracle 8 database, and the development of several query interfaces (Nimis & Martellos, 2002). This first version was then improved over the years, by adding new functions and modernising its layout. A new version of the whole information system is expected for spring 2013. The second version of the checklist (Nimis & Martellos, 2003) originates from the information system, whose structured data were converted into a textual format and published on paper.

Conversion

First step: looking for separators among records and blocks of information

The checklist is structured in chapters, delimited by two carriage returns, each beginning with the name of a taxon preceded by a “+” character. Each chapter is divided into sections separated by a single carriage return, and the section devoted to synonyms always starts with the string “Syn.: “. The conversion begins by dividing the text into records (one for each taxon), and each record into five fields. The process, for which Microsoft Word for Windows was used, consists in replacing all the carriage returns (^p) with a symbol which is not present in the text, in this case “@”. The result is a text without carriage returns. Each double “@” preceding a “+” symbol is then replaced with a single carriage return (^p). At the end, the text is divided into paragraphs separated by carriage returns, one for each taxon, and each paragraph is divided into five sections by the “@” symbol.

Hypocenomyce scalaris (Ach.) M.Choisy@Bull. Mens. Soc. Linn. Lyon, 22: 103, 1953 - Lichen scalaris [Lilj. ex] Ach., Utkast. Sv. Fl.: 422, 1792.@Syn.: Biatora ostreata (Hoffm.) Th.Fr., Lecidea ostreata (Hoffm.) Schaer., Lecidea scalaris (Ach.) Ach., Psora ostreata Hoffm., Psora scalaris (Ach.) Hook.@N - VG, Frl, Ven (Nascimbene & Caniglia 1997, 2000b, 2002c, 2003c, Caniglia & al. 1999, Nascimbene & al. 2006e), TAA (Lecid. Exs. 262: Hertel 1992b, Nascimbene & Caniglia 2000b, 2002c, Caniglia & al. 2002, Nascimbene 2005b, 2006b, Nascimbene & al. 2005, 2006, 2006e), Lomb (Alessio & al. 1995, Valcuvia & al. 2003, Nascimbene & al. 2006e), Piem (Caniglia & al. 1992, Isocrono & al. 2004, 2007), VA (Piervittori & Isocrono 1999, Valcuvia & al. 2000b), Emil (Nimis & al. 1996, Tretiach & al. 2008). C - Tosc (Benesperi 2007), Umbr (Ravera 1998, Ravera & al. 2006), Marc (Nimis & Tretiach 1999), Laz (Massari & Ravera 2003), Abr (Nimis & Tretiach 1999), Sar (Zedda 1995, Zedda & Sipman 2001). S - Bas (Potenza 2006), Cal (Puntillo 1996), Si.@Sq/ Ch/ A.s/ Epiph-Lign/ 1-2, 3-5, 3-4, 1/ Alt: 2-4/ A: a, A1: vc, B: a, C: rr, D: vr, E: a, F: vr, G: a, H: a/ PF: 1-2/ Note: a temperate to boreal-montane, circumpolar lichen, found on acid bark, esp. of conifers, and on lignum, incl. charred wood, much more common in the north than in the mountains of the south.
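The same search-and-replace sequence can be reproduced outside a word processor. The following is a minimal sketch in Python, not the procedure actually used for ITALIC (which relied on Microsoft Word); the function name and the two short sample records are placeholders invented for illustration, and the text is assumed to be available as plain text with one paragraph per line.

```python
SEPARATOR = "@"  # assumed not to occur anywhere in the checklist text


def split_records(text: str) -> list[list[str]]:
    """Split a checklist text into records (taxa), and each record into fields."""
    # Normalise line endings so that every paragraph mark is a single "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if SEPARATOR in text:
        raise ValueError("the separator already occurs in the text")
    # Replace every carriage return with the separator.
    flat = text.strip().replace("\n", SEPARATOR)
    # A double separator followed by "+" marks the start of a new taxon.
    flat = flat.replace(SEPARATOR * 2 + "+", "\n")
    # The very first record still carries its leading "+".
    flat = flat.lstrip("+")
    return [line.split(SEPARATOR) for line in flat.split("\n")]


# Two invented placeholder records with the five-paragraph layout described above.
sample = (
    "+Taxon one Auth.\nRef. A, 1900.\nSyn.: Name a, Name b\nN - VG.\nSq/ Ch/ Note: x\n"
    "\n"
    "+Taxon two Auth.\nRef. B, 1950.\nSyn.: Name c\nS - Cal.\nCr/ Ch/ Note: y\n"
)
records = split_records(sample)
```

Checking that every record has exactly five fields is a cheap first quality control: a record with more or fewer fields points to a missing or extra carriage return in the source text.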

Second step: from Microsoft Word to Microsoft Access

The second step converts the text file into a Microsoft Access data table. The process requires the conversion of the Word file into a text (.txt) file. The file can then be imported into an Access table with five columns, by using the symbol “@” as column separator. This data table can already be used in an information system to perform simple queries. However, while developing ITALIC, it was decided to continue the conversion process, in order to obtain a further atomisation of the text, separating taxonomic and distributional information from synonyms and from ecological information, and splitting the data into three different tables. During the process, three copies of the original table were made. The first, named “taxonomy”, hosted the first two columns and the fourth (name, basionym and distribution). The second, named “synonyms”, hosted the first and the third column (name and synonyms). The third, named “ecology”, hosted the first and the fifth column (name and ecology). Each table then underwent further elaboration separately.
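The split into three tables amounts to selecting columns from the five-field records. A minimal sketch follows, with Python lists standing in for the Access tables; the names `Tables` and `split_tables` and the sample row are assumptions made for illustration only.

```python
from typing import NamedTuple


class Tables(NamedTuple):
    taxonomy: list[tuple[str, str, str]]  # name, reference/basionym, distribution
    synonyms: list[tuple[str, str]]       # name, synonyms
    ecology: list[tuple[str, str]]        # name, ecology


def split_tables(rows: list[list[str]]) -> Tables:
    """Distribute five-field records over the three tables described in the text."""
    taxonomy, synonyms, ecology = [], [], []
    for row in rows:
        if len(row) != 5:
            raise ValueError(f"expected 5 fields, got {len(row)}: {row[:1]}")
        name, reference, syn, distribution, eco = row
        taxonomy.append((name, reference, distribution))  # columns 1, 2 and 4
        synonyms.append((name, syn))                      # columns 1 and 3
        ecology.append((name, eco))                       # columns 1 and 5
    return Tables(taxonomy, synonyms, ecology)


# Invented placeholder row.
rows = [["Taxon one Auth.", "Ref. A, 1900.", "Syn.: Name a, Name b",
         "N - VG.", "Sq/ Ch/ Note: x"]]
tables = split_tables(rows)
```

The taxon name is repeated in all three tables, so that it can act as the key linking them.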

Third step: taxonomy and distribution

The table “taxonomy” is made of three columns, and does not require any further elaboration. The distribution, contained as text in a single column, could have been split into several different columns, one for each administrative region. However, databases can easily perform complex queries on textual columns, and maintaining all the distributional information in a single field is not a drawback for the functionality of the information system.
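As an illustration of querying a textual distribution column, the sketch below uses SQLite through Python's sqlite3 module as a stand-in for the Oracle database of ITALIC. The column names, the helper `taxa_in_region` and the two sample rows are assumptions, not taken from the actual system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxonomy (name TEXT, reference TEXT, distribution TEXT)")
conn.executemany(
    "INSERT INTO taxonomy VALUES (?, ?, ?)",
    [
        # Invented placeholder rows following the layout of the distribution field.
        ("Taxon one Auth.", "Ref. A, 1900.",
         "N - VG, Frl, Ven (Author 1997). S - Cal (Author 1996), Si."),
        ("Taxon two Auth.", "Ref. B, 1950.", "C - Tosc (Author 2007)."),
    ],
)


def taxa_in_region(connection: sqlite3.Connection, region_code: str) -> list[str]:
    """Return the taxa whose distribution text mentions the given region code."""
    cursor = connection.execute(
        "SELECT name FROM taxonomy WHERE distribution LIKE ? ORDER BY name",
        (f"%{region_code}%",),
    )
    return [name for (name,) in cursor]


calabria = taxa_in_region(conn, "Cal")
```

A plain substring match of this kind can also hit author names that happen to contain the region code, so a real query would need a stricter pattern.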

Fourth step: atomising ecology

The third table, “ecology”, required complex elaborations. At the beginning, it is made of two columns: the first contains the names of the taxa, the second a long and complex string, composed of text and numerical data (the ecological indicator values) and of the commonness-rarity status of the taxa in 9 bioclimatic belts (Nimis & Martellos, 2003 – Italic, the info system etc.). This column can be divided into several parts by using two separators: the word “Note: “, which separates the author's notes from the other information, and the slash (“/”) symbol, which is used to separate: 1) growth form, 2) type of photobiont, 3) reproductive strategy, 4) substrata, 5) ecological indicator values, 6) altitudinal range, 7) commonness-rarity in the bioclimatic districts, and 8) poleophoby. In practice, “Note: “ and “/ ” were replaced with the symbol “@”. At the end, the second column of the table “ecology” contained strings like:

Sq@Ch@A.s@Epiph-Lign@1-2, 3-5, 3-4, 1@Alt: 2-4@A: a, A1: vc, B: a, C: rr, D: vr, E: a, F: vr, G: a, H: a@PF: 1-2@a temperate to boreal-montane, circumpolar lichen, found on acid bark, esp. of conifers, and on lignum, incl. charred wood, much more common in the north than in the mountains of the south.

The table was then exported into a text file, by using “@” as separator, and the resulting text file was re-imported into Access, again by using “@” as separator. The result is a table with ten columns.

Hypocenomyce scalaris (Ach.) M.Choisy | Sq | Ch | A.s | Epiph-Lign | 1-2, 3-5, 3-4, 1 | Alt: 2-4 | A: a, A1: vc, B: a, C: rr, D: vr, E: a, F: vr, G: a, H: a | PF: 1-2 | a temperate to boreal-montane, circumpolar lichen, found on acid bark, esp. of conifers, and on lignum, incl. charred wood, much more common in the north than in the mountains of the south.
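The split of the ecology string into the ten columns shown above can be sketched as follows. This is an illustrative alternative to the export and re-import through Access, not the original procedure; the function name is an assumption. The note is separated first, so that any slash occurring inside the free-text note is left untouched.

```python
def split_ecology(name: str, ecology: str) -> list[str]:
    """Split the ecology string into its slash-delimited parts plus the note."""
    # Everything after "Note: " is free text written by the author.
    coded, _, note = ecology.partition("Note: ")
    parts = [part.strip() for part in coded.split("/")]
    # The coded part ends with "/ " before the note, leaving an empty last item.
    if parts and parts[-1] == "":
        parts.pop()
    return [name, *parts, note.strip()]


ecology = (
    "Sq/ Ch/ A.s/ Epiph-Lign/ 1-2, 3-5, 3-4, 1/ Alt: 2-4/ "
    "A: a, A1: vc, B: a, C: rr, D: vr, E: a, F: vr, G: a, H: a/ PF: 1-2/ "
    "Note: a temperate to boreal-montane, circumpolar lichen, found on acid bark, "
    "esp. of conifers, and on lignum, incl. charred wood, much more common in the "
    "north than in the mountains of the south."
)
row = split_ecology("Hypocenomyce scalaris (Ach.) M.Choisy", ecology)
```

A row that does not come out with ten columns signals a record in which a slash or the “Note: “ string is missing.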

Further refinement is made by removing the “Alt: “ and “PF: “ strings from the seventh and ninth columns, and the codes defining the bioclimatic districts (A:, A1:, B:, etc.) from the eighth. The process continues with the conversion of some fields (ecological indicator values, altitude, commonness-rarity status and poleophoby) from textual to numerical, so that the Information System can perform complex elaborations on this information. First, all the ecological indicator values and the commonness-rarity status in the 9 bioclimatic districts are separated into different columns, by replacing the commas in columns six and eight with “@”, and exporting and re-importing the table by using “@” as separator. The result is a table with 21 columns.

Hypocenomyce scalaris (Ach.) M.Choisy | Sq | Ch | A.s | Epiph-Lign | 1-2 | 3-5 | 3-4 | 1 | 2-4 | a | vc | a | rr | vr | a | vr | a | a | 1-2 | a temperate to boreal-montane, circumpolar lichen, found on acid bark, esp. of conifers, and on lignum, incl. charred wood, much more common in the north than in the mountains of the south.
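The refinement from ten to 21 columns can be sketched in a single function. The variable names follow the field list given for the ecology string; the function name is an assumption, and `str.removeprefix` requires Python 3.9 or later.

```python
def refine(row: list[str]) -> list[str]:
    """Turn a ten-column ecology row into the 21-column layout."""
    (name, growth_form, photobiont, reproduction, substrata,
     indicators, altitude, status, poleophoby, note) = row
    # "1-2, 3-5, 3-4, 1" -> ["1-2", "3-5", "3-4", "1"]
    indicator_values = [value.strip() for value in indicators.split(",")]
    # "A: a, A1: vc, ..." -> ["a", "vc", ...], dropping the district codes.
    status_values = [item.split(":", 1)[1].strip() for item in status.split(",")]
    return [
        name, growth_form, photobiont, reproduction, substrata,
        *indicator_values,
        altitude.removeprefix("Alt:").strip(),
        *status_values,
        poleophoby.removeprefix("PF:").strip(),
        note,
    ]


ten_columns = [
    "Hypocenomyce scalaris (Ach.) M.Choisy", "Sq", "Ch", "A.s", "Epiph-Lign",
    "1-2, 3-5, 3-4, 1", "Alt: 2-4",
    "A: a, A1: vc, B: a, C: rr, D: vr, E: a, F: vr, G: a, H: a",
    "PF: 1-2", "a temperate to boreal-montane, circumpolar lichen",
]
refined = refine(ten_columns)
```

Only columns six and eight are split at commas, so the commas inside the note do not produce spurious columns.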

Single values (e.g. 1) of ecological indicators were converted into double values (e.g. 1-1), so that all the ecological indicator values, altitudes and poleophoby scores were expressed as ranges. Then, the “-” symbols were replaced by “@”. This operation was limited to columns 6-10 and 20, because the symbol “-” could be present in other columns (e.g. in the fifth column, “Epiph-Lign”, and in the 21st column). The table was then exported and re-imported again. The result was a table with 27 columns.

Hypocenomyce scalaris (Ach.) M.Choisy | Sq | Ch | A.s | Epiph-Lign | 1 | 2 | 3 | 5 | 3 | 4 | 1 | 1 | 2 | 4 | a | vc | a | rr | vr | a | vr | a | a | 1 | 2 | a temperate to boreal-montane, circumpolar lichen, found on acid bark, esp. of conifers, and on lignum, incl. charred wood, much more common in the north than in the mountains of the south.
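The expansion of ranges into two numeric columns can be sketched as below. The positions of the range columns are those given in the text (columns 6-10 and 20); the function name is an assumption, and the conversion to integers is done here in the same pass rather than as a separate step.

```python
# Zero-based positions of columns 6-10 and 20 of the 21-column table.
RANGE_COLUMNS = (5, 6, 7, 8, 9, 19)


def expand_ranges(row: list[str]) -> list[object]:
    """Replace each range column with two numeric columns (lower, upper)."""
    expanded: list[object] = []
    for position, value in enumerate(row):
        if position in RANGE_COLUMNS:
            low, dash, high = value.partition("-")
            # A single value such as "1" is treated as the range 1-1.
            expanded.extend([int(low), int(high) if dash else int(low)])
        else:
            # Other columns may contain "-" (e.g. "Epiph-Lign") and are kept as text.
            expanded.append(value)
    return expanded


row_21 = [
    "Hypocenomyce scalaris (Ach.) M.Choisy", "Sq", "Ch", "A.s", "Epiph-Lign",
    "1-2", "3-5", "3-4", "1", "2-4",
    "a", "vc", "a", "rr", "vr", "a", "vr", "a", "a",
    "1-2", "a temperate to boreal-montane, circumpolar lichen",
]
row_27 = expand_ranges(row_21)
```

Restricting the split to named positions plays the same role as limiting the search and replace to selected columns in Access.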

Each ecological indicator value was now represented by two columns, a minimum and a maximum. The commonness-rarity status was stored in nine columns, but expressed by textual codes, which needed to be converted into numbers, ranging from 0 (absent, a) to 8 (extremely common, ec). This was done by a search and replace process, column by column. The result is shown in fig. XX

Hypocenomyce scalaris (Ach.) M.Choisy | Sq | Ch | A.s | Epiph-Lign | 1 | 2 | 3 | 5 | 3 | 4 | 1 | 1 | 2 | 4 | 0 | 7 | 0 | 4 | 2 | 0 | 2 | 0 | 0 | 1 | 2 | a temperate to boreal-montane, circumpolar lichen, found on acid bark, esp. of conifers, and on lignum, incl. charred wood, much more common in the north than in the mountains of the south.
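The conversion of commonness-rarity codes into numbers is a lookup. In the sketch below, only the codes that can be read from this paper are mapped: “a” = 0 and “ec” = 8 are stated in the text, while “vr” = 2, “rr” = 4 and “vc” = 7 are inferred by comparing the example row before and after conversion. The remaining codes of the scale are not given here and are deliberately left out, so the function fails loudly when it meets one.

```python
# Partial mapping: only codes recoverable from the text and the example row.
KNOWN_CODES = {"a": 0, "vr": 2, "rr": 4, "vc": 7, "ec": 8}


def encode_status(codes: list[str]) -> list[int]:
    """Convert commonness-rarity codes into their numeric values."""
    unknown = sorted(set(codes) - set(KNOWN_CODES))
    if unknown:
        # Codes not documented in the text must be added before use.
        raise KeyError(f"no numeric value known for codes: {unknown}")
    return [KNOWN_CODES[code] for code in codes]


encoded = encode_status(["a", "vc", "a", "rr", "vr", "a", "vr", "a", "a"])
```

Refusing unknown codes, instead of passing them through, makes misspelt codes in the source text visible during the conversion.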

The three tables - “taxonomy”, “ecology” and “synonyms” - were then imported into an Oracle 10g database.

Fifth step: the Information System

The Information System (available at the address http://dbiodbs.units.it/) was developed on the data stored in the three tables, and was written in the PL/SQL language. It can be queried through three query interfaces (Nimis & Martellos, 2002):

1. Taxonomic interface, which retrieves all the information on a taxon, extracting data from all the tables and from all the related archives (images, maps, etc.) which have been added to the Information System.
2. Floristic interface, which builds “virtual” relevés of lichen vegetation by combining ecological indicator values and other data, hence reconstructing certain environmental conditions, and returning lists of taxa which potentially occur under those conditions (Nimis & Martellos, 2001).
3. Statistic interface, which returns matrices of data for two selected parameters. This interface permits complex elaborations, such as returning the matrix of epiphytic lichens occurring in shady situations in the different bioclimatic districts of the country in relation to eutrophication. The results of this interface were used, as an example, by Nimis & Martellos (2003).

Sixth step: atomisation of synonyms

The table “synonyms” underwent a further elaboration after the Information System was completed: each synonym was separated into a different record. This transformation, while not fundamental for the query systems, was performed so that a single synonym, instead of a list, can easily be returned when the system is queried for a string in taxon names. Each row of the table “synonyms” was extracted and split at the commas which separate the synonyms. The process created as many records as there were synonyms, and inserted them into a new table, “synonyms2”. At the end, the table “synonyms2” is used in the information system and the original table “synonyms” is dropped. A serious problem, in this case, can arise from the use of the comma as a separator: when a taxon name has several authors, they are also separated by commas. This is rare in lichens, but common e.g. in vascular plants. For this reason, the process required a thorough manual review of the results, and might not be easy to perform on other checklists.

Discussion

Nowadays, one of the most challenging tasks in biodiversity informatics is exposing into the digital domain the literature produced in centuries of scientific research. One successful approach to the problem is publishing scanned versions of original papers and books in large online repositories, e.g. the Biodiversity Heritage Library (citation to be added). However, this process can sometimes go further, as in the case of checklists, strongly enhancing the use of the original data. [While this process can be difficult starting from paper-printed texts, when the original digital files are available, ...]

Converting a text into a structured data format can be a difficult process. Even in consistently structured texts, where all paragraphs have the same organisation, some differences can be present, hence creating problems during the process.
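The comma problem met while atomising synonyms is a concrete case of such irregularities. The sketch below splits a synonym list at commas and flags the pieces that do not look like the start of a new name, so that they can be reviewed by hand. The function name, the regular expression and the second sample (an invented name with two authors in brackets) are assumptions for illustration; the heuristic is deliberately crude and does not replace the manual review.

```python
import re

# Heuristic: a new name starts with a capitalised genus and a lower-case epithet.
BINOMIAL = re.compile(r"^[A-Z][a-z-]+ [a-z-]+")


def split_synonyms(name: str, synonyms: str) -> list[tuple[str, str, bool]]:
    """Return (accepted name, synonym, needs_review) for each comma-separated piece."""
    body = synonyms.removeprefix("Syn.:").strip()
    rows = []
    for piece in body.split(","):
        piece = piece.strip()
        if piece:
            # A piece that does not start like a binomial is probably the
            # continuation of an author string split at its internal comma.
            rows.append((name, piece, BINOMIAL.match(piece) is None))
    return rows


clean = split_synonyms(
    "Hypocenomyce scalaris (Ach.) M.Choisy",
    "Syn.: Biatora ostreata (Hoffm.) Th.Fr., Lecidea ostreata (Hoffm.) Schaer., "
    "Lecidea scalaris (Ach.) Ach., Psora ostreata Hoffm., Psora scalaris (Ach.) Hook.",
)
# Invented example of a name whose author string contains a comma.
suspect = split_synonyms("Genus beta Auth.", "Syn.: Genus alpha (Smith, Jones) Brown")
```

In the second call the name is wrongly cut into two pieces, and the flag on the second piece is what would bring it to the attention of a reviewer.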
Separators can be missing, or some information can be absent, thus creating gaps in the data structure. For this reason, each conversion should be followed by a careful quality control, to verify the structure of the data. However, the conversion of a checklist into a structured data format is often feasible, and can strongly enhance the usability of the information it contains. Once structured, data can be exported in different standards (e.g., Darwin Core Archive), thus contributing to different projects or repositories of biodiversity information. Furthermore, structured data can be used in complex information systems, hence returning results far more complex than lists of taxa. In the example provided here, structured data deriving from a checklist are used to produce virtual relevés of lichen vegetation, predictive distributional maps, data matrices depicting the distribution of lichens in different ecological scenarios, etc. These information systems can be published on the web as stand-alone resources, or be integrated into national (Martellos & al., 2011) and/or international networks of biodiversity data. [Note: among the resources into which the data can be integrated, also cite BioCASE and ViBRANT?]

Acknowledgments

This research was funded by the Italian Ministry of Environment (MATTM) in the framework of the National project “Sistema Ambiente 2010” for the development of the Italian National Biodiversity Node (NNB). The author is grateful to Prof. P.L. Nimis for his useful comments on the paper.

References

Martellos S, Attorre F, De Felici S, Cesaroni D, Sbordoni V, Blasi C, Nimis PL. 2011. Plant sciences and the Italian National Biodiversity Network. Plant Biosystems 145(4): 758-761. DOI: 10.1080/11263504.2011.620342

Meades SJ, Stuart G, Brouillet L. 2000. Annotated Checklist of the Vascular Plants of Newfoundland and Labrador. [cited 2012 May 28]. Available from: http://www.digitalnaturalhistory.com/meades.htm

Nimis PL. 1993. The Lichens of Italy. An Annotated Catalogue. Mus. Reg. Sci. Nat. Torino, Monogr. XII, 897 pp.

Nimis PL, Martellos S. 2001. Testing the predictivity of ecological indicator values. A comparison of real and virtual releves of lichen vegetation. Plant Ecology 157: 165-172.

Nimis PL, Martellos S. 2002. ITALIC, a database on Italian Lichens. Bibliotheca Lichenologica 82: 271-282.

Nimis PL, Martellos S. 2003. On the ecology of sorediate lichens in Italy. Bibliotheca Lichenologica 86.

Nimis PL, Martellos S. 2003. A second checklist of the lichens of Italy, with a thesaurus of synonyms. Mus. Reg. Sci. Nat. Saint-Pierre, Valle d’Aosta, Monogr. 4, 192 pp.

Remsen D, Knapp S, Georgiev T, Stoev P, Penev L. 2012. From text to structured data: Converting a word-processed floristic checklist into Darwin Core Archive format. PhytoKeys 9: 1–13. DOI: 10.3897/phytokeys.9.2770

Wetter MA, Cochrane TS, Black MR. 2001. Checklist of the vascular plants of Wisconsin. Watermolen DJ, editor. Technical Bulletin No. 192. Wisconsin Department of Natural Resources. 258 pp. [cited 2012 May 28]. Available from: http://digital.library.wisc.edu/1711.dl/EcoNatRes.DNRBull192

Tables

Legend to figures

Figures

Addresses of the author

Stefano Martellos, Dept. of Life Sciences, University of Trieste, Via L. Giorgieri 10, I-34100 Trieste, Italy
